Table of Contents
- cs.CL [Total: 27]
- cs.CV [Total: 73]
- eess.IV [Total: 4]
- cs.AI [Total: 3]
- cs.IR [Total: 1]
- cs.GR [Total: 1]
- cs.LG [Total: 6]
- cs.RO [Total: 2]
- cs.MM [Total: 2]
- cs.NE [Total: 1]
- cs.CR [Total: 1]
cs.CL
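[1] SynDocDis: A Metadata-Driven Framework for Generating Synthetic Physician Discussions Using Large Language Models cs.CL
Beny Rubinstein, Sergio Matos
TL;DR: This paper presents SynDocDis, a framework that uses large language models and de-identified case metadata to generate high-quality, privacy-preserving physician-to-physician clinical discussions, addressing the difficulty of obtaining real physician discussion data.
Details
Motivation: Privacy and ethical restrictions make real physician-to-physician case discussions hard to obtain; the work aims to provide a compliant synthetic data source for medical AI research.
Result: Across nine oncology and hepatology scenarios evaluated by five practicing physicians, the dialogues scored highly on communication effectiveness (mean 4.4/5) and medical content quality (mean 4.1/5), reached a 91% clinical relevance rating, and showed substantial inter-rater reliability (kappa = 0.70).
Insight: The novelty lies in combining structured prompting with de-identified case metadata to generate synthetic dialogues for the largely unaddressed setting of physician-to-physician communication, preserving privacy while maintaining clinical accuracy and practical usefulness.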
Abstract: Physician-physician discussions of patient cases represent a rich source of clinical knowledge and reasoning that could feed AI agents to enrich and even participate in subsequent interactions. However, privacy regulations and ethical considerations severely restrict access to such data. While synthetic data generation using Large Language Models offers a promising alternative, existing approaches primarily focus on patient-physician interactions or structured medical records, leaving a significant gap in physician-to-physician communication synthesis. We present SynDocDis, a novel framework that combines structured prompting techniques with privacy-preserving de-identified case metadata to generate clinically accurate physician-to-physician dialogues. Evaluation by five practicing physicians in nine oncology and hepatology scenarios demonstrated exceptional communication effectiveness (mean 4.4/5) and strong medical content quality (mean 4.1/5), with substantial interrater reliability (kappa = 0.70, 95% CI: 0.67-0.73). The framework achieved 91% clinical relevance ratings while maintaining doctors’ and patients’ privacy. These results place SynDocDis as a promising framework for advancing medical AI research ethically and responsibly through privacy-compliant synthetic physician dialogue generation with direct applications in medical education and clinical decision support.
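[2] Medical Reasoning with Large Language Models: A Survey and MR-Bench cs.CL | cs.AI
Xiaohan Ren, Chenxiao Fan, Wenyin Ma, Hongliang He, Chongming Gao
TL;DR: This paper surveys medical reasoning with large language models. It proposes a conceptual framework for medical reasoning grounded in cognitive theory, organizes existing methods into seven technical routes, and introduces MR-Bench, a benchmark derived from real hospital data, for unified evaluation, exposing a performance gap between exam-style tasks and real clinical decision tasks.
Details
Motivation: Clinical deployment of LLMs is safety-critical, context-dependent, and subject to evolving evidence. The study aims to assess medical reasoning systematically, rather than factual recall alone, and to close the gap between current model performance and the demands of real-world clinical reasoning.
Result: Representative medical reasoning models are evaluated across benchmarks under a unified experimental setting. On MR-Bench, accuracy on authentic clinical decision tasks falls markedly short of exam-level performance.
Insight: Contributions include a conceptual framework based on cognitive theories of clinical reasoning (an iterative process of abduction, deduction, and induction), a systematic taxonomy of seven technical routes, and MR-Bench, a benchmark derived from real clinical data that offers a more realistic standardized evaluation tool.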
Abstract: Large language models (LLMs) have achieved strong performance on medical exam-style tasks, motivating growing interest in their deployment in real-world clinical settings. However, clinical decision-making is inherently safety-critical, context-dependent, and conducted under evolving evidence. In such situations, reliable LLM performance depends not on factual recall alone, but on robust medical reasoning. In this work, we present a comprehensive review of medical reasoning with LLMs. Grounded in cognitive theories of clinical reasoning, we conceptualize medical reasoning as an iterative process of abduction, deduction, and induction, and organize existing methods into seven major technical routes spanning training-based and training-free approaches. We further conduct a unified cross-benchmark evaluation of representative medical reasoning models under a consistent experimental setting, enabling a more systematic and comparable assessment of the empirical impact of existing methods. To better assess clinically grounded reasoning, we introduce MR-Bench, a benchmark derived from real-world hospital data. Evaluations on MR-Bench expose a pronounced gap between exam-level performance and accuracy on authentic clinical decision tasks. Overall, this survey provides a unified view of existing medical reasoning methods, benchmarks, and evaluation practices, and highlights key gaps between current model performance and the requirements of real-world clinical reasoning.
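[3] Neural networks for Text-to-Speech evaluation cs.CL | cs.AI | cs.SD | eess.AS
Ilya Trofimenko, David Kocharyan, Aleksandr Zaitsev, Pavel Repnikov, Mark Levin
TL;DR: This paper proposes and implements a suite of novel neural models that approximate expert judgments in relative (Side-by-Side) and absolute (Mean Opinion Score) settings, addressing the challenge of evaluating text-to-speech quality at scale. The suite includes NeuralSBS for relative assessment, plus an enhanced MOSNet and the WhisperBert ensemble for absolute assessment.
Details
Motivation: Human subjective evaluation of TTS systems (such as MOS and SBS) is expensive, slow, and prone to assessor bias; the work aims to develop efficient, scalable automated evaluation.
Result: For relative assessment, NeuralSBS reaches 73.7% accuracy on the SOMOS dataset. For absolute assessment, the best MOS models achieve an RMSE of about 0.40, significantly better than the human inter-rater RMSE baseline of 0.62.
Insight: Contributions include the HuBERT-based NeuralSBS model, an enhanced MOSNet with custom sequence-length batching, and WhisperBert, a multimodal ensemble that combines Whisper audio features with BERT text embeddings. The analysis indicates that ensemble-based stacking outperforms direct cross-attention fusion, and that dedicated metric learning frameworks are more effective than zero-shot LLM evaluators.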
Abstract: Ensuring that Text-to-Speech (TTS) systems deliver human-perceived quality at scale is a central challenge for modern speech technologies. Human subjective evaluation protocols such as Mean Opinion Score (MOS) and Side-by-Side (SBS) comparisons remain the de facto gold standards, yet they are expensive, slow, and sensitive to pervasive assessor biases. This study addresses these barriers by formulating and implementing a suite of novel neural models designed to approximate expert judgments in both relative (SBS) and absolute (MOS) settings. For relative assessment, we propose NeuralSBS, a HuBERT-backed model achieving 73.7% accuracy on the SOMOS dataset. For absolute assessment, we introduce enhancements to MOSNet using custom sequence-length batching, as well as WhisperBert, a multimodal stacking ensemble that combines Whisper audio features and BERT textual embeddings via weak learners. Our best MOS models achieve a Root Mean Square Error (RMSE) of ~0.40, significantly outperforming the human inter-rater RMSE baseline of 0.62. Furthermore, our ablation studies reveal that naively fusing text via cross-attention can degrade performance, highlighting the effectiveness of ensemble-based stacking over direct latent fusion. We additionally report negative results with SpeechLM-based architectures and zero-shot LLM evaluators (Qwen2-Audio, Gemini 2.5 flash preview), reinforcing the necessity of dedicated metric learning frameworks.
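[4] Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models cs.CL | cs.AI | cs.LG
Mousa Salah, Amgad Muneer
TL;DR: This paper systematically evaluates how sampling temperature interacts with prompting strategy (zero-shot and chain-of-thought) in extended reasoning large language models on mathematical problem solving. Zero-shot prompting performs best at moderate temperatures (T=0.4 and 0.7), chain-of-thought prompting performs best at the temperature extremes (T=0.0 and 1.0), and the benefit of extended reasoning grows markedly with temperature.
Details
Motivation: Extended reasoning models strengthen LLMs on complex problems through explicit test-time computation, but the best configuration of sampling temperature and prompting strategy remains underexplored. The paper studies their joint optimization and challenges the common default of T=0 for reasoning tasks.
Result: Experiments use Grok-4.1 on 39 mathematical problems from AMO-Bench, a challenging International Mathematical Olympiad-level benchmark. Zero-shot prompting peaks at 59% accuracy at T=0.4 and T=0.7; chain-of-thought prompting performs best at the temperature extremes. The benefit of extended reasoning rises from 6x at T=0.0 to 14.3x at T=1.0.
Insight: The paper shows a strong interaction between temperature and prompting strategy, so neither can be optimized in isolation. The core takeaway is that the two should be tuned jointly for extended reasoning models: high temperature (e.g., T=1.0) may substantially amplify the gains from more elaborate prompting strategies such as chain-of-thought, which runs against the convention of using low temperature (T=0) for deterministic reasoning tasks.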
Abstract: Extended reasoning models represent a transformative shift in Large Language Model (LLM) capabilities by enabling explicit test-time computation for complex problem solving. However, the optimal configuration of sampling temperature and prompting strategy for these systems remains largely underexplored. We systematically evaluate chain-of-thought and zero-shot prompting across four temperature settings (0.0, 0.4, 0.7, and 1.0) using Grok-4.1 with extended reasoning on 39 mathematical problems from AMO-Bench, a challenging International Mathematical Olympiad-level benchmark. We find that zero-shot prompting achieves peak performance at moderate temperatures, reaching 59% accuracy at T=0.4 and T=0.7, while chain-of-thought prompting performs best at the temperature extremes. Most notably, the benefit of extended reasoning increases from 6x at T=0.0 to 14.3x at T=1.0. These results suggest that temperature should be optimized jointly with prompting strategy, challenging the common practice of using T=0 for reasoning tasks.
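For readers unfamiliar with the temperature parameter varied in this study, the following is a minimal sketch of temperature-scaled sampling probabilities. It assumes the standard softmax-with-temperature formulation and the common convention that T=0 means greedy decoding; the abstract does not describe how Grok-4.1 implements sampling, and the function name `temperature_probs` is hypothetical.

```python
import math

def temperature_probs(logits, temperature):
    """Return sampling probabilities for `logits` at the given temperature.

    Assumption: standard softmax(logits / T); T=0 is treated as greedy decoding.
    """
    if temperature == 0.0:
        # Greedy decoding: all probability mass goes to the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# The four temperature settings evaluated in the paper, on toy logits.
for t in (0.0, 0.4, 0.7, 1.0):
    print(t, [round(p, 3) for p in temperature_probs([1.0, 2.0, 0.0], t)])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it across alternatives, which is the knob the paper varies alongside the prompting strategy.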
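[5] EXAONE 4.5 Technical Report cs.CL
Eunbi Choi, Kibong Choi, Sehyun Chun, Seokhee Hong, Junwon Hwang
TL;DR: EXAONE 4.5 is the first open-weight vision language model released by LG AI Research. It integrates a dedicated visual encoder into the EXAONE 4.0 framework, enabling native multimodal pretraining over visual and textual modalities. The model is trained on large-scale, carefully curated data that emphasizes document-centric corpora aligned with LG's strategic application domains, yielding substantial gains in document understanding and related tasks along with broad improvements in general language capabilities. It supports context lengths of up to 256K tokens for long-context reasoning and enterprise-scale use. Evaluations show that EXAONE 4.5 is competitive on general benchmarks and outperforms state-of-the-art models of similar scale in document understanding and Korean contextual reasoning.
Details
Motivation: Existing vision language models underperform in document understanding and in domain-specific uses such as Korean contextual reasoning; the work also aims to advance practical AI for industrial deployment.
Result: The model is competitive on general benchmarks and outperforms state-of-the-art models of similar scale on document understanding and Korean contextual reasoning.
Insight: Contributions include: 1) native multimodal pretraining via an integrated dedicated visual encoder; 2) targeted data design around document-centric corpora to improve domain performance; 3) context length extended to 256K tokens for long-context and enterprise-scale use; 4) an open-weight release that eases community extension and application to further domains.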
Abstract: This technical report introduces EXAONE 4.5, the first open-weight vision language model released by LG AI Research. EXAONE 4.5 is architected by integrating a dedicated visual encoder into the existing EXAONE 4.0 framework, enabling native multimodal pretraining over both visual and textual modalities. The model is trained on large-scale data with careful curation, particularly emphasizing document-centric corpora that align with LG’s strategic application domains. This targeted data design enables substantial performance gains in document understanding and related tasks, while also delivering broad improvements across general language capabilities. EXAONE 4.5 extends context length up to 256K tokens, facilitating long-context reasoning and enterprise-scale use cases. Comparative evaluations demonstrate that EXAONE 4.5 achieves competitive performance in general benchmarks while outperforming state-of-the-art models of similar scale in document understanding and Korean contextual reasoning. As part of LG’s ongoing effort toward practical industrial deployment, EXAONE 4.5 is designed to be continuously extended with additional domains and application scenarios to advance AI for a better life.
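[6] Decomposing the Delta: What Do Models Actually Learn from Preference Pairs? cs.CL | cs.AI
Chia-Hsuan Lee, Mingyang Zhou, Renkun Ni, Zelei Cheng, Sihui Dai
TL;DR: This paper studies which properties of preference data matter for improving language model reasoning under preference optimization methods such as DPO and KTO. It distinguishes generator-level delta (arising from capability differences between the models that generate chosen and rejected reasoning traces) from sample-level delta (arising from judged quality differences within an individual preference pair). Increasing generator-level delta steadily improves performance on out-of-domain reasoning tasks, while filtering data by sample-level delta enables more data-efficient training.
Details
Motivation: The goal is to identify which specific properties of preference data drive downstream reasoning gains, and so to understand how preference optimization methods align language models effectively.
Result: Experiments show that increasing generator-level delta steadily improves performance on out-of-domain reasoning tasks, and that filtering training data by sample-level delta yields more data-efficient training.
Insight: The paper proposes a twofold recipe for improving reasoning: maximize generator-level delta when constructing preference pairs, and use sample-level delta to select the most informative training examples. The novelty lies in a fine-grained decomposition and empirical study of the "delta" in preference data, offering a new perspective on efficient preference data construction.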
Abstract: Preference optimization methods such as DPO and KTO are widely used for aligning language models, yet little is understood about what properties of preference data drive downstream reasoning gains. We ask: what aspects of a preference pair improve a reasoning model’s performance on general reasoning tasks? We investigate two distinct notions of quality delta in preference data: generator-level delta, arising from the differences in capability between models that generate chosen and rejected reasoning traces, and sample-level delta, arising from differences in judged quality within an individual preference pair. To study generator-level delta, we vary the generator’s scale and model family, and to study sample-level delta, we employ an LLM-as-a-judge to rate the quality of generated traces along multiple reasoning-quality dimensions. We find that increasing generator-level delta steadily improves performance on out-of-domain reasoning tasks and filtering data by sample-level delta can enable more data-efficient training. Our results suggest a twofold recipe for improving reasoning performance through preference optimization: maximize generator-level delta when constructing preference pairs and exploit sample-level delta to select the most informative training examples.
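To make the idea of filtering by sample-level delta concrete, here is a minimal sketch. It assumes each pair already carries a single judge score per trace and that pairs are kept when the score gap meets a threshold; the abstract does not specify the scoring scale, how the multiple rating dimensions are combined, or the selection rule, so `PreferencePair`, `sample_level_delta`, `filter_by_delta`, and `min_delta` are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    chosen_score: float    # judge rating of the chosen trace (assumed scale)
    rejected_score: float  # judge rating of the rejected trace (assumed scale)

def sample_level_delta(pair: PreferencePair) -> float:
    """Gap in judged quality between the chosen and rejected traces."""
    return pair.chosen_score - pair.rejected_score

def filter_by_delta(pairs, min_delta):
    """Keep pairs whose judged quality gap is at least `min_delta` (assumed rule)."""
    return [p for p in pairs if sample_level_delta(p) >= min_delta]

pairs = [
    PreferencePair("q1", "careful proof", "wrong answer", 9.0, 3.0),
    PreferencePair("q2", "ok answer", "similar answer", 6.0, 5.0),
]
kept = filter_by_delta(pairs, min_delta=3.0)
print([p.prompt for p in kept])
```

A threshold is only one possible selection rule; ranking pairs by delta and keeping a fixed fraction would be an equally plausible reading of the abstract.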
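[7] MedConceal: A Benchmark for Clinical Hidden-Concern Reasoning Under Partial Observability cs.CL
Yikun Han, Joey Chan, Jingyuan Chen, Mengting Ai, Simo Du
TL;DR: MedConceal is a benchmark for evaluating hidden-concern reasoning in medical dialogue, comprising 300 curated cases and 600 clinician-LLM interactions. Using an interactive patient simulator, it evaluates how dialogue systems elicit and address patients' latent concerns through multi-turn dialogue under partial observability.
Details
Motivation: Most existing medical dialogue benchmarks sidestep partial observability, either exposing hidden patient state directly or reducing elicitation to information extraction. The paper addresses this gap with a more realistic test environment for clinical hidden-concern reasoning.
Result: On MedConceal, frontier models lead on different confirmation metrics, while human clinicians (N=159) remain strongest on intervention success. No single system dominates, and hidden-concern reasoning under partial observability remains a key unresolved challenge for medical dialogue systems.
Insight: The novelty is a clinician-reviewed interactive patient simulator built on an expert-developed taxonomy. It tracks whether hidden concerns have been revealed and addressed via theory-grounded turn-level communication signals, enabling process-aware evaluation of both task success and the interaction that leads to it. This provides a new framework for assessing the confirmation and intervention abilities of dialogue systems.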
Abstract: Patient-clinician communication is an asymmetric-information problem: patients often do not disclose fears, misconceptions, or practical barriers unless clinicians elicit them skillfully. Effective medical dialogue therefore requires reasoning under partial observability: clinicians must elicit latent concerns, confirm them through interaction, and respond in ways that guide patients toward appropriate care. However, existing medical dialogue benchmarks largely sidestep this challenge by exposing hidden patient state, collapsing elicitation into extraction, or evaluating responses without modeling what remains hidden. We present MedConceal, a benchmark with an interactive patient simulator for evaluating hidden-concern reasoning in medical dialogue, comprising 300 curated cases and 600 clinician-LLM interactions. Built from clinician-answered online health discussions, each case pairs clinician-visible context with simulator-internal hidden concerns derived from prior literature and structured using an expert-developed taxonomy. The simulator withholds these concerns from the dialogue agent, tracks whether they have been revealed and addressed via theory-grounded turn-level communication signals, and is clinician-reviewed for clinical plausibility. This enables process-aware evaluation of both task success and the interaction process that leads to it. We study two abilities: confirmation, surfacing hidden concerns through multi-turn dialogue, and intervention, addressing the primary concern and guiding the patient toward a target plan. Results show that no single system dominates: frontier models lead on different confirmation metrics, while human clinicians (N=159) remain strongest on intervention success. Together, these results identify hidden-concern reasoning under partial observability as a key unresolved challenge for medical dialogue systems.
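[8] Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching cs.CL | cs.AI | cs.DB | cs.MA | cs.SC
Cyrus Zhou, Yufei Jin, Yilin Xu, Yu-Chiang Wang, Chieh-Ju Chao
TL;DR: This paper proposes SatIR, a scalable clinical trial retrieval method based on constraint satisfaction. It uses Satisfiability Modulo Theories (SMT) and relational algebra to efficiently represent and match key constraints from clinical trials and patient records, and uses large language models (LLMs) to convert informal clinical reasoning into explicit, controllable formal constraints, enabling high-precision, interpretable matching of patients to trials.
Details
Motivation: Existing clinical trial retrieval based on keyword and embedding-similarity matching often suffers from low recall, low precision, and poor interpretability when constraints are complex, which contributes to enrollment difficulties. The paper aims to make patient-trial matching more efficient and effective.
Result: Evaluated on 59 patients and 3,621 trials, SatIR outperforms TrialGPT on all three evaluated retrieval objectives: it retrieves 32%-72% more relevant-and-eligible trials per patient, improves recall over the union of useful trials by 22-38 percentage points, and serves more patients with at least one useful trial. Retrieval is fast, taking 2.95 seconds per patient over 3,621 trials.
Insight: The novelty is combining formal methods (SMT, relational algebra) with LLMs to turn ambiguity, implicit assumptions, and incomplete information in clinical text into explicit, controllable formal constraints, yielding retrieval that is high-recall, high-precision, scalable, and interpretable. This suggests a new paradigm for retrieval over complex, structured information.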
Abstract: Clinical trials are central to evidence-based medicine, yet many struggle to meet enrollment targets, despite the availability of over half a million trials listed on ClinicalTrials.gov, which attracts approximately two million users monthly. Existing retrieval techniques, largely based on keyword and embedding-similarity matching between patient profiles and eligibility criteria, often struggle with low recall, low precision, and limited interpretability due to complex constraints. We propose SatIR, a scalable clinical trial retrieval method based on constraint satisfaction, enabling high-precision and interpretable matching of patients to relevant trials. Our approach uses formal methods – Satisfiability Modulo Theories (SMT) and relational algebra – to efficiently represent and match key constraints from clinical trials and patient records. Beyond leveraging established medical ontologies and conceptual models, we use Large Language Models (LLMs) to convert informal reasoning regarding ambiguity, implicit clinical assumptions, and incomplete patient records into explicit, precise, controllable, and interpretable formal constraints. Evaluated on 59 patients and 3,621 trials, SatIR outperforms TrialGPT on all three evaluated retrieval objectives. It retrieves 32%-72% more relevant-and-eligible trials per patient, improves recall over the union of useful trials by 22-38 points, and serves more patients with at least one useful trial. Retrieval is fast, requiring 2.95 seconds per patient over 3,621 trials. These results show that SatIR is scalable, effective, and interpretable.
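The constraint-satisfaction view of trial matching can be illustrated with a small sketch. It deliberately replaces the SMT solver and relational algebra described in the abstract with plain Python predicates, and it treats missing patient fields as "unknown" rather than as failures, which is one simple way to handle incomplete records. The criteria format, field names, and `check_trial` function are hypothetical and are not taken from SatIR.

```python
import operator

# Comparison operators allowed in this toy criteria format (an assumption).
OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq, "!=": operator.ne}

def check_trial(patient, criteria):
    """Return 'eligible', 'ineligible', or 'unknown' for one trial.

    `criteria` is a list of (field, op, value) constraints that must all hold.
    """
    unknown = False
    for field, op, value in criteria:
        if field not in patient:
            unknown = True  # missing data: this constraint cannot be decided
            continue
        if not OPS[op](patient[field], value):
            return "ineligible"  # one violated constraint rules the trial out
    return "unknown" if unknown else "eligible"

patient = {"age": 62, "diagnosis": "HCC"}
trials = {
    "trial_a": [("age", ">=", 18), ("diagnosis", "==", "HCC")],
    "trial_b": [("age", "<=", 50)],
    "trial_c": [("age", ">=", 18), ("ecog", "<=", 1)],
}
for name, criteria in trials.items():
    print(name, check_trial(patient, criteria))
```

Keeping "unknown" separate from "ineligible" favors recall, since a trial is only discarded when a constraint is demonstrably violated.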
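[9] GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification cs.CL
Faxian Wan, Xiaocui Yang, Yifan Cao, Shi Feng, Daling Wang
TL;DR: This paper proposes GRASP, a framework that combines visual grounding with explicit chain-of-thought reasoning to address fine-grained target localization in Multimodal Sarcasm Target Identification. It builds the MSTI-MAX dataset to mitigate class imbalance and uses a dual-stage optimization strategy to improve model performance.
Details
Motivation: Traditional multimodal sarcasm detection performs only binary classification, whereas Multimodal Sarcasm Target Identification requires precise localization of fine-grained targets such as textual phrases and visual regions. Existing methods rely on implicit cross-modal alignment, with limited interpretability and suboptimal localization.
Result: Experiments show that GRASP outperforms existing baselines in fine-grained multimodal sarcasm target identification, and an LLM-as-a-Judge evaluation quantitatively measures the quality of its internal reasoning chains.
Insight: Contributions include Grounded CoT reasoning, which explicitly anchors sarcasm-related visual regions in the reasoning trajectory, and a dual-stage joint optimization strategy: supervised fine-tuning with a coordinate-aware weighted loss, followed by Fine-Grained Target Policy Optimization.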
Abstract: Moving beyond the traditional binary classification paradigm of Multimodal Sarcasm Detection, Multimodal Sarcasm Target Identification (MSTI) presents a more formidable challenge, requiring precise localization of fine-grained targets such as textual phrases and visual regions. Existing approaches predominantly rely on implicit cross-modal alignment, offering limited interpretability and suboptimal fine-grained localization. To address these limitations, we propose GRASP, Grounded Chain-of-Thought ReAsoning with Dual-Stage Optimization for Multimodal Sarcasm Prediction and Target Identification, a framework that integrates visual grounding with explicit Chain-of-Thought (CoT) reasoning to move beyond black-box MSTI. Specifically, we curate MSTI-MAX, a refined dataset that mitigates class imbalance and enriches multimodal sarcasm cues. We introduce Grounded CoT reasoning, which explicitly anchors sarcasm-related visual regions within the reasoning trajectory and prompts the model to articulate rationales before predicting the final classification labels and sarcasm targets. Furthermore, we employ a dual-stage outcome-supervised joint optimization strategy: Supervised Fine-Tuning with a coordinate-aware weighted loss, followed by Fine-Grained Target Policy Optimization. Extensive experiments demonstrate that GRASP outperforms existing baselines in fine-grained sarcasm target identification across modalities, and an LLM-as-a-Judge evaluation quantitatively measures the quality of internal reasoning chains. Our dataset and source code will be released on GitHub.
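[10] MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits cs.CL | cs.IR
Yixin Xiang, Yunshan Ma, Xiaoyu Du, Yibing Chen, Yanxin Zhang
TL;DR: This paper proposes MAB-DQA, a multi-armed-bandit-based document question answering framework. It targets the problem that multimodal retrieval-augmented generation retrieves only a few candidate pages and so overlooks informative but less visually salient content. The method decomposes the query into aspect-aware subqueries, dynamically estimates the importance of each subquery, and reallocates the retrieval budget, making more effective use of large numbers of page images for answer generation.
Details
Motivation: In multimodal retrieval-augmented generation, visual document question answering struggles to use many images effectively because the retrieval stage typically retains only a few candidate pages, so informative but less visually salient content is overlooked in favor of common yet low-information pages.
Result: On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding.
Insight: The novelty is decomposing the query into aspect-aware subqueries and using a multi-armed bandit framework to model the varying importance of multiple implicit aspects in a query, reallocating the retrieval budget through an exploration-exploitation policy to better integrate informative pages and their correlations.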
Abstract: Document Question Answering (DQA) involves generating answers from a document based on a user’s query, representing a key task in document understanding. This task requires interpreting visual layouts, which has prompted recent studies to adopt multimodal Retrieval-Augmented Generation (RAG) that processes page images for answer generation. However, in multimodal RAG, visual DQA struggles to utilize a large number of images effectively, as the retrieval stage often retains only a few candidate pages (e.g., Top-4), causing informative but less visually salient content to be overlooked in favor of common yet low-information pages. To address this issue, we propose a Multi-Armed Bandit-based DQA framework (MAB-DQA) to explicitly model the varying importance of multiple implicit aspects in a query. Specifically, MAB-DQA decomposes a query into aspect-aware subqueries and retrieves an aspect-specific candidate set for each. It treats each subquery as an arm and uses preliminary reasoning results from a small number of representative pages as reward signals to estimate aspect utility. Guided by an exploration-exploitation policy, MAB-DQA dynamically reallocates retrieval budgets toward high-value aspects. With the most informative pages and their correlations, MAB-DQA generates the expected results. On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding. Code at https://github.com/ElephantOH/MAB-DQA.
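The arm-per-subquery idea can be sketched with a textbook bandit policy. The example below uses UCB1 as the exploration-exploitation rule; the abstract does not say which policy MAB-DQA uses or how rewards are computed, so UCB1, the fixed reward table, and the names `allocate_budget` and `reward_fn` are assumptions made purely for illustration.

```python
import math

def allocate_budget(arms, reward_fn, budget):
    """Spend `budget` pulls across `arms` with UCB1; return pulls per arm.

    Each arm stands for one aspect-aware subquery; `reward_fn(arm)` stands in
    for the reward signal from preliminary reasoning on a retrieved page.
    """
    counts = {a: 0 for a in arms}
    totals = {a: 0.0 for a in arms}
    for t in range(1, budget + 1):
        untried = [a for a in arms if counts[a] == 0]
        if untried:
            arm = untried[0]  # try every arm once before using the UCB score
        else:
            # Mean reward plus an exploration bonus that shrinks with more pulls.
            arm = max(
                arms,
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        totals[arm] += reward_fn(arm)
        counts[arm] += 1
    return counts

# Toy, deterministic rewards: one aspect is far more useful than the others.
utility = {"aspect_price": 0.9, "aspect_date": 0.1, "aspect_author": 0.1}
pulls = allocate_budget(list(utility), utility.get, budget=30)
print(pulls)
```

With these toy rewards, most of the budget flows to the high-utility aspect while the others still receive occasional exploratory pulls.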
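[11] Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models cs.CL
Shun Zou, Yong Wang, Zehui Chen, Lin Chen, Chongyang Tao
TL;DR: This paper proposes Anchor-based History-stable Decoding (AHD), a training-free, plug-and-play dynamic decoding strategy that addresses the block-boundary constraint inherent in semi-autoregressive decoding for diffusion large language models (dLLMs). AHD monitors the stability trend of tokens in real time through dynamic anchors and starts early cross-block decoding once a token becomes stable, improving both performance and inference efficiency in experiments across language, vision-language, and audio-language domains.
Details
Motivation: Semi-autoregressive decoding is widely used in base dLLMs and advanced decoding strategies for its superior performance, but the authors observe an inherent block constraint that unnecessarily delays the decoding of many cross-block stable tokens, creating an efficiency bottleneck.
Result: On the BBH benchmark, the method reduces decoding steps by 80% while improving performance by 3.67%. Extensive cross-domain experiments show that AHD improves both performance and inference efficiency and reverses the performance degradation typically seen in existing advanced decoding acceleration strategies.
Insight: The paper systematically investigates how to identify stable tokens and builds AHD on three key findings: naive lookahead decoding is unreliable, token stability closely correlates with convergence trend, and historical information is isolated. Its core idea is to monitor stability trends with dynamic anchors to enable early cross-block decoding, in a training-free, plug-and-play manner.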
Abstract: Diffusion Large Language Models (dLLMs) have recently become a promising alternative to autoregressive large language models (ARMs). Semi-autoregressive (Semi-AR) decoding is widely employed in base dLLMs and advanced decoding strategies due to its superior performance. However, our observations reveal that Semi-AR decoding suffers from inherent block constraints, which cause the decoding of many cross-block stable tokens to be unnecessarily delayed. To address this challenge, we systematically investigate the identification of stable tokens and present three key findings: (1) naive lookahead decoding is unreliable, (2) token stability closely correlates with convergence trend, and (3) historical information is isolated. Building on these insights, we propose Anchor-based History-stable Decoding (AHD), a training-free, plug-and-play dynamic decoding strategy. Specifically, AHD monitors the stability trend of tokens in real time through dynamic anchors. Once a token reaches stability, it initiates early cross-block decoding to enhance efficiency and performance. Extensive experiments across language, vision-language, and audio-language domains demonstrate that AHD simultaneously improves both performance and inference efficiency. Notably, AHD effectively reverses the performance degradation typically observed in existing advanced decoding acceleration strategies. For instance, on the BBH benchmark, our approach reduces decoding steps by 80% while improving performance by 3.67%.
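[12] Litmus (Re)Agent: A Benchmark and Agentic System for Predictive Evaluation of Multilingual Models cs.CL | cs.AI | cs.HC | cs.MA
Avni Mittal, Shanu Kumar, Sandipan Dandapat, Monojit Choudhury
TL;DR: This paper presents a benchmark and an agentic system for predictive multilingual evaluation, addressing the difficulty of directly assessing task performance in a target language when evaluation coverage is sparse and evidence is incomplete in multilingual model deployment.
Details
Motivation: Predictive evaluation is a common problem in multilingual deployment: when direct benchmark results are missing for a target language, a model's performance on a task in that language must be estimated from incomplete literature evidence.
Result: On a controlled benchmark of 1,500 questions spanning six tasks and five evidence scenarios, the proposed Litmus (Re)Agent achieves the best overall performance among six systems, with the largest gains in transfer-heavy scenarios where direct evidence is weak or absent.
Insight: The novelty is a DAG-orchestrated agentic system that decomposes queries into hypotheses, retrieves evidence, and synthesizes predictions through feature-aware aggregation, showing that structured agentic reasoning is effective for multilingual performance estimation under incomplete evidence.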
Abstract: We study predictive multilingual evaluation: estimating how well a model will perform on a task in a target language when direct benchmark results are missing. This problem is common in multilingual deployment, where evaluation coverage is sparse and published evidence is uneven across languages, tasks, and model families. We introduce a controlled benchmark of 1,500 questions spanning six tasks and five evidence scenarios. The benchmark separates accessible evidence from ground truth, enabling evaluation of systems that must infer missing results from incomplete literature evidence. We also present Litmus (Re)Agent, a DAG-orchestrated agentic system that decomposes queries into hypotheses, retrieves evidence, and synthesises predictions through feature-aware aggregation. Across six systems, Litmus (Re)Agent achieves the best overall performance, with the largest gains in transfer-heavy scenarios where direct evidence is weak or absent. These results show that structured agentic reasoning is a promising approach to multilingual performance estimation under incomplete evidence.
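[13] PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment cs.CL | cs.AI
Jihwan Oh, Soowon Oh, Murad Aghazada, Minchan Jeong, Sungnyun Kim
TL;DR: This paper proposes PerMix-RLVR, which addresses the trade-off between sensitivity to persona prompts and persona fidelity when large language models are trained with reinforcement learning with verifiable rewards (RLVR). By mixing different personas during training, the method preserves robustness to harmful persona variation while enabling faithful persona adoption when required.
Details
Motivation: Persona prompting can steer LLM behavior, but finding an optimal persona is time-consuming and its effects are unstable. Prior work has mostly addressed this at inference time through prompting strategies, incurring extra computation. The paper tackles persona sensitivity during training so that models adapt to diverse personas while preserving task performance.
Result: PerMix-RLVR improves the persona stability score (PSS) over RLVR by 21.2% on MATH500 and improves persona fidelity by 11.4% on PersonaGym.
Insight: The novelty is a persona-mixed RLVR strategy (PerMix-RLVR). The work reveals an inherent trade-off between robustness and expressive fidelity in outcome-based optimization and mitigates it by mixing personas during training, yielding models that are robust to persona variation yet faithful when persona adoption is required.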
Abstract: Persona prompting has been widely adopted to steer the behavior of large language models (LLMs) and improve their instruction performance by assigning specific characters. However, identifying an optimal persona is time-consuming, and its impact on output quality remains poorly understood. Prior work has mainly addressed this issue at the prompt level via inference-time strategies, incurring additional computation. In this work, we avoid inference-time prompt search by tackling persona sensitivity during training, aiming to train models that adapt their behavior to diverse personas while preserving task performance. In particular, we find that reinforcement learning with verifiable rewards (RLVR) systematically reduces sensitivity to persona prompts, but also reveals an inherent trade-off of outcome-based optimization: while RLVR improves robustness on tasks with verifiable goals, it can also degrade persona expressivity when needed, e.g., in-character role-playing. To address this limitation, we propose PerMix-RLVR, a persona-mixed RLVR strategy that mitigates the persona robustness-fidelity trade-off, preserving strong robustness to harmful persona variation while enabling faithful persona adoption when required. Concretely, PerMix-RLVR improves persona stability score (PSS) over RLVR by +21.2% on MATH500, while also enhancing persona fidelity by +11.4% on PersonaGym.
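[14] ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering cs.CL | cs.AI | cs.LG
Xiaoke Guo, Songze Li, Zhiqiang Liu, Zhaoyan Gong, Yuanxiang Liu
TL;DR: This paper proposes ASTRA, an Adaptive Semantic Tree Reasoning Architecture that addresses the table serialization bottleneck in complex table question answering. It has two core modules. AdaSTR uses the global semantic awareness of LLMs to reconstruct tables into Logical Semantic Trees, explicitly modeling hierarchical dependencies and adapting the construction strategy to table scale. DuTR is a dual-mode reasoning framework that combines tree-search-based textual navigation (for linguistic alignment) with symbolic code execution (for precise verification).
Details
Motivation: Existing table serialization methods face structural neglect, representation gaps, and reasoning opacity in complex table question answering. They fail to capture explicit hierarchies and lack schema flexibility, while existing tree-based approaches have limited semantic adaptability. The paper aims to address these limitations.
Result: Experiments on complex table benchmarks show that the method achieves state-of-the-art performance.
Insight: The contribution is a complete architecture combining adaptive semantic tree construction (AdaSTR) with dual-mode reasoning (DuTR). Its core ideas are to model a table's hierarchy explicitly as a Logical Semantic Tree and to use a hybrid reasoning strategy that pairs textual navigation (soft reasoning) with symbolic execution (hard verification). This helps handle both semantic understanding and exact computation, improving robustness and accuracy in complex table question answering.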
Abstract: Table serialization remains a critical bottleneck for Large Language Models (LLMs) in complex table question answering, hindered by challenges such as structural neglect, representation gaps, and reasoning opacity. Existing serialization methods fail to capture explicit hierarchies and lack schema flexibility, while current tree-based approaches suffer from limited semantic adaptability. To address these limitations, we propose ASTRA (Adaptive Semantic Tree Reasoning Architecture) including two main modules, AdaSTR and DuTR. First, we introduce AdaSTR, which leverages the global semantic awareness of LLMs to reconstruct tables into Logical Semantic Trees. This serialization explicitly models hierarchical dependencies and employs an adaptive mechanism to optimize construction strategies based on table scale. Second, building on this structure, we present DuTR, a dual-mode reasoning framework that integrates tree-search-based textual navigation for linguistic alignment and symbolic code execution for precise verification. Experiments on complex table benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance.
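[15] CONDESION-BENCH: Conditional Decision-Making of Large Language Models in Compositional Action Space cs.CL | cs.AI
Yeonjun Hwang, Sungyong Park, Minju Kim, Dongha Lee, Jinyoung Yeo
TL;DR: This paper introduces CONDESION-BENCH, a benchmark for evaluating the conditional decision-making of large language models in compositional action space. It is designed to overcome two shortcomings of existing benchmarks: limited sets of candidate actions and neglect of explicit conditions.
Details
Motivation: Existing decision-making benchmarks assume that actions come from a predefined finite set and do not incorporate explicit conditions restricting action feasibility into the decision process, so they fail to capture the compositional structure and constraints of real-world actions.
Result: Through oracle-based evaluation of decision quality and condition adherence, the benchmark provides a more rigorous assessment of LLMs as decision-support tools.
Insight: The novelty is defining actions as allocations to decision variables and introducing explicit conditions at the variable, contextual, and allocation levels, which models compositional decision spaces more realistically.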
Abstract: Large language models have been widely explored as decision-support tools in high-stakes domains due to their contextual understanding and reasoning capabilities. However, existing decision-making benchmarks rely on two simplifying assumptions: actions are selected from a finite set of pre-defined candidates, and explicit conditions restricting action feasibility are not incorporated into the decision-making process. These assumptions fail to capture the compositional structure of real-world actions and the explicit conditions that constrain their validity. To address these limitations, we introduce CONDESION-BENCH, a benchmark designed to evaluate conditional decision-making in compositional action space. In CONDESION-BENCH, actions are defined as allocations to decision variables and are restricted by explicit conditions at the variable, contextual, and allocation levels. By employing oracle-based evaluation of both decision quality and condition adherence, we provide a more rigorous assessment of LLMs as decision-support tools.
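[16] NyayaMind- A Framework for Transparent Legal Reasoning and Judgment Prediction in the Indian Legal System cs.CL | cs.AI | cs.LG
Parjanya Aditya Shukla, Shubham Kumar Nigam, Debtanu Datta, Balaramamahanthi Deepak Patnaik, Noel Shallum
TL;DR: This paper presents NyayaMind, an open-source framework for transparent legal reasoning and judgment prediction in the Indian judicial system. The framework integrates retrieval, reasoning, and verification mechanisms to emulate the structured decision-making process of courts, aiming to improve explanation quality and evidence alignment in judgment prediction.
Details
Motivation: In Court Judgment Prediction and Explanation, existing methods fall short in generating transparent, structured legal reasoning consistent with judicial practice. The work aims to build trustworthy AI-assisted legal decision support.
Result: Extensive experiments and expert evaluation show that NyayaMind significantly improves explanation quality and evidence alignment compared with existing CJPE approaches.
Insight: The novelty is combining retrieval-augmented generation with reasoning-oriented LLMs fine-tuned for the Indian legal domain in an integrated framework that emulates the structured decision-making process of courts, with an emphasis on transparency and scalability.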
Abstract: Court Judgment Prediction and Explanation (CJPE) aims to predict a judicial decision and provide a legally grounded explanation for a given case based on the facts, legal issues, arguments, cited statutes, and relevant precedents. For such systems to be practically useful in judicial or legal research settings, they must not only achieve high predictive performance but also generate transparent and structured legal reasoning that aligns with established judicial practices. In this work, we present NyayaMind, an open-source framework designed to enable transparent and scalable legal reasoning for the Indian judiciary. The proposed framework integrates retrieval, reasoning, and verification mechanisms to emulate the structured decision-making process typically followed in courts. Specifically, NyayaMind consists of two main components: a Retrieval Module and a Prediction Module. The Retrieval Module employs a RAG pipeline to identify legally relevant statutes and precedent cases from large-scale legal corpora, while the Prediction Module utilizes reasoning-oriented LLMs fine-tuned for the Indian legal domain to generate structured outputs including issues, arguments, rationale, and the final decision. Our extensive results and expert evaluation demonstrate that NyayaMind significantly improves the quality of explanation and evidence alignment compared to existing CJPE approaches, providing a promising step toward trustworthy AI-assisted legal decision support systems.
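[17] Hierarchical Alignment: Enforcing Hierarchical Instruction-Following in LLMs through Logical Consistency cs.CL
Shu Yang, Zihao Zhou, Di Wang, Wenda Li
TL;DR: This paper proposes Neuro-Symbolic Hierarchical Alignment (NSHA) to handle the benign instruction conflicts that can arise when an LLM receives instructions from sources with different authority levels (system policies, user requests, tool outputs, and retrieved context). The method explicitly models and enforces instruction priorities. At inference time, solver-guided reasoning casts instruction resolution as a constraint satisfaction problem and derives a maximally consistent set of applicable instructions under hierarchical constraints. At training time, solver-based decisions are distilled into the model parameters using automatically constructed supervision.
Details
Motivation: Prior work on instruction hierarchy focuses mainly on adversarial attacks and overlooks the benign but common instruction conflicts of real-world applications. In these settings, models must not only avoid security violations but also preserve task utility and behavioral consistency when instructions partially or implicitly conflict.
Result: Evaluations on rule following, task execution, tool use, and safety, covering both single-turn and multi-turn interactions, show that NSHA significantly improves performance under instruction conflicts while maintaining competitive utility in reference settings.
Insight: The novelty lies in formalizing hierarchical instruction following as a constraint satisfaction problem and using a neuro-symbolic approach (solver-guided reasoning combined with knowledge distillation) to model instruction priorities explicitly, yielding logically consistent behavior under conflict. This offers a systematic solution for LLM systems that must handle complex, multi-source instructions.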
Abstract: Large language models increasingly operate under multiple instructions from heterogeneous sources with different authority levels, including system policies, user requests, tool outputs, and retrieved context. While prior work on instruction hierarchy highlights the importance of respecting instruction priorities, it mainly focuses on adversarial attacks and overlooks the benign but common instruction conflicts that arise in real-world applications. In such settings, models must not only avoid security violations but also preserve task utility and behavioral consistency when instructions partially or implicitly conflict. We propose Neuro-Symbolic Hierarchical Alignment (NSHA) for hierarchical instruction-following by explicitly modeling and enforcing instruction priorities. At inference time, we introduce solver-guided reasoning that formulates instruction resolution as a constraint satisfaction problem, enabling the model to derive a maximally consistent set of applicable instructions under hierarchical constraints. At training time, NSHA distills solver-based decisions into model parameters using automatically constructed supervision. We evaluate our approach on rule following, task execution, tool use, and safety, covering both single-turn and multi-turn interactions, and show that NSHA significantly improves performance under such conflicts while maintaining competitive utility in reference settings.
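Illustrative sketch (not the paper's solver): the abstract describes resolving instructions into a maximally consistent set under a priority hierarchy. A very small stand-in for that idea is a greedy pass in authority order that drops any instruction contradicting a higher-authority one. The `PRIORITY` ranks, the `Instruction` fields, and the key/value conflict model are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical authority ranks; a lower number means higher authority.
PRIORITY = {"system": 0, "user": 1, "tool": 2, "retrieved": 3}


@dataclass(frozen=True)
class Instruction:
    source: str  # one of the PRIORITY keys
    key: str     # the behavior this instruction constrains (assumed model)
    value: str   # the setting it requires for that behavior


def resolve(instructions):
    """Keep instructions in authority order, dropping any that contradict
    a higher-authority instruction that was already accepted."""
    accepted = {}
    kept = []
    for ins in sorted(instructions, key=lambda i: PRIORITY[i.source]):
        # An instruction is consistent if its key is unset or already agrees.
        if accepted.get(ins.key, ins.value) == ins.value:
            accepted[ins.key] = ins.value
            kept.append(ins)
    return kept


example = [
    Instruction("user", "language", "French"),
    Instruction("system", "language", "English"),
    Instruction("user", "format", "bullet list"),
    Instruction("retrieved", "format", "table"),
]
resolved = resolve(example)
for ins in resolved:
    print(ins.source, ins.key, ins.value)
```

A real constraint solver would also handle partial and implicit conflicts, which this key/value model cannot express.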
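[18] Think Less, Know More: State-Aware Reasoning Compression with Knowledge Guidance for Efficient Reasoning cs.CL
Yi Sui, Chaozhuo Li, Dawei Song
TL;DR: This paper proposes STACK (State-Aware Reasoning Compression with Knowledge Guidance), a framework that targets overthinking in Large Reasoning Models (LRMs) during long Chain-of-Thought (CoT) reasoning. Through state-aware, step-wise compression and knowledge guidance, it substantially improves reasoning efficiency while preserving accuracy.
Details
Motivation: LRMs rely on long CoT for complex tasks but often overthink, producing excessive reasoning steps and high inference latency. Existing CoT compression methods struggle to balance accuracy and efficiency and lack fine-grained, step-level adaptation to redundancy and reasoning bias.
Result: Experiments on three mathematical reasoning benchmarks show that STACK achieves a superior accuracy-efficiency balance, reducing average response length by 59.9% while improving accuracy by 4.8 points over existing methods.
Insight: The novelty has three parts: explicitly modeling stage-specific redundancy sources and combining this with retrieval-augmented knowledge guidance for step-wise compression; constructing online long-short contrastive samples and dynamically switching between knowledge-guided and self-prompted compression, complemented by answer-convergence-based early stopping; and a reward-difference-driven training strategy that combines PPO and DPO so that the model learns state-conditioned compression strategies.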
Abstract: Large Reasoning Models (LRMs) achieve strong performance on complex tasks by leveraging long Chain-of-Thought (CoT), but often suffer from overthinking, leading to excessive reasoning steps and high inference latency. Existing CoT compression methods struggle to balance accuracy and efficiency, and lack fine-grained, step-level adaptation to redundancy and reasoning bias. Therefore, we propose State-Aware Reasoning Compression with Knowledge Guidance (STACK), a framework that performs step-wise CoT compression by explicitly modeling stage-specific redundancy sources and integrating retrieval-augmented guidance. STACK constructs online long-short contrastive samples and dynamically switches between knowledge-guided compression for uncertain or biased reasoning states and self-prompted compression for overly long but confident states, complemented by an answer-convergence-based early stopping mechanism to suppress redundant verification. We further propose a reward-difference-driven training strategy that combines Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), enabling models to learn state-conditioned compression strategies. Experiments on three mathematical reasoning benchmarks show that STACK achieves a superior accuracy-efficiency balance, reducing average response length by 59.9% while improving accuracy by 4.8 points over existing methods.
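Illustrative sketch (an assumption, not STACK's actual criterion): the abstract mentions answer-convergence-based early stopping to suppress redundant verification. One simple reading is to stop once the intermediate answer has stayed unchanged for a fixed number of consecutive steps. The function name and the `patience` parameter are hypothetical.

```python
def steps_until_convergence(intermediate_answers, patience=2):
    """Return how many reasoning steps are consumed before stopping.

    intermediate_answers[i] is the answer extracted after step i
    (None when no answer can be extracted yet). Reasoning stops once
    the answer has repeated `patience` times in a row.
    """
    streak = 0
    for i, answer in enumerate(intermediate_answers):
        if i > 0 and answer is not None and answer == intermediate_answers[i - 1]:
            streak += 1
        else:
            streak = 0
        if streak >= patience:
            return i + 1  # steps are 0-indexed, so i + 1 steps were used
    return len(intermediate_answers)


trace = [None, "12", "14", "14", "14", "14", "14"]
used = steps_until_convergence(trace, patience=2)
print(f"stopped after {used} of {len(trace)} steps")
```

Larger `patience` values trade efficiency for a lower risk of stopping on a premature answer.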
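[19] Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG cs.CL
Passant Elchafei, Monorama Swain, Shahed Masoudian, Markus Schedl
TL;DR: This paper proposes a facet-level diagnostic framework for question answering that traces the sources of evidence uncertainty and hallucination in RAG systems. The framework decomposes each question into atomic reasoning facets and builds a Facet x Chunk matrix to assess evidence sufficiency and faithfulness for each facet. By comparing three inference modes (Strict RAG, Soft RAG, and LLM-only generation), it systematically diagnoses misalignment between retrieval and generation.
Details
Motivation: Existing RAG evaluations focus on answer-level or passage-level accuracy and reveal little about how evidence is used during generation. Hallucinated answers remain common even when relevant documents are retrieved, so finer-grained diagnostic tools are needed to analyze the root causes of evidence integration failures.
Result: On medical QA and HotpotQA, evaluation of open-source and closed-source LLMs (GPT, Gemini, and LLaMA) shows that hallucinations in RAG systems stem mainly from failures to integrate retrieved evidence during generation rather than from retrieval accuracy itself. Facet-level analysis reveals failure modes hidden under answer-level evaluation, including systematic evidence absence, evidence misalignment, and prior-driven overrides.
Insight: The novelty is a fine-grained, facet-level diagnostic framework that quantifies evidence sufficiency and faithfulness through a structured matrix, together with three controlled inference modes designed to isolate and diagnose retrieval-generation misalignment. This provides an interpretable analysis tool for understanding and improving evidence integration in RAG systems.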
Abstract: Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet x Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three open-source and closed-source LLMs (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.
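Illustrative sketch (assumptions throughout): the abstract describes a Facet x Chunk matrix that combines retrieval relevance with NLI-based faithfulness scores. The abstract does not say how the two are combined, so this example multiplies them, and the thresholds and label names are hypothetical. Scores are hard-coded stand-ins for retriever and NLI model outputs.

```python
def facet_chunk_matrix(relevance, entailment):
    """Combine per-(facet, chunk) relevance and NLI entailment scores.

    Both inputs are lists of rows, one row per facet and one column
    per retrieved chunk. The product is an assumed combination rule.
    """
    return [
        [r * e for r, e in zip(rel_row, ent_row)]
        for rel_row, ent_row in zip(relevance, entailment)
    ]


def diagnose(relevance, entailment, rel_thresh=0.5, support_thresh=0.5):
    """Assign each facet a coarse label based on its row of scores."""
    matrix = facet_chunk_matrix(relevance, entailment)
    labels = []
    for rel_row, score_row in zip(relevance, matrix):
        if max(rel_row) < rel_thresh:
            labels.append("evidence_absent")  # nothing relevant was retrieved
        elif max(score_row) < support_thresh:
            labels.append("misaligned")       # relevant chunk, unsupported answer
        else:
            labels.append("grounded")
    return labels


# Three facets, two chunks each (made-up scores).
relevance = [[0.9, 0.1], [0.2, 0.3], [0.8, 0.7]]
entailment = [[0.9, 0.5], [0.9, 0.9], [0.2, 0.3]]
labels = diagnose(relevance, entailment)
print(labels)
```

The "misaligned" label corresponds to the case the abstract highlights: relevant evidence is retrieved but not correctly integrated during generation.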
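[20] Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies cs.CL | cs.AI | cs.LG
Avni Mittal
TL;DR: This paper proposes the Symbolic-Neural Consistency Audit (SNCA), a framework for evaluating whether LLMs follow their own self-stated safety policies. The framework extracts a model's self-stated safety rules via structured prompts, formalizes them as typed predicates, and measures behavioral compliance through deterministic comparison against harm benchmarks. An evaluation of four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior.
Details
Motivation: LLMs internalize safety policies through RLHF, but these policies are never formally specified and are hard to inspect. Existing benchmarks evaluate models against external standards but cannot measure whether models understand and enforce their own stated boundaries.
Result: Models claiming absolute refusal frequently comply with harmful prompts; reasoning models achieve the highest self-consistency but fail to articulate policies for 29% of categories; and cross-model agreement on rule types is very low (11%). These results indicate that the gap between what models say and what they do is measurable and architecture-dependent.
Insight: The novelty is the SNCA framework, described as the first systematic audit of consistency between an LLM's self-stated policies and its behavior. It exposes a disconnect between how safety policies are formalized and how they are enforced, and offers a complementary evaluation method to behavioral benchmarks.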
Abstract: LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model’s self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), and (3) measures behavioral compliance via deterministic comparison against harm benchmarks. Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior: models claiming absolute refusal frequently comply with harmful prompts, reasoning models achieve the highest self-consistency but fail to articulate policies for 29% of categories, and cross-model agreement on rule types is remarkably low (11%). These results demonstrate that the gap between what LLMs say and what they do is measurable and architecture-dependent, motivating reflexive consistency audits as a complement to behavioral benchmarks.
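Illustrative sketch (not SNCA's actual procedure): the abstract describes formalizing self-stated rules as typed predicates (Absolute, Conditional, Adaptive) and comparing them deterministically with observed behavior. The example below checks only the Absolute case, where any compliance with a harmful prompt counts as a violation. The data layout, category names, and function name are assumptions.

```python
from enum import Enum


class RuleType(Enum):
    ABSOLUTE = "absolute"
    CONDITIONAL = "conditional"
    ADAPTIVE = "adaptive"


def absolute_violation_rate(stated, observed):
    """Fraction of harmful prompts answered despite an Absolute refusal rule.

    stated:   dict mapping harm category -> RuleType the model claimed
    observed: list of (category, refused) pairs from a harm benchmark
    Returns None when no observation falls under an Absolute rule.
    """
    relevant = [
        refused for category, refused in observed
        if stated.get(category) is RuleType.ABSOLUTE
    ]
    if not relevant:
        return None
    violations = sum(1 for refused in relevant if not refused)
    return violations / len(relevant)


stated = {"weapons": RuleType.ABSOLUTE, "medical": RuleType.CONDITIONAL}
observed = [
    ("weapons", True),
    ("weapons", False),
    ("medical", False),
    ("weapons", True),
    ("weapons", False),
]
rate = absolute_violation_rate(stated, observed)
print(f"absolute-rule violation rate: {rate:.2f}")
```

Conditional and Adaptive rules would need the stated conditions to be evaluated per prompt, which this sketch deliberately leaves out.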
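[21] ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery cs.CL
Shahar Levy, Eliya Habba, Reshef Mintz, Barak Raveh, Renana Keydar
TL;DR: ScheMatiQ is an LLM-based system that turns natural-language research questions into structured data. Through interactive schema discovery, it automatically produces an annotation schema and an evidence-grounded database from a large document collection, and provides a web interface through which users can steer and revise the extraction. The system was validated with domain experts in law and computational biology and is released as open source.
Details
Motivation: Answering natural-language research questions over large document collections traditionally requires manually designing an annotation schema and exhaustively labeling the corpus, which is slow and error-prone.
Result: In real-world analyses in law and computational biology, ScheMatiQ's outputs were shown to support practical analytical needs.
Insight: The novelty lies in combining an LLM with an interactive interface to automate and guide the path from a research question to a structured, evidence-grounded database, lowering the barrier for domain experts to obtain structured data while keeping the process open to user revision.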
Abstract: Many disciplines pose natural-language research questions over large document collections whose answers typically require structured evidence, traditionally obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which leverages calls to a backbone LLM to take a question and a corpus and produce a schema and a grounded database, with a web interface that lets users steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: www.ScheMatiQ-ai.com
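[22] Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM cs.CL | cs.LG
Solomiia Bilyk, Volodymyr Getmanskyi, Taras Firman
TL;DR: This paper studies Automated Instruction Revision (AIR), a rule-induction-based method for adapting LLMs to downstream tasks from limited task-specific examples. It compares AIR with a broader set of adaptation strategies (prompt optimization, retrieval-based methods, and fine-tuning) on a diverse benchmark suite designed to stress different task requirements, such as knowledge injection, structured extraction, label remapping, and logical reasoning. The study finds that adaptation performance depends strongly on task type and that no single method dominates across all settings.
Details
Motivation: The goal is to explore and compare LLM task adaptation strategies in order to understand which method (AIR, retrieval, or fine-tuning) works best in which task setting, and so to guide method selection in practice.
Result: Across five benchmarks, AIR was strongest or near-best on label-remapping classification, KNN retrieval performed best on closed-book QA, and fine-tuning dominated structured extraction and event-order reasoning. Performance is therefore task-dependent.
Insight: The claimed novelty is the AIR method together with a systematic, structured comparison. Viewed objectively, the core insight is a clear delineation of where each adaptation method applies: AIR is most promising when task behavior can be captured by compact, interpretable instruction rules, whereas retrieval and fine-tuning are stronger on tasks dominated by source-specific knowledge or dataset-specific annotation regularities. This gives a clear basis for method selection.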
Abstract: This paper studies Automated Instruction Revision (AIR), a rule-induction-based method for adapting large language models (LLMs) to downstream tasks using limited task-specific examples. We position AIR within the broader landscape of adaptation strategies, including prompt optimization, retrieval-based methods, and fine-tuning. We then compare these approaches across a diverse benchmark suite designed to stress different task requirements, such as knowledge injection, structured extraction, label remapping, and logical reasoning. The paper argues that adaptation performance is strongly task-dependent: no single method dominates across all settings. Across five benchmarks, AIR was strongest or near-best on label-remapping classification, while KNN retrieval performed best on closed-book QA, and fine-tuning dominated structured extraction and event-order reasoning. AIR is most promising when task behavior can be captured by compact, interpretable instruction rules, while retrieval and fine-tuning remain stronger in tasks dominated by source-specific knowledge or dataset-specific annotation regularities.
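[23] UIPress: Bringing Optical Token Compression to UI-to-Code Generation cs.CL
Dasen Dai, Shuoqi Li, Ronghao Chen, Huacan Wang, Biao Wu
TL;DR: UIPress is a lightweight optical token compression method for UI-to-Code generation. Using depthwise-separable convolutions, element-guided spatial reweighting, and a Transformer refinement module, it compresses about 6,700 visual tokens to a fixed budget of 256, with Low-Rank Adaptation (LoRA) on the decoder to bridge the representation gap. On the Design2Code benchmark it achieves a CLIP score of 0.8127, which is 7.5% above the uncompressed baseline, along with a 9.1x time-to-first-token speedup.
Details
Motivation: Existing visual token compression for UI-to-Code generation has two weaknesses: inference-time methods either select tokens with task-agnostic heuristics or merely zero out low-attention features, so they neither truly reduce prefill latency nor adapt to the non-uniform information density of UI screenshots. Optical (encoder-side learned) compression works well for document OCR but had not been applied to UI-to-Code generation.
Result: On Design2Code with a 256-token budget, UIPress reaches a CLIP score of 0.8127, outperforming the uncompressed baseline by 7.5% and the strongest inference-time method by 4.6%, while delivering a 9.1x time-to-first-token speedup.
Insight: The novelty includes the first use of encoder-side learned compression for UI-to-Code generation; a lightweight compression module that combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement; and a small footprint of about 21.7M trainable parameters (0.26% of the base model), which balances compression efficiency against output quality.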
Abstract: UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visual token efficiency critical. Existing compression methods either select tokens at inference time using task-agnostic heuristics, or zero out low-attention features without actually shortening the sequence – neither truly reduces prefill latency or adapts to the non-uniform information density of UI screenshots. Meanwhile, optical (encoder-side learned) compression has shown strong results for document OCR, yet no prior work has adapted this paradigm to UI-to-Code generation. We propose UIPress, a lightweight learned compression module inserted between the frozen ViT encoder and the LLM decoder of Qwen3-VL-8B. UIPress combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress ~6,700 visual tokens to a fixed budget of 256. Together with Low-Rank Adaptation (LoRA) on the decoder to bridge the representation gap, the entire system adds only ~21.7M trainable parameters (0.26% of the 8B base model). Under a fair comparison on the same base model against four baselines on Design2Code, UIPress at 256 tokens achieves a CLIP score of 0.8127, outperforming the uncompressed baseline by +7.5% and the strongest inference-time method by +4.6%, while delivering 9.1x time-to-first-token speedup. To the best of our knowledge, UIPress is the first encoder-side learned compression method for the UI-to-Code task.
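[24] From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models cs.CL
Chenchen Zhang
TL;DR: This paper systematically surveys 47 credit assignment methods published between 2024 and early 2026, addressing the difficulty of assigning credit under sparse rewards in LLM reinforcement learning across settings that range from reasoning to agentic interaction. It proposes a two-dimensional taxonomy and contributes resources including a structured paper inventory, a reporting checklist, and a benchmark protocol.
Details
Motivation: In reinforcement learning for LLMs, sparse outcome-level rewards and long trajectories (such as chain-of-thought generation or multi-turn interaction) make credit assignment hard. The problem is especially pronounced in two regimes: reasoning RL and agentic RL.
Result: This is a survey and reports no specific quantitative results. Its analysis indicates that credit assignment for reasoning RL is maturing around process reward models and critic-free group comparison, while agentic RL is driving new approaches such as hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations.
Insight: The novelty is a two-dimensional taxonomy organized by assignment granularity (e.g., token, segment, step) and methodology (e.g., Monte Carlo, temporal difference), plus reusable structured resources. The survey shows how the shift from reasoning to agentic RL reshapes the credit assignment landscape and drives new methods for long-horizon, partially observable environments.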
Abstract: Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards – yet determining which actions within a long trajectory caused the outcome remains difficult. This credit assignment (CA) problem manifests in two regimes: reasoning RL, where credit must be distributed across tokens and steps within a single chain-of-thought generation (500–30K+ tokens); and agentic RL, where multi-turn environment interaction introduces stochastic transitions, partial observability, and horizons of 100+ turns (100K–1M tokens), making episode-level credit increasingly uninformative. We survey 47 CA methods (41 core, 6 adjacent enablers) published between 2024 and early 2026, organizing them in a two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic). Beyond the survey itself, we contribute three reusable resources: (1) a structured, machine-readable paper inventory with taxonomy labels, baseline families, and evidence levels; (2) a reporting checklist for future CA papers, validated against the reviewed literature to identify systematic methodological gaps; and (3) a benchmark protocol specification with task families, metadata requirements, and controlled bifurcation tasks, accompanied by a method selection decision tree. Our synthesis suggests that the shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape: reasoning CA is maturing around process reward models and critic-free group comparison, while agentic CA is driving genuinely new approaches – hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations – that have no direct precedent in reasoning RL.
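Illustrative sketch (one common instance, assumed here rather than taken from the survey): "critic-free group comparison" is often realized by sampling several responses to the same prompt and normalizing each outcome reward against the group, in the style of GRPO. The example also broadcasts each advantage to every token of its response, which shows why outcome-level credit is coarse. Function names are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize outcome rewards within a group of sampled responses.

    No learned critic is involved: the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, lengths):
    """Give every token in a response the same response-level advantage."""
    return [[adv] * n for adv, n in zip(advantages, lengths)]


rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. correct / incorrect final answers
advantages = group_relative_advantages(rewards)
token_credit = broadcast_to_tokens(advantages, lengths=[3, 2, 4, 1])
print([round(a, 3) for a in advantages])
```

Finer-grained methods in the taxonomy (token, segment, step, or turn level) replace the uniform broadcast with credit that differs within a trajectory.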
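[25] RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval cs.CL | cs.AI | cs.IR | cs.LG
Kyle Whitecross, Negin Rahimi
TL;DR: This paper proposes RecaLLM, reasoning language models post-trained to make effective use of long-context information. RecaLLM interleaves reasoning with explicit in-context retrieval to address the "lost-in-thought" phenomenon, in which in-context retrieval performance drops sharply after reasoning steps. It introduces a constrained decoding mechanism with negligible overhead that copies evidence spans verbatim, improving the accuracy of subsequent generation.
Details
Motivation: The interaction between reasoning and in-context retrieval in LLMs is underexplored. In particular, under the "lost-in-thought" phenomenon, reasoning steps improve performance but make subsequent in-context retrieval harder, which limits test-time scaling.
Result: RecaLLM performs strongly on two long-context benchmarks, RULER and HELMET, significantly outperforming baselines. Notably, with training samples of at most 10K tokens it achieves consistent gains at context windows of up to 128K tokens, far shorter training samples than those used by existing long-context approaches.
Insight: The novelty lies in interleaving reasoning with explicit in-context retrieval and in a constrained decoding mechanism that copies evidence exactly. Viewed objectively, the method offers a practical path to better long-context performance without expensive long-context training data, by focusing on the interaction between retrieval and reasoning.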
Abstract: We propose RecaLLM, a set of reasoning language models post-trained to make effective use of long-context information. In-context retrieval, which identifies relevant evidence from context, and reasoning are deeply intertwined: retrieval supports reasoning, while reasoning often determines what must be retrieved. However, their interaction remains largely underexplored. In preliminary experiments on several open-source LLMs, we observe that in-context retrieval performance substantially degrades even after a short reasoning span, revealing a key bottleneck for test-time scaling that we refer to as lost-in-thought: reasoning steps that improve performance also make subsequent in-context retrieval more challenging. To address this limitation, RecaLLM interleaves reasoning with explicit in-context retrieval, alternating between reasoning and retrieving context information needed to solve intermediate subproblems. We introduce a negligible-overhead constrained decoding mechanism that enables verbatim copying of evidence spans, improving the grounding of subsequent generation. Trained on diverse lexical and semantic retrieval tasks, RecaLLM achieves strong performance on two long-context benchmarks, RULER and HELMET, significantly outperforming baselines. Notably, we observe consistent gains at context windows of up to 128K tokens using training samples of at most 10K tokens, far shorter than those used by existing long-context approaches, highlighting a promising path toward improving long-context performance without expensive long-context training data.
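Illustrative sketch (a naive stand-in, not RecaLLM's mechanism): the abstract mentions constrained decoding that enables verbatim copying of evidence spans. One way to picture the constraint is that, while a span is being copied, the only allowed next tokens are those that continue some occurrence of the span-so-far in the context. The function name and whitespace tokenization are assumptions, and this scan is linear in context length per step rather than negligible-overhead.

```python
def allowed_next_tokens(context, copied):
    """Tokens that keep `copied` a verbatim substring of `context`.

    context: list of context tokens
    copied:  list of tokens already emitted for the current evidence span
    An empty result means the span cannot be extended any further.
    """
    n, m = len(context), len(copied)
    if m == 0:
        return set(context)  # a span may start at any context token
    return {
        context[i + m]
        for i in range(n - m)
        if context[i:i + m] == copied
    }


context = "the cat sat on the mat".split()
after_the = allowed_next_tokens(context, ["the"])
after_the_cat = allowed_next_tokens(context, ["the", "cat"])
at_end = allowed_next_tokens(context, ["mat"])
print(sorted(after_the), sorted(after_the_cat), sorted(at_end))
```

In a decoder, the returned set would be used to mask the logits of all other tokens while a copy span is open.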
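[26] Many Ways to Be Fake: Benchmarking Fake News Detection Under Strategy-Driven AI Generation cs.CL | cs.HC
Xinyu Wang, Sai Koneru, Wenbo Zhang, Wenliang Zheng, Saksham Ranjan
TL;DR: This paper addresses fake news detection under strategy-driven AI generation and introduces the MANYFAKE benchmark, which contains 6,798 fake news articles generated through multiple strategy-driven prompting pipelines, to evaluate existing detectors in mixed-truth scenarios.
Details
Motivation: Prior work mostly treats fake news detection as binary classification, whereas modern fake news often arises through human-AI collaboration that embeds strategic falsehoods within otherwise accurate and credible narratives. Existing benchmarks do not adequately cover such mixed-truth cases.
Result: Evaluating a range of state-of-the-art fake news detectors on MANYFAKE shows that even advanced reasoning-enabled models approach saturation on fully fabricated stories but remain brittle when falsehoods are subtle, optimized, and interwoven with accurate information.
Insight: The novelty is a diverse, strategy-driven benchmark for fake news generation that exposes how brittle current detectors are on mixed-truth content, underscoring the need for detection methods that go beyond binary classification and handle complex content produced through human-AI collaboration.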
Abstract: Recent advances in large language models (LLMs) have enabled the large-scale generation of highly fluent and deceptive news-like content. While prior work has often treated fake news detection as a binary classification problem, modern fake news increasingly arises through human-AI collaboration, where strategic inaccuracies are embedded within otherwise accurate and credible narratives. These mixed-truth cases represent a realistic and consequential threat, yet they remain underrepresented in existing benchmarks. To address this gap, we introduce MANYFAKE, a synthetic benchmark containing 6,798 fake news articles generated through multiple strategy-driven prompting pipelines that capture many ways fake news can be constructed and refined. Using this benchmark, we evaluate a range of state-of-the-art fake news detectors. Our results show that even advanced reasoning-enabled models approach saturation on fully fabricated stories, but remain brittle when falsehoods are subtle, optimized, and interwoven with accurate information.
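[27] Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision cs.CL | cs.AI | cs.IR | cs.LG
Soroosh Tayebi Arasteh, Mehdi Joodaki, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn
TL;DR: This paper proposes case-grounded evidence verification, a general framework that targets a known weakness of evidence-grounded reasoning: model decisions that do not actually depend on the evidence. Its core contribution is a supervision construction procedure that automatically generates support examples together with semantically controlled non-support examples, without manual evidence annotation. The authors instantiate the framework in radiology and train a standard verifier, which is shown to rely on the evidence when making decisions.
Details
Motivation: Evidence-grounded reasoning often fails in practice because supervision is weak, evidence is only loosely tied to the claim, and evaluation does not test evidence dependence directly. The paper aims to build a framework in which a model, given a local case context, external evidence, and a structured claim, decides whether the evidence supports the claim, so that decisions depend causally on the evidence.
Result: The verifier instantiated in radiology substantially outperforms case-only and evidence-only baselines. It remains strong under correct evidence and collapses when evidence is removed or swapped, indicating genuine evidence dependence. This behavior transfers to unseen evidence articles and an external case distribution, although performance degrades under evidence-source shift and remains sensitive to backbone choice.
Insight: The main novelty is a supervision construction procedure that needs no manual evidence annotation and automatically generates semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives. Viewed objectively, the key insight is that a major bottleneck in evidence grounding is not only model capacity but also the lack of supervision that encodes the causal role of evidence, which points to a new direction for building more reliable evidence-sensitive models.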
Abstract: Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provided evidence supports the target claim. In practice, this often fails because supervision is weak, evidence is only loosely tied to the claim, and evaluation does not test evidence dependence directly. We introduce case-grounded evidence verification, a general framework in which a model receives a local case context, external evidence, and a structured claim, and must decide whether the evidence supports the claim for that case. Our key contribution is a supervision construction procedure that generates explicit support examples together with semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives, without manual evidence annotation. We instantiate the framework in radiology and train a standard verifier on the resulting support task. The learned verifier substantially outperforms both case-only and evidence-only baselines, remains strong under correct evidence, and collapses when evidence is removed or swapped, indicating genuine evidence dependence. This behavior transfers across unseen evidence articles and an external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice. Overall, the results suggest that a major bottleneck in evidence grounding is not only model capacity, but the lack of supervision that encodes the causal role of evidence.
cs.CV
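[28] Detection of Hate and Threat in Digital Forensics: A Case-Driven Multimodal Approach cs.CV | cs.AI | cs.LG
Ponkoj Chandra Shill
TL;DR: This paper proposes a case-driven multimodal approach for hate and threat detection in digital forensics. It explicitly distinguishes embedded text, associated contextual text, and image-only evidence, and selectively applies text analysis, multimodal fusion, or image-only semantic reasoning depending on the evidence configuration, mirroring forensic decision-making and improving evidentiary traceability.
Details
Motivation: Digital forensic investigations involve heterogeneous evidence (images, scanned documents, and contextual reports) that may contain explicit or implicit expressions of harm. Existing automated approaches often assume clean text input or apply vision models without forensic justification, and do not carefully distinguish the source and configuration of the evidence.
Result: Experimental evaluation on forensic-style image evidence shows consistent and interpretable behavior across heterogeneous evidence scenarios. The abstract does not mention specific benchmarks, quantitative results (such as accuracy or F1), or comparisons with the state of the art.
Insight: The novelty lies in formalizing forensic decision logic as an inference framework conditioned on the evidence configuration. By explicitly determining the presence and source of textual evidence (embedded, associated, or none), it avoids unjustified modality assumptions and makes the detection process more transparent and traceable. Viewed objectively, its selective multimodal fusion with vision language models on ViT backbones suggests a new angle on domain adaptation.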
Abstract: Digital forensic investigations increasingly rely on heterogeneous evidence such as images, scanned documents, and contextual reports. These artifacts may contain explicit or implicit expressions of harm, hate, threat, violence, or intimidation, yet existing automated approaches often assume clean text input or apply vision models without forensic justification. This paper presents a case-driven multimodal approach for hate and threat detection in forensic analysis. The proposed framework explicitly determines the presence and source of textual evidence, distinguishing between embedded text, associated contextual text, and image-only evidence. Based on the identified evidence configuration, the framework selectively applies text analysis, multimodal fusion, or image-only semantic reasoning using vision language models with vision transformer backbones (ViT). By conditioning inference on evidence availability, the approach mirrors forensic decision-making, improves evidentiary traceability, and avoids unjustified modality assumptions. Experimental evaluation on forensic-style image evidence demonstrates consistent and interpretable behavior across heterogeneous evidence scenarios.
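[29] ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction cs.CV
Kun Wang, Yupeng Hu, Zhiran Li, Hao Liu, Qianlong Xiang
TL;DR: This report presents ViSAGE, the champion solution of the NTIRE 2026 Challenge on Video Saliency Prediction at CVPR 2026. ViSAGE is a multi-expert ensemble framework that refines spatio-temporal features through adaptive gating and modulation and fuses the complementary predictions of different experts to capture complex spatio-temporal saliency cues in videos.
Details
Motivation: The aim is to exploit complementary inductive biases for video saliency prediction in order to better capture complex spatio-temporal saliency cues in videos.
Result: On the challenge's private test set, ViSAGE ranked first on two of the four evaluation metrics and outperformed most competing solutions on the other two, demonstrating its effectiveness and generalization ability.
Insight: The novelty is a multi-expert ensemble framework (ViSAGE) that refines features through adaptive gating and modulation and fuses the predictions of different experts, aggregating diverse inductive biases to improve video saliency prediction.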
Abstract: In this report, we present our champion solution for the NTIRE 2026 Challenge on Video Saliency Prediction held in conjunction with CVPR 2026. To exploit complementary inductive biases for video saliency, we propose Video Saliency with Adaptive Gated Experts (ViSAGE), a multi-expert ensemble framework. Each specialized decoder performs adaptive gating and modulation to refine spatio-temporal features. The complementary predictions from different experts are then fused at inference. ViSAGE thereby aggregates diverse inductive biases to capture complex spatio-temporal saliency cues in videos. On the Private Test set, ViSAGE ranked first on two out of four evaluation metrics, and outperformed most competing solutions on the other two metrics, demonstrating its effectiveness and generalization ability. Our code has been released at https://github.com/iLearn-Lab/CVPRW26-ViSAGE.
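[30] MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments cs.CV | cs.AI
Xingming Liao, Ning Chen, Muying Shu, Yunpeng Yin, Peijian Zeng
TL;DR: MARINER is a comprehensive maritime benchmark built under the Entity-Environment-Event (3E) paradigm. It contains 16,629 multi-source maritime images covering 63 fine-grained vessel categories, diverse adverse environments, and 5 typical dynamic maritime incidents, and is used to evaluate fine-grained classification, object detection, and visual question answering.
Details
Motivation: Fine-grained visual understanding and high-level reasoning in real-world open-water environments remain under-explored because dedicated benchmarks are lacking. The paper aims to fill this gap.
Result: Extensive evaluation of mainstream Multimodal Large Language Models (MLLMs) establishes baselines and shows that even advanced models struggle with fine-grained discrimination and causal reasoning in complex marine scenes.
Insight: The novelty is the first 3E-driven benchmark dedicated to the maritime domain. It highlights the need for realistic, cognitive-level evaluation and promotes research on robust vision-language models for open-water applications.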
Abstract: Fine-grained visual understanding and high-level reasoning in real-world open-water environments remain under-explored due to the lack of dedicated benchmarks. We introduce MARINER, a comprehensive benchmark built under the novel Entity-Environment-Event (3E) paradigm. MARINER contains 16,629 multi-source maritime images with 63 fine-grained vessel categories, diverse adverse environments, and 5 typical dynamic maritime incidents, covering fine-grained classification, object detection, and visual question answering tasks. We conduct extensive evaluations on mainstream Multimodal Large Language Models (MLLMs) and establish baselines, revealing that even advanced models struggle with fine-grained discrimination and causal reasoning in complex marine scenes. As a dedicated maritime benchmark, MARINER fills the gap of realistic and cognitive-level evaluation for maritime multimodal understanding, and promotes future research on robust vision-language models for open-water applications. Appendix and supplementary materials are available at https://lxixim.github.io/MARINER.
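[31] 3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding cs.CV | cs.AI | cs.LG | cs.RO
Makanjuola Ogunleye, Eman Abdelrahman, Ismini Lourentzou
TL;DR: This paper proposes 3D-VCD, an inference-time visual contrastive decoding framework for mitigating hallucination in 3D embodied agents. The method builds a distorted 3D context by applying semantic and geometric perturbations to object-centric 3D scene representations (such as scene graphs). By contrasting predictions under the original and distorted contexts, it suppresses tokens that are insensitive to real scene evidence and therefore likely driven by language priors, making reasoning grounded in 3D environments more reliable.
Details
Motivation: Large multimodal models used as the reasoning core of embodied agents in 3D environments remain prone to hallucinations that lead to unsafe and ungrounded decisions. Existing inference-time mitigation methods mainly target 2D vision-language settings and do not transfer to embodied 3D reasoning, where failures arise from object presence, spatial layout, and geometric grounding rather than pixel-level inconsistencies.
Result: On the 3D-POPE and HEAL benchmarks, 3D-VCD consistently improves grounded reasoning without any retraining, establishing inference-time contrastive decoding over structured 3D representations as an effective and practical route to more reliable embodied intelligence.
Insight: The novelty is the first application of inference-time contrastive decoding to hallucination mitigation in 3D embodied agents. Its core idea is to create negative samples by semantically and geometrically perturbing the 3D scene graph, which directly suppresses generations that are insensitive to scene evidence at decoding time. The result is a practical, plug-and-play method that improves reliability in 3D environments without retraining.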
Abstract: Large multimodal models are increasingly used as the reasoning core of embodied agents operating in 3D environments, yet they remain prone to hallucinations that can produce unsafe and ungrounded decisions. Existing inference-time hallucination mitigation methods largely target 2D vision-language settings and do not transfer to embodied 3D reasoning, where failures arise from object presence, spatial layout, and geometric grounding rather than pixel-level inconsistencies. We introduce 3D-VCD, the first inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents. 3D-VCD constructs a distorted 3D scene graph by applying semantic and geometric perturbations to object-centric representations, such as category substitutions and coordinate or extent corruption. By contrasting predictions under the original and distorted 3D contexts, our method suppresses tokens that are insensitive to grounded scene evidence and are therefore likely driven by language priors. We evaluate 3D-VCD on the 3D-POPE and HEAL benchmarks and show that it consistently improves grounded reasoning without any retraining, establishing inference-time contrastive decoding over structured 3D representations as an effective and practical route to more reliable embodied intelligence.
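Illustrative sketch (assumed formulation): the abstract does not give the contrast rule, so this example borrows the form commonly used in visual contrastive decoding, `(1 + alpha) * original - alpha * distorted`, and applies it to made-up logits. The scene layout, `perturb_scene`, and `alpha` are hypothetical; the logits stand in for model outputs under the original and distorted scene graphs.

```python
def perturb_scene(objects, category_swap, offset):
    """Return a distorted copy of an object-centric scene description.

    Categories are substituted via `category_swap`, and every center
    coordinate is shifted by `offset` to corrupt the geometry.
    """
    distorted = []
    for obj in objects:
        x, y, z = obj["center"]
        distorted.append({
            "category": category_swap.get(obj["category"], obj["category"]),
            "center": (x + offset, y + offset, z + offset),
        })
    return distorted


def contrastive_logits(original, distorted, alpha=1.0):
    """Boost tokens whose score depends on the undistorted scene."""
    return {
        tok: (1 + alpha) * original[tok] - alpha * distorted[tok]
        for tok in original
    }


scene = [{"category": "chair", "center": (1.0, 0.0, 2.0)}]
distorted_scene = perturb_scene(scene, {"chair": "table"}, offset=0.5)

# Question: "Is there a table in the room?" (made-up logits).
# "yes" scores the same under both scenes, so it is likely prior-driven.
logits_original = {"yes": 2.0, "no": 1.8}
logits_distorted = {"yes": 2.0, "no": 0.6}
adjusted = contrastive_logits(logits_original, logits_distorted)
print(max(logits_original, key=logits_original.get), "->", max(adjusted, key=adjusted.get))
```

In this toy case the prior-driven "yes" wins under plain decoding, while the contrast favors "no" because only that token reacts to the scene evidence.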
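[32] InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation cs.CV
Zhefan Rao, Bin Zou, Haoxuan Che, Xuanhua He, Chong Hou Choi
TL;DR: This paper presents InsEdit, an instruction-based video editing model that adapts a video generation model (HunyuanVideo-1.5) into a video editor in a data-efficient way. The method combines a visual editing architecture with a video data pipeline based on Mutual Context Attention (MCA). With only about 100K video editing samples, it achieves state-of-the-art results among open-source methods on video instruction editing benchmarks, and it also supports image editing.
Details
Motivation: Instruction-based video editing usually requires large amounts of video editing data, while high-quality video editing data is scarce. The goal is to turn a video generation model into a strong video editor in a data-efficient way.
Result: On video instruction editing benchmarks, InsEdit achieves state-of-the-art results among open-source methods using only about 100K video editing samples.
Insight: The novelty lies in combining a visual editing architecture with an MCA-based data pipeline that creates aligned video pairs in which edits can begin in the middle of a clip rather than only at the first frame, improving data efficiency and editing flexibility. Because the training recipe also includes image editing data, the model supports image editing without modification.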
Abstract: Instruction-based video editing is a natural way to control video content with text, but adapting a video generation model into an editor usually appears to be data-hungry. At the same time, high-quality video editing data remains scarce. In this paper, we show that a video generation backbone can become a strong video editor without large-scale video editing data. We present InsEdit, an instruction-based editing model built on HunyuanVideo-1.5. InsEdit combines a visual editing architecture with a video data pipeline based on Mutual Context Attention (MCA), which creates aligned video pairs where edits can begin in the middle of a clip rather than only from the first frame. With only O(100)K video editing data, InsEdit achieves state-of-the-art results among open-source methods on our video instruction editing benchmarks. In addition, because our training recipe also includes image editing data, the final model supports image editing without any modification.
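[33] Unified Multimodal Uncertain Inference cs.CV | cs.LG
Dengjia Zhang, Alexander Martin, William Jurayj, Kenton Murray, Benjamin Van Durme
TL;DR: This paper introduces the Unified Multimodal Uncertain Inference (UMUI) task, spanning text, audio, and video, in which models must produce calibrated probability estimates of hypotheses conditioned on a premise in any modality or combination of modalities. To support the task, the authors curate a human-annotated evaluation set and propose CLUE, which combines self-consistent teacher calibration with distribution-based confidence probing to produce calibrated predictions. Experiments show that their 3B-parameter model matches or outperforms baselines of up to 32B parameters across all modalities.
Details
Motivation: Uncertain inference has been studied mainly in text, and extensions to other modalities have been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning across modalities.
Result: On the curated human-annotated evaluation set covering audio, visual, and audiovisual settings, and on existing text and audio benchmarks, the proposed 3B-parameter CLUE model matches or outperforms baselines of up to 32B parameters across all modalities.
Insight: The novelty lies in defining a unified multimodal uncertain inference task and proposing CLUE, which combines self-consistent teacher calibration with distribution-based confidence probing to deliver calibrated probability estimates that are parameter-efficient and strong across modalities.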
Abstract: We introduce Unified Multimodal Uncertain Inference (UMUI), a multimodal inference task spanning text, audio, and video, where models must produce calibrated probability estimates of hypotheses conditioned on a premise in any modality or combination. While uncertain inference has been explored in text, extension to other modalities has been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning in or across other modalities. To address this, we curate a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings, and additionally evaluate on existing text and audio benchmarks. We introduce CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration and distribution-based confidence probing to produce calibrated predictions. We demonstrate that our 3B-parameter model achieves equivalent or stronger performance than baselines up to 32B parameters across all modalities.
[34] Accelerating Transformer-Based Monocular SLAM via Geometric Utility Scoring cs.CV | cs.AI | cs.ROPDF
Xinmiao Xiong, Bangya Liu, Hao Wang, Dayou Li, Nuo Chen
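TL;DR: This paper proposes LeanGate, a lightweight feed-forward frame-gating network for accelerating Transformer-based monocular SLAM systems. By predicting a geometric utility score, it assesses a frame's mapping value before the expensive geometric foundation model feature extraction and matching stages, skipping over 90% of redundant frames and substantially reducing computational overhead.
Details
Motivation: Existing SLAM systems built on geometric foundation models carry significant computational redundancy when deployed on dense video streams. Because they rely on post hoc keyframe selection, they must still run expensive dense geometric decoding before they can tell whether a frame contains new geometry, which wastes computation and rejects frames late.
Result: On standard SLAM benchmarks, LeanGate reduces tracking FLOPs by more than 85% and achieves a 5x end-to-end throughput speedup while maintaining tracking and mapping accuracy comparable to dense baselines.
Insight: The contribution is a predictive plug-and-play module that assesses a frame's geometric utility at an early stage, avoiding unnecessary heavy computation and using compute efficiently without sacrificing SLAM performance.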
Abstract: Geometric Foundation Models (GFMs) have recently advanced monocular SLAM by providing robust, calibration-free 3D priors. However, deploying these models on dense video streams introduces significant computational redundancy. Current GFM-based SLAM systems typically rely on post hoc keyframe selection. Because of this, they must perform expensive dense geometric decoding simply to determine whether a frame contains novel geometry, resulting in late rejection and wasted computation. To mitigate this inefficiency, we propose LeanGate, a lightweight feed-forward frame-gating network. LeanGate predicts a geometric utility score to assess a frame’s mapping value prior to the heavy GFM feature extraction and matching stages. As a predictive plug-and-play module, our approach bypasses over 90% of redundant frames. Evaluations on standard SLAM benchmarks demonstrate that LeanGate reduces tracking FLOPs by more than 85% and achieves a 5x end-to-end throughput speedup. Furthermore, it maintains the tracking and mapping accuracy of dense baselines.
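The gating idea in this abstract (score each frame cheaply, then run the heavy model only on frames that pass) can be sketched as follows. This is a minimal illustration, not LeanGate itself: `gate_frames`, `toy_score`, and the `0.5` threshold are hypothetical names and values, and the scorer stands in for the learned utility network, which the abstract does not specify.

```python
from typing import Callable, List, Sequence, Tuple


def gate_frames(
    frames: Sequence[float],
    score_fn: Callable[[float], float],
    threshold: float = 0.5,
) -> Tuple[List[int], int]:
    """Return indices of frames whose utility score passes the threshold,
    plus the number of frames skipped before any heavy processing."""
    kept: List[int] = []
    skipped = 0
    for idx, frame in enumerate(frames):
        # The cheap score is computed first; only frames that pass
        # would be handed to the expensive feature extractor.
        if score_fn(frame) >= threshold:
            kept.append(idx)
        else:
            skipped += 1
    return kept, skipped


def toy_score(frame: float) -> float:
    """Hypothetical scorer: each toy 'frame' is already a novelty value."""
    return frame


frames = [0.9, 0.1, 0.05, 0.2, 0.8, 0.1, 0.0, 0.3, 0.1, 0.7]
kept, skipped = gate_frames(frames, toy_score, threshold=0.5)
print(kept, skipped)
```

In this toy stream, seven of ten frames never reach the heavy stage. The real saving depends on how cheap the scorer is relative to the downstream model.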
[35] LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving cs.CV | cs.AI | cs.ROPDF
Hao Shao, Letian Wang, Yang Zhou, Yuxuan Hu, Zhuofan Zong
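TL;DR: LMGenDrive is an end-to-end autonomous driving framework that is the first to combine LLM-based multimodal understanding with a generative world model. Given multi-view camera inputs and natural-language instructions, it generates both future driving videos and control signals.
Details
Motivation: To address the generalization bottleneck of autonomous driving in long-tail and open-world scenarios, and inspired by the way human intelligence unifies understanding and imagination, the work aims to build a unified model that combines scene understanding with imagining the future to improve the robustness and generalization of decision-making.
Result: On challenging closed-loop benchmarks, LMGenDrive significantly outperforms prior methods, with clear gains in instruction following, spatio-temporal understanding, and robustness to rare scenarios.
Insight: The core contribution is the complementary combination of the LLM's strong semantic priors and instruction understanding with the generative world model's spatio-temporal scene modeling, together with a progressive three-stage training strategy that improves stability and performance. This points to a new direction for more general and robust embodied decision-making systems.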
Abstract: Recent years have seen remarkable progress in autonomous driving, yet generalization to long-tail and open-world scenarios remains a major bottleneck for large-scale deployment. To address this challenge, some works use LLMs and VLMs for vision-language understanding and reasoning, enabling vehicles to interpret rare and safety-critical situations when generating actions. Others study generative world models to capture the spatio-temporal evolution of driving scenes, allowing agents to imagine possible futures before acting. Inspired by human intelligence, which unifies understanding and imagination, we explore a unified model for autonomous driving. We present LMGenDrive, the first framework that combines LLM-based multimodal understanding with generative world models for end-to-end closed-loop driving. Given multi-view camera inputs and natural-language instructions, LMGenDrive generates both future driving videos and control signals. This design provides complementary benefits: video prediction improves spatio-temporal scene modeling, while the LLM contributes strong semantic priors and instruction grounding from large-scale pretraining. We further propose a progressive three-stage training strategy, from vision pretraining to multi-step long-horizon driving, to improve stability and performance. LMGenDrive supports both low-latency online planning and autoregressive offline video generation. Experiments show that it significantly outperforms prior methods on challenging closed-loop benchmarks, with clear gains in instruction following, spatio-temporal understanding, and robustness to rare scenarios. These results suggest that unifying multimodal understanding and generation is a promising direction for more generalizable and robust embodied decision-making systems.
[36] AI Driven Soccer Analysis Using Computer Vision cs.CV | cs.AIPDF
Adrian Manchado, Tanner Cellio, Jonathan Keane, Yiyang Wang
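TL;DR: This paper proposes an AI-driven, computer-vision-based approach to soccer analysis. By combining object detection, keypoint detection, and segmentation-based tracking, it transforms player positions and field keypoints in match footage from the camera perspective to real-world coordinates, from which it computes tactical insights such as player speed, distance covered, and positioning heatmaps.
Details
Motivation: Sports analysis is crucial to team performance, but traditional video analysis struggles to extract complex features and precise quantitative data. The paper aims to use computer vision to automatically identify and track key entities (such as players) in match footage and map them to real field coordinates, providing deeper and more actionable performance data.
Result: The paper evaluates the accuracy of object detection models such as YOLO and Faster R-CNN on custom video footage and plans to pair the best model with SAM2 for segmentation and tracking. Keypoint detection uses a CNN model, and a homography transforms the camera perspective into a real-ground perspective, from which metrics such as player speed, distance covered, and positioning heatmaps are computed.
Insight: The contribution is the integration of object detection (YOLO / Faster R-CNN), a segmentation model (SAM2), and keypoint detection (CNN) into a single pipeline through homography, giving a robust transformation from camera perspective to real-world coordinates and automating complex tactical statistics that standard video analysis could not previously provide.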
Abstract: Sport analysis is crucial for team performance since it provides actionable data that can inform coaching decisions, improve player performance, and enhance team strategies. To analyze more complex features from game footage, a computer vision model can be used to identify and track key entities from the field. We propose the use of an object detection and tracking system to predict player positioning throughout the game. To translate this to positioning in relation to the field dimensions, we use a point prediction model to identify key points on the field and combine these with known field dimensions to extract actual distances. For the player-identification model, object detection models like YOLO and Faster R-CNN are evaluated on the accuracy of our custom video footage using multiple different evaluation metrics. The goal is to identify the best model for object identification to obtain the most accurate results when paired with SAM2 (Segment Anything Model 2) for segmentation and tracking. For the key point detection model, we use a CNN model to find consistent locations in the soccer field. Through homography, the positions of points and objects in the camera perspective will be transformed to a real-ground perspective. The segmented player masks from SAM2 are transformed from camera perspective to real-world field coordinates through homography, regardless of camera angle or movement. The transformed real-world coordinates can be used to calculate valuable tactical insights including player speed, distance covered, positioning heatmaps, and more complex team statistics, providing coaches and players with actionable performance data previously unavailable from standard video analysis.
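The homography step described above (mapping image points to field coordinates, then deriving distance and speed) can be sketched as below. This is an illustrative sketch under stated assumptions: the 3x3 matrix `H` is taken as already estimated from field keypoints, and the helper names, the toy matrix (a pure 0.1 m-per-pixel scaling), and the frame rate are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def apply_homography(H: Sequence[Sequence[float]], point: Point) -> Point:
    """Map an image point (x, y) to field coordinates using a 3x3 homography."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to get Cartesian field coordinates.
    return (u / w, v / w)


def track_stats(H: Sequence[Sequence[float]], pixel_track: List[Point],
                fps: float) -> Tuple[float, float]:
    """Return (distance in metres, mean speed in m/s) for one player track."""
    field = [apply_homography(H, p) for p in pixel_track]
    distance = sum(
        math.hypot(b[0] - a[0], b[1] - a[1]) for a, b in zip(field, field[1:])
    )
    duration = (len(field) - 1) / fps
    speed = distance / duration if duration > 0 else 0.0
    return distance, speed


# Toy homography (assumption): 1 pixel corresponds to 0.1 m, no perspective.
H = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 1.0]]
track = [(100.0, 200.0), (130.0, 240.0), (160.0, 280.0)]
distance, speed = track_stats(H, track, fps=2.0)
print(distance, speed)
```

A real broadcast camera would need a full perspective homography, re-estimated as the camera moves. The per-point mapping and the distance and speed arithmetic stay the same.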
[37] State Space Models are Effective Sign Language Learners: Exploiting Phonological Compositionality for Vocabulary-Scale Recognition cs.CVPDF
Bryan Cheng, Austin Jin, Jasper Zhang
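TL;DR: This paper proposes PHONSSM, which tackles vocabulary scaling in sign language recognition by exploiting the phonological compositionality of sign languages (discrete parameters such as handshape, location, movement, and orientation). Using skeleton data alone, it achieves substantial gains on a large ASL dataset and shows strong few-shot and zero-shot transfer.
Details
Motivation: Existing sign language recognition models treat signs as atomic visual patterns and learn flat representations that cannot exploit the compositional structure of sign languages, so performance collapses as the vocabulary grows. Compositional inductive biases that reflect linguistic structure are therefore needed.
Result: On the largest ASL dataset to date (5,565 signs), using skeleton data alone, PHONSSM reaches 72.1% accuracy on WLASL2000 (18.4 percentage points above the skeleton SOTA), surpassing most RGB methods. It achieves a 225% relative gain in the few-shot regime and, in zero-shot transfer to the ASL Citizen dataset, exceeds supervised RGB baselines.
Insight: Contributions include enforcing phonological decomposition through anatomically grounded graph attention, explicit factorization into orthogonal subspaces, and prototypical classification that supports few-shot transfer. The core insight is that the vocabulary scaling bottleneck is fundamentally a representation learning problem, solvable through compositional inductive biases that mirror linguistic structure.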
Abstract: Sign language recognition suffers from catastrophic scaling failure: models achieving high accuracy on small vocabularies collapse at realistic sizes. Existing architectures treat signs as atomic visual patterns, learning flat representations that cannot exploit the compositional structure of sign languages, which are systematically organized from discrete phonological parameters (handshape, location, movement, orientation) reused across the vocabulary. We introduce PHONSSM, enforcing phonological decomposition through anatomically-grounded graph attention, explicit factorization into orthogonal subspaces, and prototypical classification enabling few-shot transfer. Using skeleton data alone on the largest ASL dataset ever assembled (5,565 signs), PHONSSM achieves 72.1% on WLASL2000 (+18.4pp over skeleton SOTA), surpassing most RGB methods without video input. Gains are most dramatic in the few-shot regime (+225% relative), and the model transfers zero-shot to ASL Citizen, exceeding supervised RGB baselines. The vocabulary scaling bottleneck is fundamentally a representation learning problem, solvable through compositional inductive biases mirroring linguistic structure.
[38] InstrAct: Towards Action-Centric Understanding in Instructional Videos cs.CV | cs.AIPDF
Zhuoyi Yang, Jiapeng Yu, Reuben Tan, Boyang Li, Huijuan Xu
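TL;DR: This paper proposes the InstrAction framework to improve action-centric understanding of instructional videos. Through data cleaning, action-aware feature extraction, and multi-task learning, it addresses the difficulty existing video foundation models have in recognizing actions, which stems from static bias and noisy supervision.
Details
Motivation: When processing instructional videos, existing video foundation models rely on noisy web annotations and static object cues, making it hard for them to recognize fine-grained actions and model temporal relations. The paper aims to overcome this "static bias".
Result: On the proposed InstrAct Bench, the method outperforms current state-of-the-art video foundation models on semantic reasoning, procedural logic, and fine-grained retrieval tasks.
Insight: Contributions include data-driven noise filtering with action-centric hard negative generation, an Action Perceiver that extracts motion-relevant tokens, and auxiliary objectives such as Dynamic Time Warping alignment and Masked Action Modeling, which strengthen cross-modal alignment and temporal modeling.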
Abstract: Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, which remains challenging for current Video Foundation Models (VFMs). This difficulty stems from noisy web supervision and a pervasive “static bias”, where models rely on objects rather than motion cues. To address this, we propose InstrAction, a pretraining framework for instructional videos’ action-centric representations. We first introduce a data-driven strategy, which filters noisy captions and generates action-centric hard negatives to disentangle actions from objects during contrastive learning. At the visual feature level, an Action Perceiver extracts motion-relevant tokens from redundant video encodings. Beyond contrastive learning, we introduce two auxiliary objectives: Dynamic Time Warping alignment (DTW-Align) for modeling sequential temporal structure, and Masked Action Modeling (MAM) for strengthening cross-modal grounding. Finally, we introduce the InstrAct Bench to evaluate action-centric understanding, where our method consistently outperforms state-of-the-art VFMs on semantic reasoning, procedural logic, and fine-grained retrieval tasks.
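For readers unfamiliar with the dynamic time warping that DTW-Align builds on, the classic recurrence is sketched below. This shows only the textbook algorithm on scalar sequences. How InstrAction turns it into a training objective is not described in the abstract, so the function name and the absolute-difference cost are assumptions for illustration.

```python
from typing import Callable, Sequence


def dtw_cost(
    a: Sequence[float],
    b: Sequence[float],
    dist: Callable[[float, float], float] = lambda x, y: abs(x - y),
) -> float:
    """Classic dynamic time warping: minimum cumulative cost of a monotonic
    alignment between sequences a and b (both assumed non-empty)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] holds the best cost of aligning a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            # Extend the cheapest of: advance a only, advance b only, advance both.
            D[i][j] = step + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


# A repeated step in the second sequence is absorbed at zero cost.
print(dtw_cost([1, 2, 3], [1, 2, 2, 3]))
```

Because the alignment is monotonic, reordering steps raises the cost while stretching or repeating a step does not, which is the property that makes DTW a natural fit for sequential procedures.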
[39] Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models cs.CVPDF
Sumra Khan, Sagar Chhabriya, Aizan Zafar, Sheeraz Arif, Amgad Muneer
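TL;DR: This paper proposes a context-aligned reasoning framework for medical vision-language models that improves the reliability and interpretability of diagnostic conclusions by integrating multiple kinds of clinical evidence (radiomic statistics, explainability activations, and vocabulary-grounded semantic cues). The framework requires agreement across heterogeneous evidence before a diagnostic conclusion is generated and produces structured outputs (supporting evidence, uncertainty estimates, limitations, and safety notes), reducing hallucinations and making the reasoning more concise.
Details
Motivation: Existing medical vision-language models over-rely on a single dominant modality in radiology tasks and so produce fluent but weakly grounded conclusions. The work aims to make medical multimodal reasoning more responsible, reliable, and interpretable.
Result: Experiments on chest X-ray datasets show that context alignment improves discriminative performance from AUC 0.918 to 0.925 while maintaining calibrated uncertainty. It substantially reduces hallucinated keywords (1.14 to 0.25) and makes reasoning explanations more concise (19.4 to 15.3 words) without increasing model confidence (0.70 to 0.68). Cross-dataset evaluation on CheXpert further reveals how modality informativeness influences reasoning behavior.
Insight: The contribution is a context-aligned reasoning framework that enforces multi-evidence agreement by integrating and verifying several structured contextual signals, improving reliability and trustworthiness while leaving the underlying model architecture unchanged. Its structured output (evidence, uncertainty, and so on) offers a reusable design pattern for responsible, interpretable medical AI systems.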
Abstract: Medical vision-language models (VLMs) show strong performance on radiology tasks but often produce fluent yet weakly grounded conclusions due to over-reliance on a dominant modality. We introduce a context-aligned reasoning framework that enforces agreement across heterogeneous clinical evidence before generating diagnostic conclusions. The proposed approach augments a frozen VLM with structured contextual signals derived from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. Instead of producing free-form responses, the model generates structured outputs containing supporting evidence, uncertainty estimates, limitations, and safety notes. We observe that auxiliary signals alone provide limited benefit; performance gains emerge only when these signals are integrated through contextual verification. Experiments on chest X-ray datasets demonstrate that context alignment improves discriminative performance (AUC 0.918 to 0.925) while maintaining calibrated uncertainty. The framework also substantially reduces hallucinated keywords (1.14 to 0.25) and produces more concise reasoning explanations (19.4 to 15.3 words) without increasing model confidence (0.70 to 0.68). Cross-dataset evaluation on CheXpert further reveals that modality informativeness significantly influences reasoning behavior. These results suggest that enforcing multi-evidence agreement improves both reliability and trustworthiness in medical multimodal reasoning, while preserving the underlying model architecture.
[40] SenBen: Sensitive Scene Graphs for Explainable Content Moderation cs.CV | cs.AI | cs.LG | cs.MMPDF
Fatih Cagatay Akyon, Alptekin Temizel
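TL;DR: This paper introduces SenBen, the first large-scale scene graph benchmark for sensitive content, with detailed annotations for 13,999 movie frames. By distilling a frontier vision-language model, the authors develop a compact 241M-parameter student model whose multi-task training recipe addresses vocabulary imbalance in autoregressive scene graph generation, achieving strong and efficient performance on sensitive content detection and scene graph generation.
Details
Motivation: Existing content moderation systems only classify images as safe or unsafe. They lack spatial grounding and interpretability and cannot explain what sensitive behavior was detected, who is involved, or where it occurs.
Result: On the SenBen benchmark, the student model improves recall by 6.4 percentage points over standard cross-entropy training. On scene graph generation metrics it outperforms all evaluated VLMs except the Gemini models, as well as all commercial safety APIs, while achieving the highest object detection and captioning scores among all models, with 7.6x faster inference and 16x less GPU memory.
Insight: Contributions include: 1) the first scene graph benchmark for sensitive content; 2) a multi-task approach (suffix-based object identity, a vocabulary-aware recall loss, and a decoupled Query2Label tag head) that addresses vocabulary imbalance in autoregressive scene graph generation; 3) a model that substantially improves efficiency while retaining strong performance.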
Abstract: Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes, 28 attributes including affective states such as pain, fear, aggression, and distress, 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at 7.6x faster inference and 16x less GPU memory.
[41] BIAS: A Biologically Inspired Algorithm for Video Saliency Detection cs.CVPDF
Zhao-ji Zhang, Ya-tang Li
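TL;DR: This paper presents BIAS, a fast, biologically inspired model for dynamic visual saliency detection in continuous video streams. Built on the Itti-Koch framework, it adds a retina-inspired motion detector to extract temporal features, producing saliency maps that integrate static and motion information. Foci of attention are identified with a greedy multi-Gaussian peak-fitting algorithm that balances winner-take-all competition with information maximization. BIAS outperforms heuristic methods and several deep-learning models on the DHF1K dataset, particularly on videos dominated by bottom-up attention, and shows practical value in traffic accident analysis, achieving state-of-the-art cause-effect recognition and anticipating accidents up to 0.72 seconds before manual annotation.
Details
Motivation: The work targets fast, interpretable saliency detection in dynamic video, using a biologically inspired model to improve computational efficiency and real-time performance and to address the shortcomings of traditional methods and deep learning in speed and biological plausibility.
Result: On the DHF1K dataset, BIAS outperforms heuristic methods and several deep-learning models, especially on videos dominated by bottom-up attention. In traffic accident analysis it achieves state-of-the-art cause-effect recognition and anticipates accidents up to 0.72 seconds before manual annotation with reliable accuracy.
Insight: Contributions include integrating a retina-inspired motion detector into the Itti-Koch framework to fuse static and dynamic features, and using a greedy multi-Gaussian peak-fitting algorithm that balances competition with information maximization to identify foci of attention more accurately. The model achieves millisecond-scale latency while remaining biologically plausible, offering an efficient solution for real-time applications.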
Abstract: We present BIAS, a fast, biologically inspired model for dynamic visual saliency detection in continuous video streams. Building on the Itti–Koch framework, BIAS incorporates a retina-inspired motion detector to extract temporal features, enabling the generation of saliency maps that integrate both static and motion information. Foci of attention (FOAs) are identified using a greedy multi-Gaussian peak-fitting algorithm that balances winner-take-all competition with information maximization. BIAS detects salient regions with millisecond-scale latency and outperforms heuristic-based approaches and several deep-learning models on the DHF1K dataset, particularly in videos dominated by bottom-up attention. Applied to traffic accident analysis, BIAS demonstrates strong real-world utility, achieving state-of-the-art performance in cause-effect recognition and anticipating accidents up to 0.72 seconds before manual annotation with reliable accuracy. Overall, BIAS bridges biological plausibility and computational efficiency to achieve interpretable, high-speed dynamic saliency detection.
[42] Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance cs.CVPDF
Enyi Shi, Fei Shen, Shuyi Miao, Linxia Zhu, Pengyang Shao
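TL;DR: This paper proposes Precise Shield, a framework that defends against multilingual and multimodal composite attacks by identifying and constraining the key neurons that govern VLLM safety. It first locates safety neurons by contrasting activation patterns on harmful and benign inputs, then uses gradient masking to restrict parameter updates strictly to that subspace, affecting fewer than 0.03% of parameters and improving safety while preserving the model's multilingual and multimodal generalization.
Details
Motivation: In real-world deployments, VLLMs face multilingual and multimodal composite attacks (for example, low-resource-language text paired with harmful images) that easily bypass existing defenses. The work also investigates where safety capability is instantiated within the model and how it is distributed across languages and modalities.
Result: The method substantially improves safety while preserving multilingual and multimodal generalization. Analysis shows a moderate overlap of safety neurons across languages and modalities, which supports zero-shot cross-lingual and cross-modal transfer of safety capabilities.
Insight: The contribution is extending the cross-lingual shared safety neuron mechanism found in pure-text LLMs to VLLMs, and achieving precise safety alignment through neuron-level guidance and extremely sparse parameter updates (<0.03%), which opens a new direction for neuron-level, transfer-based safety enhancement.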
Abstract: In real-world deployments, Vision-Language Large Models (VLLMs) face critical challenges from multilingual and multimodal composite attacks: harmful images paired with low-resource language texts can easily bypass defenses designed for high-resource language scenarios, exposing structural blind spots in current cross-lingual and cross-modal safety methods. This raises a mechanistic question: where is safety capability instantiated within the model, and how is it distributed across languages and modalities? Prior studies on pure-text LLMs have identified cross-lingual shared safety neurons, suggesting that safety may be governed by a small subset of critical neurons. Leveraging this insight, we propose Precise Shield, a two-stage framework that first identifies safety neurons by contrasting activation patterns between harmful and benign inputs, and then constrains parameter updates strictly within this subspace via gradient masking, affecting fewer than 0.03% of parameters. This strategy substantially improves safety while preserving multilingual and multimodal generalization. Further analysis reveals a moderate overlap of safety neurons across languages and modalities, enabling zero-shot cross-lingual and cross-modal transfer of safety capabilities, and offering a new direction for neuron-level, transfer-based safety enhancement.
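The two stages named in this abstract (contrast activations to pick neurons, then mask gradients so only those neurons are updated) can be sketched on toy data. This is a simplified sketch under stated assumptions: the selection rule (largest gap in mean activation), the one-weight-per-neuron layout, and the names `select_safety_neurons` and `masked_sgd_step` are hypothetical, since the abstract does not give the exact criterion or update rule.

```python
from typing import List, Sequence, Set


def select_safety_neurons(
    harmful_acts: Sequence[Sequence[float]],
    benign_acts: Sequence[Sequence[float]],
    top_k: int,
) -> Set[int]:
    """Pick the top_k neurons whose mean activation differs most between
    harmful and benign inputs (assumed selection criterion)."""
    n = len(harmful_acts[0])

    def mean(rows: Sequence[Sequence[float]], j: int) -> float:
        return sum(r[j] for r in rows) / len(rows)

    gaps = [abs(mean(harmful_acts, j) - mean(benign_acts, j)) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: gaps[j], reverse=True)
    return set(ranked[:top_k])


def masked_sgd_step(
    weights: Sequence[float],
    grads: Sequence[float],
    safety_neurons: Set[int],
    lr: float = 0.1,
) -> List[float]:
    """Apply an SGD step only to selected neurons; all other weights are frozen."""
    return [
        w - lr * g if j in safety_neurons else w
        for j, (w, g) in enumerate(zip(weights, grads))
    ]


# Toy activations: rows are inputs, columns are neurons.
harmful = [[1.0, 0.2, 0.9], [0.8, 0.1, 1.1]]
benign = [[0.1, 0.2, 1.0], [0.3, 0.1, 1.0]]
neurons = select_safety_neurons(harmful, benign, top_k=1)
updated = masked_sgd_step([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], neurons, lr=0.1)
print(neurons, updated)
```

In a deep-learning framework the same effect is usually obtained by multiplying each gradient tensor by a binary mask before the optimizer step. The frozen weights are what preserve the model's general abilities.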
[43] HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing cs.CV | cs.AIPDF
Xinyu Zhang, Zurong Mai, Qingmei Li, Zjin Liao, Yibin Wen
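TL;DR: This paper introduces HM-Bench, the first benchmark for multimodal large language models on hyperspectral remote sensing image understanding. It comprises a large-scale question-answering dataset and a dual-modality evaluation framework for systematically assessing model capabilities in hyperspectral image perception and reasoning.
Details
Motivation: MLLMs have made notable progress in natural image understanding, but their ability to handle hyperspectral remote sensing images, with their high dimensionality and complex spectral-spatial properties, remains underexplored, and no dedicated benchmark exists.
Result: Extensive evaluation of 18 representative MLLMs on a dataset of 19,337 question-answer pairs across 13 task categories shows that models struggle significantly with complex spectral-spatial reasoning tasks and that visual inputs generally outperform textual inputs.
Insight: The contribution is the first hyperspectral remote sensing benchmark for MLLMs, together with a dual-modality evaluation framework that converts hyperspectral data into PCA-based composite images and structured textual reports so that the effect of different representations on model performance can be compared systematically. The results underline the importance of spectral-spatial evidence for effective understanding.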
Abstract: While multimodal large language models (MLLMs) have made significant strides in natural image understanding, their ability to perceive and reason over hyperspectral images (HSI), a vital modality in remote sensing, remains underexplored. The high dimensionality and intricate spectral-spatial properties of HSI pose unique challenges for models primarily trained on RGB data. To address this gap, we introduce the Hyperspectral Multimodal Benchmark (HM-Bench), the first benchmark designed specifically to evaluate MLLMs in HSI understanding. We curate a large-scale dataset of 19,337 question-answer pairs across 13 task categories, ranging from basic perception to spectral reasoning. Given that existing MLLMs are not equipped to process raw hyperspectral cubes natively, we propose a dual-modality evaluation framework that transforms HSI data into two complementary representations: PCA-based composite images and structured textual reports. This approach facilitates a systematic comparison of how different representations affect model performance. Extensive evaluations on 18 representative MLLMs reveal significant difficulties in handling complex spatial-spectral reasoning tasks. Furthermore, our results demonstrate that visual inputs generally outperform textual inputs, highlighting the importance of grounding in spectral-spatial evidence for effective HSI understanding. Dataset and appendix can be accessed at https://github.com/HuoRiLi-Yu/HM-Bench.
[44] GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing cs.CVPDF
Aoran Xiao, Shihao Cheng, Yonghao Xu, Yexian Ren, Hongruixuan Chen
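TL;DR: Addressing the challenges of developing multimodal large language models for geoscience and remote sensing, this paper introduces the GeoMMBench benchmark and the GeoMMAgent multi-agent framework. GeoMMBench is a comprehensive multimodal question-answering benchmark covering multiple remote sensing disciplines, sensors, and tasks, used to evaluate models on domain knowledge, perceptual grounding, and reasoning. An evaluation of 36 open-source and proprietary models reveals systematic deficiencies in expert-level geospatial interpretation. GeoMMAgent is a multi-agent framework that integrates domain-specific remote sensing models and tools, strategically combining retrieval, perception, and reasoning to significantly improve performance on complex geoscience and remote sensing challenges.
Details
Motivation: The development of multimodal large language models in geoscience and remote sensing faces distinctive challenges: wide-ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. Existing benchmarks and models fall short in these respects and cannot meet the demands of expert-level geospatial interpretation.
Result: Evaluating 36 open-source and proprietary large language models on GeoMMBench reveals systematic deficiencies in domain knowledge, perceptual grounding, and reasoning. Experiments show that GeoMMAgent significantly outperforms standalone LLMs, underscoring the importance of tool-augmented agents for dynamically tackling complex geoscience and remote sensing challenges.
Insight: The contribution is a comprehensive multimodal question-answering benchmark dedicated to geoscience and remote sensing (GeoMMBench) and a multi-agent framework that strengthens retrieval, perception, and reasoning by integrating domain-specific models and tools (GeoMMAgent). Together they offer a new evaluation paradigm and solution for domain-oriented AI and highlight the key role of tool integration and domain knowledge in reaching expert-level multimodal intelligence.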
Abstract: Recent advances in multimodal large language models (MLLMs) have accelerated progress in domain-oriented AI, yet their development in geoscience and remote sensing (RS) remains constrained by distinctive challenges: wide-ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. To bridge these gaps, we introduce GeoMMBench, a comprehensive multimodal question-answering benchmark covering diverse RS disciplines, sensors, and tasks, enabling broader and more rigorous evaluation than prior benchmarks. Using GeoMMBench, we assess 36 open-source and proprietary large language models, uncovering systematic deficiencies in domain knowledge, perceptual grounding, and reasoning, all of which are essential for expert-level geospatial interpretation. Beyond evaluation, we propose GeoMMAgent, a multi-agent framework that strategically integrates retrieval, perception, and reasoning through domain-specific RS models and tools. Extensive experimental results demonstrate that GeoMMAgent significantly outperforms standalone LLMs, underscoring the importance of tool-augmented agents for dynamically tackling complex geoscience and RS challenges.
[45] Large-Scale Universal Defect Generation: Foundation Models and Datasets cs.CV | cs.AIPDF
Yuanting Fan, Jun Liu, Bin-Bin Gao, Xiaochen Chen, Yuhuan Lin
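TL;DR: This paper introduces the large-scale UDG dataset and UniDG, a universal defect generation foundation model, addressing the overfitting, poor generalization, and limited realism that data scarcity causes in existing defect generation methods. UniDG supports both reference-based and text-instruction-based defect editing without per-category fine-tuning and, through adaptive cropping, structured inputs, and a two-stage training strategy, outperforms existing methods on multiple benchmarks.
Details
Motivation: Existing defect generation methods rely on few-shot learning. Lacking large-scale paired defect editing data, they easily overfit to specific defect categories, and substantial variation in defect scale and morphology leads to limited generalization, degraded realism, and poor category consistency.
Result: Extensive experiments on the MVTec-AD and VisA benchmarks show that UniDG outperforms prior few-shot anomaly generation and image insertion/editing baselines in synthesis quality and in downstream single- and multi-class anomaly detection and localization.
Insight: Contributions include: 1) the large-scale UDG dataset (300K quadruplets); 2) UniDG, a universal foundation model supporting both reference-based and text-instruction editing; 3) Defect-Context Editing (adaptive cropping and a structured diptych input) with MM-DiT multimodal attention to fuse conditions; 4) a two-stage training strategy (Diversity-SFT followed by Consistency-RFT) that balances diversity and realism. Viewed objectively, the work shifts defect generation from a category-specific few-shot paradigm to a large-scale universal foundation model, a notable change of paradigm with practical value.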
Abstract: Existing defect/anomaly generation methods often rely on few-shot learning, which overfits to specific defect categories due to the lack of large-scale paired defect editing data. This issue is aggravated by substantial variations in defect scale and morphology, resulting in limited generalization, degraded realism, and poor category consistency. We address these challenges by introducing UDG, a large-scale dataset of 300K normal-abnormal-mask-caption quadruplets spanning diverse domains, and by presenting UniDG, a universal defect generation foundation model that supports both reference-based defect generation and text instruction-based defect editing without per-category fine-tuning. UniDG performs Defect-Context Editing via adaptive defect cropping and a structured diptych input format, and fuses reference and target conditions through MM-DiT multimodal attention. A two-stage training strategy, Diversity-SFT followed by Consistency-RFT, further improves diversity while enhancing realism and reference consistency. Extensive experiments on MVTec-AD and VisA show that UniDG outperforms prior few-shot anomaly generation and image insertion/editing baselines in synthesis quality and downstream single- and multi-class anomaly detection/localization. Code will be available at https://github.com/RetoFan233/UniDG.
[46] TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction cs.CVPDF
Ao Li, Yonggen Ling, Yiyang Lin, Yuji Wang, Yong Deng
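TL;DR: The paper proposes TAIHRI, a vision-language model designed for close-range human-robot interaction. By understanding users' motion commands, it directs the robot's attention to task-relevant body parts, achieves precise 3D keypoint localization, and adapts to downstream tasks such as natural language control or human mesh recovery.
Details
Motivation: Conventional 3D human keypoint estimation methods focus mainly on whole-body reconstruction quality, whereas in practical human-robot interaction robots need precise metric-scale spatial localization of task-relevant body parts in the egocentric camera's 3D coordinate frame.
Result: On egocentric interaction benchmarks, TAIHRI achieves superior estimation accuracy for task-critical body parts.
Insight: The contribution is the first vision-language model tailored to close-range HRI perception. By quantizing 3D keypoints into a finite interaction space and reasoning over 2D keypoints via next-token prediction, it precisely localizes the 3D coordinates of critical body parts, achieving task-aware 3D human keypoint localization.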
Abstract: Accurate 3D human keypoint localization is a critical technology enabling robots to achieve natural and safe physical interaction with users. Conventional 3D human keypoint estimation methods primarily focus on the whole-body reconstruction quality relative to the root joint. However, in practical human-robot interaction (HRI) scenarios, robots are more concerned with the precise metric-scale spatial localization of task-relevant body parts in the egocentric camera's 3D coordinate frame. We propose TAIHRI, the first Vision-Language Model (VLM) tailored for close-range HRI perception, capable of understanding users’ motion commands and directing the robot’s attention to the most task-relevant keypoints. By quantizing 3D keypoints into a finite interaction space, TAIHRI precisely localizes the 3D spatial coordinates of critical body parts by 2D keypoint reasoning via next token prediction, and seamlessly adapts to downstream tasks such as natural language control or global space human mesh recovery. Experiments on egocentric interaction benchmarks demonstrate that TAIHRI achieves superior estimation accuracy for task-critical body parts. We believe TAIHRI opens new research avenues in the field of embodied human-robot interaction. Code is available at: https://github.com/Tencent/TAIHRI.
[47] Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios cs.CVPDF
Yu Shi, Yu Liu, Zhong-Cheng Wu, Juan Cheng, Huafeng Li
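TL;DR: This paper proposes an efficient, degradation-aware diffusion framework for multimodal image fusion under arbitrary degradation scenarios. The method performs implicit denoising by directly regressing the fused image and uses a joint observation model correction mechanism that imposes degradation and fusion constraints simultaneously during sampling, yielding robust fusion under complex degradations.
Details
Motivation: Real-world image fusion involves complex degradations (such as noise, blur, and low resolution) that limit the performance and practicality of existing methods. The work also aims to bridge the gap between end-to-end neural network methods (efficient but poorly interpretable) and diffusion-based methods (which provide generative priors but are hard to apply directly to multi-source modeling tasks that lack natural fused data).
Result: Experiments across multiple fusion tasks and degradation configurations show that the method is superior under complex degradation scenarios.
Insight: Contributions are: 1) an implicit denoising paradigm that regresses the fused image directly instead of predicting noise, allowing flexible adaptation to diverse fusion tasks under complex degradations and inference in a limited number of steps; 2) a joint observation model correction mechanism that imposes degradation and fusion constraints simultaneously during sampling to ensure high reconstruction accuracy. This offers a new approach to applying diffusion models to multimodal fusion tasks that lack an explicit target distribution.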
Abstract: Complex degradations like noise, blur, and low resolution are typical challenges in real-world image fusion tasks, limiting the performance and practicality of existing methods. End-to-end neural-network-based approaches are generally simple to design and highly efficient in inference, but their black-box nature leads to limited interpretability. Diffusion-based methods alleviate this to some extent by providing powerful generative priors and a more structured inference process. However, they are trained to learn a single-domain target distribution, whereas fusion lacks natural fused data and relies on modeling complementary information from multiple sources, making diffusion hard to apply directly in practice. To address these challenges, this paper proposes an efficient degradation-aware diffusion framework for image fusion under arbitrary degradation scenarios. Specifically, instead of explicitly predicting noise as in conventional diffusion models, our method performs implicit denoising by directly regressing the fused image, enabling flexible adaptation to diverse fusion tasks under complex degradations with limited steps. Moreover, we design a joint observation model correction mechanism that simultaneously imposes degradation and fusion constraints during sampling to ensure high reconstruction accuracy. Experiments on diverse fusion tasks and degradation configurations demonstrate the superiority of the proposed method under complex degradation scenarios.
[48] M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation Model cs.CVPDF
Yihang Liu, Ying Wen, Jiaxiong Yang, Longzhen Yang, Lianghua He
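TL;DR: This paper proposes M-IDoL, a self-supervised medical foundation model that learns modality-specific and diverse representations through information decomposition. It achieves modality specificity by maximizing inter-modality entropy, dispersing multimodal representations into separable Mixture-of-Experts subspaces, and it enriches intra-modality diversity by minimizing intra-modality uncertainty through fine-grained semantic discrimination within each subspace. Pre-trained on 1.15 million medical images, the model shows superior generalization on 21 downstream clinical tasks and outperforms 20 existing foundation models across five imaging modalities.
Details
Motivation: Existing medical foundation models suffer from information ambiguity: they blend multimodal representations in a single embedding space, which degrades modality specificity and diversity.
Result: On 21 downstream clinical tasks across five imaging modalities (X-ray, fundus, OCT, dermoscopy, and pathology), M-IDoL outperforms 20 existing foundation models and reaches SOTA. It learns modality-specific and diverse representations, with clearer separation of feature clusters across modalities and finer-grained feature discrimination within each modality.
Insight: The contribution is an information decomposition framework that combines two objectives, maximizing inter-modality entropy and minimizing intra-modality uncertainty, and uses Mixture-of-Experts subspaces to disentangle and strengthen multimodal representations. The approach may carry over to other multimodal learning settings that need more specific and diverse representations.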
Abstract: Medical foundation models (MFMs) aim to learn universal representations from multimodal medical images that can generalize effectively to diverse downstream clinical tasks. However, most existing MFMs suffer from information ambiguity that blends multimodal representations in a single embedding space, leading to the degradation of modality specificity and diversity. In this paper, we propose M-IDoL, a self-supervised MFM that introduces Information Decomposition for multimodal representation Learning via two objectives: i) maximize inter-modality entropy by dispersing multimodal representation into separable Mixture-of-Experts (MoE) subspaces to achieve representation specificity across modalities; and ii) minimize intra-modality uncertainty by performing fine-grained semantic discrimination within each MoE subspace to enrich representation diversity per modality. By pre-training on 1.15 million medical images, M-IDoL i) delivers superior generalization across 21 downstream clinical tasks, outperforming 20 foundation models on five imaging modalities (X-ray, fundus, OCT, dermoscopy and pathology), and ii) learns modality-specific and diverse representations, showing clearer separation of feature clusters across modalities and finer-grained feature discrimination within each modality.
[49] MASS: Mesh-inellipse Aligned Deformable Surfel Splatting for Hand Reconstruction and Rendering from Egocentric Monocular Video cs.CV | cs.ROPDF
Haoyu Zhu, Yi Zhang, Lei Yao, Lap-pui Chau, Yi Wang
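TL;DR: This paper proposes MASS, a new method for reconstructing high-fidelity 3D hand models from monocular egocentric video. It uses a deformable 2D Gaussian surfel representation, generates high-resolution surfels from a coarse parametric hand mesh via a mesh-aligned Steiner inellipse and fractal densification, and combines Gaussian surfel deformation, a two-stage training strategy, and a binding loss to model hand deformation and personalized features efficiently, achieving high-quality reconstruction and rendering.
Details
Motivation: Reconstructing high-fidelity 3D hands from monocular egocentric video is challenging because of the difficulty of capturing high-resolution geometric detail, hand-object interactions, and complex objects on hands. Existing methods are also computationally expensive, which makes real-time use impractical.
Result: Extensive experiments on the ARCTIC, Hand Appearance, and Interhand2.6M datasets show that the method achieves better reconstruction performance than existing state-of-the-art methods.
Insight: Contributions include: 1) mesh-to-surfel conversion using a mesh-aligned Steiner inellipse and fractal densification, giving a surface representation with photorealistic rendering potential; 2) Gaussian Surfel Deformation, which refines geometry and texture by predicting residual updates to surfel attributes and using an opacity mask, without adaptive density control; 3) a two-stage training strategy and a novel binding loss that improve optimization robustness and reconstruction quality.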
Abstract: Reconstructing high-fidelity 3D hands from egocentric monocular videos remains a challenge due to the limitations in capturing high-resolution geometry, hand-object interactions, and complex objects on hands. Additionally, existing methods often incur high computational costs, making them impractical for real-time applications. In this work, we propose Mesh-inellipse Aligned deformable Surfel Splatting (MASS) to address these challenges by leveraging a deformable 2D Gaussian Surfel representation. First, we introduce the mesh-aligned Steiner Inellipse and fractal densification for mesh-to-surfel conversion, which initializes high-resolution 2D Gaussian surfels from coarse parametric hand meshes, providing a surface representation with photorealistic rendering potential. Second, we propose Gaussian Surfel Deformation, which enables efficient modeling of hand deformations and personalized features by predicting residual updates to surfel attributes and introducing an opacity mask to refine geometry and texture without adaptive density control. In addition, we propose a two-stage training strategy and a novel binding loss to improve the optimization robustness and reconstruction quality. Extensive experiments on the ARCTIC dataset, the Hand Appearance dataset, and the Interhand2.6M dataset demonstrate that our model achieves superior reconstruction performance compared to state-of-the-art methods.
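As background for the Steiner inellipse mentioned in this abstract: it is the unique ellipse inscribed in a triangle that touches each side at its midpoint, and it is centred at the triangle's centroid. The sketch below covers only that classical geometry. How MASS converts the ellipse into surfel parameters is not stated in the abstract, and the function names are hypothetical.

```python
import math
from typing import Tuple

Vec = Tuple[float, ...]


def steiner_inellipse(a: Vec, b: Vec, c: Vec) -> Tuple[Vec, Vec, Vec]:
    """Return (center, u, v) such that center + u*cos(t) + v*sin(t)
    traces the Steiner inellipse of triangle abc (any dimension)."""
    center = tuple((pa + pb + pc) / 3.0 for pa, pb, pc in zip(a, b, c))
    # Conjugate half-diameters: half the centroid-to-vertex vector, and the
    # opposite edge direction scaled by 1 / (2 * sqrt(3)).
    u = tuple((pa - g) / 2.0 for pa, g in zip(a, center))
    k = 1.0 / (2.0 * math.sqrt(3.0))
    v = tuple((pb - pc) * k for pb, pc in zip(b, c))
    return center, u, v


def point_on_ellipse(center: Vec, u: Vec, v: Vec, t: float) -> Vec:
    """Evaluate the ellipse parametrization at angle t."""
    return tuple(
        g + pu * math.cos(t) + pv * math.sin(t) for g, pu, pv in zip(center, u, v)
    )


# Triangle in 3D, as a mesh face would be.
A, B, C = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)
center, u, v = steiner_inellipse(A, B, C)
mid_bc = point_on_ellipse(center, u, v, math.pi)        # touches BC at its midpoint
mid_ab = point_on_ellipse(center, u, v, math.pi / 3.0)  # touches AB at its midpoint
print(center, mid_bc, mid_ab)
```

Because the ellipse is centred at the centroid and hugs all three edges, it is a natural choice for fitting a single elliptical surfel to one mesh triangle.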
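[50] TouchAnything: Diffusion-Guided 3D Reconstruction from Sparse Robot Touches cs.CV | cs.RO
Langzhe Gu, Hung-Jui Huang, Mohamad Qadri, Michael Kaess, Wenzhen Yuan
TL;DR: TouchAnything is a framework that uses a pretrained large-scale 2D vision diffusion model as a semantic and geometric prior to reconstruct 3D geometry from sparse tactile measurements. It casts reconstruction as an optimization problem that combines sparse contact constraints with the diffusion prior, recovers accurate geometry from only a few touches, and supports open-world 3D reconstruction of previously unseen object instances.
Details
Motivation: Vision becomes unreliable under occlusion or poor lighting. Tactile sensing provides direct geometric information, but reconstructing global 3D geometry from sparse local touches alone is fundamentally underconstrained.
Result: The method reconstructs accurate geometry from only a few touches, outperforms existing baselines, and reconstructs previously unseen object instances in open-world 3D reconstruction.
Insight: The key idea is to transfer the geometric knowledge in a pretrained visual diffusion model to the tactile domain as a prior that guides reconstruction. This avoids training category-specific networks or learning diffusion models directly from tactile data, and it yields cross-modal knowledge transfer and open-world reconstruction.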
Abstract: Accurate object geometry estimation is essential for many downstream tasks, including robotic manipulation and physical interaction. Although vision is the dominant modality for shape perception, it becomes unreliable under occlusions or challenging lighting conditions. In such scenarios, tactile sensing provides direct geometric information through physical contact. However, reconstructing global 3D geometry from sparse local touches alone is fundamentally underconstrained. We present TouchAnything, a framework that leverages a pretrained large-scale 2D vision diffusion model as a semantic and geometric prior for 3D reconstruction from sparse tactile measurements. Unlike prior work that trains category-specific reconstruction networks or learns diffusion models directly from tactile data, we transfer the geometric knowledge encoded in pretrained visual diffusion models to the tactile domain. Given sparse contact constraints and a coarse class-level description of the object, we formulate reconstruction as an optimization problem that enforces tactile consistency while guiding solutions toward shapes consistent with the diffusion prior. Our method reconstructs accurate geometries from only a few touches, outperforms existing baselines, and enables open-world 3D reconstruction of previously unseen object instances. Our project page is https://grange007.github.io/touchanything .
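[51] Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift cs.CV | cs.LG
Harshith Kethavath, Weiming Hu
TL;DR: This paper studies adaptation strategies for vision-language models (such as CLIPSeg) on cloud segmentation in remote sensing imagery. On satellite imagery with a strong domain shift, no prompt engineering approach makes effective use of the pretrained model, and all perform below the zero-shot baseline. Supervised fine-tuning with very little labeled data (0.1%) already surpasses zero-shot performance, and 5-10% of the data recovers most of the achievable performance.
Details
Motivation: To address the difficulty of adapting vision-language models to out-of-distribution data such as remote sensing imagery, to challenge the dominant prompting-based deployment paradigm, and to test whether supervised fine-tuning beats prompting when the domain mismatch is large.
Result: On the CloudSEN12+ cloud segmentation benchmark, all 60 prompt variants underperform the zero-shot baseline (0.255 mIoU), with the worst scoring only 0.07 mIoU. Supervised fine-tuning with just 0.1% labeled data (about 8 images) surpasses zero-shot performance, and 5-10% of the data reaches about 85% of the maximum achievable mIoU. Full fine-tuning consistently outperforms low-rank adaptation (LoRA) by 0.03-0.09 mIoU.
Insight: In domains such as remote sensing, where both the visual and linguistic distributions differ greatly from the pretraining data, prompt engineering struggles to bridge the domain gap, and supervised fine-tuning on a small labeled set is the more effective and worthwhile adaptation path. Full fine-tuning also performs better on spectrally ambiguous classes, and aggregate mIoU can mask a performance drop on specific classes at low data volumes.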
Abstract: Adapting vision-language models to remote sensing imagery presents a fundamental challenge: both the visual and linguistic distributions of satellite data lie far outside natural image pretraining corpora. Despite this, prompting remains the dominant deployment paradigm, driven by the assumption that domain-specific language can guide frozen model representations toward specialized tasks. We test this assumption directly on a domain where the mismatch is prominent: cloud segmentation for satellite imagery. Using CLIPSeg on the CloudSEN12+ cloud segmentation benchmark, we evaluate 60 prompt variants spanning simple labels, domain terminology, appearance descriptors, and contextual cues, finding that every variant underperforms the zero-shot baseline (0.255 mIoU), with engineered prompts scoring as low as 0.07 mIoU. No amount of linguistic refinement bridges the gap between CLIP’s natural image representations and satellite spectral imagery. In contrast, supervised fine-tuning with just 0.1% labeled data (~8 images) surpasses zero-shot performance overall, and 5-10% data recovers ~85% of maximum achievable mIoU. Full fine-tuning consistently outperforms low-rank adaptation by 0.03-0.09 mIoU, with the largest gaps for spectrally ambiguous classes, and at 0.5 to 1% labeled data, fine-tuning temporarily degrades performance on these classes before recovering, a supervision dip that aggregate mIoU can mask. For practitioners adapting vision-language models to specialized imagery, our results deliver a clear message: labeled data is not the expensive alternative to prompting; it is the worthwhile path.
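The point that aggregate mIoU can mask a class-level drop is easy to see with a small computation. The sketch below is a minimal, assumed implementation of per-class IoU and its mean over flat lists of integer labels. The function names and the toy labels are hypothetical, and the paper's exact averaging protocol (for example, how absent classes are handled) is not stated in this digest.

```python
def per_class_iou(pred, target, num_classes):
    """Per-class intersection-over-union for flat lists of integer labels."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        # A class absent from both prediction and target has no defined IoU.
        ious.append(inter / union if union else None)
    return ious

def mean_iou(ious):
    """Mean over the classes whose IoU is defined."""
    defined = [v for v in ious if v is not None]
    return sum(defined) / len(defined)

# Toy example (made-up labels): class 2 is never predicted.
target = [0, 0, 0, 0, 1, 1, 2, 2]
pred = [0, 0, 0, 0, 1, 1, 1, 1]
ious = per_class_iou(pred, target, num_classes=3)
```

In this toy case the per-class values are 1.0, 0.5, and 0.0, so the mean of 0.5 hides the fact that one class is missed entirely. Reporting the per-class list alongside the mean makes such dips visible.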
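[52] How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms cs.CV
Shengji Jin, Yuanhao Zou, Victor Zhu, Zhengping Ji, Chen Chen
TL;DR: Through controlled experiments, this paper systematically compares the accuracy and efficiency of three dominant output paradigms for video temporal grounding (text numeral generation, temporal token generation, and continuous temporal decoding) using the same compact vision-language models, datasets, and fine-tuning protocol. The choice of output paradigm significantly affects grounding accuracy and computational cost independent of model scale, and the continuous distribution paradigm achieves the best efficiency-accuracy trade-off on the Pareto frontier.
Details
Motivation: Existing video temporal grounding methods typically couple the output paradigm with different backbones, datasets, and training protocols, which makes it hard to isolate the effect of the output design. With resource-constrained edge deployment in mind, the trade-off between output paradigm and system-level efficiency also needs systematic study.
Result: Grounding accuracy and system efficiency (inference latency, training throughput, and parameter overhead) are evaluated on Charades-STA, QVHighlights, and YouCook2. The continuous distribution paradigm consistently achieves the most favorable efficiency-accuracy trade-off on the Pareto frontier, delivering robust localization with minimal latency overhead.
Insight: The novelty is the first isolated, controlled comparison of VTG output paradigms, which shows that output design affects accuracy and efficiency independent of model scale. The empirical evidence that continuous decoding is more efficient gives clear guidance for designing deployable VTG systems.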
Abstract: While Multimodal Large Language Models (MLLMs) have advanced Video Temporal Grounding (VTG), existing methods often couple output paradigms with different backbones, datasets, and training protocols. This makes it challenging to isolate the specific impact of the output design. Additionally, as VTG systems are increasingly considered for resource-constrained edge deployment, the trade-off between output formulation and system-level efficiency requires systematic investigation. In this paper, we present a controlled empirical study comparing three dominant VTG output paradigms: Text Numeral Generation, Temporal Token Generation, and Continuous Temporal Decoding. We evaluate these paradigms across identical compact VLMs (SmolVLM2, FastVLM, and Molmo2) using consistent datasets and LoRA fine-tuning protocols. Evaluations on Charades-STA, QVHighlights, and YouCook2 measure both localization accuracy and system efficiency, including inference latency, training throughput, and parameter overhead. Our results demonstrate that the choice of output formulation significantly affects both grounding accuracy and computational cost, independent of model scale. Specifically, the continuous distribution paradigm consistently achieves the most favorable efficiency-accuracy trade-off on the Pareto frontier, delivering robust localization with minimal latency overhead. These findings provide objective empirical guidelines for designing efficient, deployment-ready VTG systems.
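[53] ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning cs.CV
Shifeng Liu, Zhengye Zhang, Sirui Zhao, Xinglong Mao, Zhehan Kan
TL;DR: This paper proposes ActFER, a framework that reformulates facial expression recognition (FER) as active visual evidence acquisition combined with multimodal reasoning. It achieves active facial perception by dynamically invoking face detection and alignment tools, selectively zooming into informative local regions, and reasoning over facial Action Units (AUs) and emotions through a visual Chain-of-Thought.
Details
Motivation: Existing FER methods based on multimodal large language models (MLLMs) still follow a passive paradigm: they rely on externally prepared facial inputs and perform single-pass reasoning, with no capability for active facial perception. This paper addresses that limitation and moves FER from pure label prediction toward reasoning-based affect understanding.
Result: In comprehensive experiments, ActFER trained with UC-GRPO consistently outperforms passive MLLM-based FER baselines and substantially improves AU prediction accuracy.
Insight: The contributions are an agentic framework that recasts FER as active visual evidence acquisition and reasoning, and UC-GRPO, a reinforcement learning algorithm that uses AU-grounded multi-level verifiable rewards, query-conditional contrastive utility estimation, and emotion-aware EMA calibration to achieve sample-aware dynamic credit assignment and to learn a local inspection policy. Viewed objectively, the tool-augmented active visual reasoning mechanism gives FER a more flexible and interpretable perception-reasoning pipeline.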
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have created new opportunities for facial expression recognition (FER), moving it beyond pure label prediction toward reasoning-based affect understanding. However, existing MLLM-based FER methods still follow a passive paradigm: they rely on externally prepared facial inputs and perform single-pass reasoning over fixed visual evidence, without the capability for active facial perception. To address this limitation, we propose ActFER, an agentic framework that reformulates FER as active visual evidence acquisition followed by multimodal reasoning. Specifically, ActFER dynamically invokes tools for face detection and alignment, selectively zooms into informative local regions, and reasons over facial Action Units (AUs) and emotions through a visual Chain-of-Thought. To realize such behavior, we further develop Utility-Calibrated GRPO (UC-GRPO), a reinforcement learning algorithm tailored to agentic FER. UC-GRPO uses AU-grounded multi-level verifiable rewards to densify supervision, query-conditional contrastive utility estimation to enable sample-aware dynamic credit assignment for local inspection, and emotion-aware EMA calibration to reduce noisy utility estimates while capturing emotion-wise inspection tendencies. This algorithm enables ActFER to learn both when local inspection is beneficial and how to reason over the acquired evidence. Comprehensive experiments show that ActFER trained with UC-GRPO consistently outperforms passive MLLM-based FER baselines and substantially improves AU prediction accuracy.
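[54] PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos cs.CV | cs.AI
Zhiyu Zhou, Peilin Liu, Ruoxuan Zhang, Luyang Zhang, Cheng Zhang
TL;DR: This paper introduces PinpointQA, a dataset and benchmark for evaluating how well multimodal large language models understand the spatial position of small objects in indoor videos. It contains four progressively harder tasks. Experiments reveal capability gaps in existing models, and supervised fine-tuning on the dataset substantially improves performance.
Details
Motivation: Existing benchmarks do not directly evaluate whether a model can localize a target object in video and express its position with sufficient precision, even though this has practical value for object search and assistive applications.
Result: Experiments on representative MLLMs reveal a consistent capability gap along the progressive task chain, with Structured Spatial Prediction being particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks.
Insight: The novelty is the first dataset and benchmark focused on small object-centric spatial understanding in indoor videos, with four progressive tasks built through automatic generation and quality control. It serves as both a diagnostic benchmark and an effective training set.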
Abstract: Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at https://rainchowz.github.io/PinpointQA.
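[55] Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory cs.CV
Zile Wang, Zexiang Liu, Jaixing Li, Kaichen Huang, Baixin Xu
TL;DR: Matrix-Game 3.0 is a memory-augmented interactive world model for 720p real-time long-form video generation. Through systematic improvements to data, model, and inference, it addresses the difficulty existing methods have in achieving long-term temporal consistency and high-resolution real-time generation at the same time.
Details
Motivation: To resolve the trade-off in existing interactive video generation methods between memory-enabled long-term temporal consistency and high-resolution real-time generation, and thereby improve applicability in real-world scenarios.
Result: Experiments show that the 5B Matrix-Game 3.0 model achieves up to 40 FPS real-time generation at 720p while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization.
Insight: The contributions are: 1) an industrial-scale infinite data engine that integrates synthetic data, game capture, and real-world video augmentation; 2) a training framework that learns self-correction by modeling prediction residuals and re-injecting imperfect generated frames, combined with camera-aware memory retrieval and injection for long-horizon spatiotemporal consistency; 3) a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation, combined with model quantization and VAE decoder pruning, for efficient real-time inference. Together these offer a practical path toward deployable industrial-scale world models.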
Abstract: With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time longform video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.
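[56] StreamMeCo: Long-Term Agent Memory Compression for Efficient Streaming Video Understanding cs.CV
Junxi Wang, Te Sun, Jiayi Zhu, Junxian Li, Haowen Xu
TL;DR: This paper proposes StreamMeCo, an efficient agent memory compression framework for streaming video understanding. Node sampling and weight pruning strategies based on memory graph connectivity greatly compress memory storage while maintaining accuracy, and a time-decay memory retrieval mechanism mitigates the performance degradation caused by compression.
Details
Motivation: To reduce the large memory overhead of storing agent memory in streaming video understanding, and so lower storage and computation costs.
Result: On three benchmark datasets (M3-Bench-robot, M3-Bench-web, and Video-MME-Long), under 70% memory graph compression, the method achieves a 1.87× speedup in memory retrieval and an average accuracy improvement of 1.0%.
Insight: The novelty is a compression strategy differentiated by each memory node's connectivity in the graph (edge-free minmax sampling for isolated nodes and edge-aware weight pruning for connected nodes), combined with a time-decay retrieval mechanism, which maintains or even improves model performance under efficient compression.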
Abstract: Vision agent memory has shown remarkable effectiveness in streaming video understanding. However, storing such memory for videos incurs substantial memory overhead, leading to high costs in both storage and computation. To address this issue, we propose StreamMeCo, an efficient Stream Agent Memory Compression framework. Specifically, based on the connectivity of the memory graph, StreamMeCo introduces edge-free minmax sampling for the isolated nodes and an edge-aware weight pruning for connected nodes, evicting the redundant memory nodes while maintaining the accuracy. In addition, we introduce a time-decay memory retrieval mechanism to further eliminate the performance degradation caused by memory compression. Extensive experiments on three challenging benchmark datasets (M3-Bench-robot, M3-Bench-web and Video-MME-Long) demonstrate that under 70% memory graph compression, StreamMeCo achieves a 1.87× speedup in memory retrieval while delivering an average accuracy improvement of 1.0%. Our code is available at https://github.com/Celina-love-sweet/StreamMeCo.
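[57] Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection cs.CV | cs.AI | cs.CR | cs.LG
Zedian Shao, Hongbin Liu, Yuepeng Hu, Neil Zhenqiang Gong
TL;DR: This paper proposes ImageProtector, a user-side method that proactively protects images by embedding a carefully crafted, nearly imperceptible perturbation that acts as a visual prompt injection attack, preventing multi-modal large language models (MLLMs) from analyzing image content. When an adversary analyzes a protected image with an MLLM, the model is induced to generate a refusal response.
Details
Motivation: MLLMs that analyze image data at scale may be misused to extract sensitive information (such as identities and locations) from personal images, raising safety and privacy concerns.
Result: Experiments on six MLLMs and four datasets demonstrate the effectiveness of ImageProtector. Three potential countermeasures (Gaussian noise, DiffPure, and adversarial training) partially mitigate its impact but also degrade model accuracy and/or efficiency.
Insight: The novelty is turning visual prompt injection into a proactive privacy protection mechanism, using small perturbations to induce MLLMs to refuse. The study highlights both the promise and the limitations of perturbation-based privacy protection for open-weight MLLMs and large-scale automated image analysis.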
Abstract: Multi-modal large language models (MLLMs) have emerged as powerful tools for analyzing Internet-scale image data, offering significant benefits but also raising critical safety and societal concerns. In particular, open-weight MLLMs may be misused to extract sensitive information from personal images at scale, such as identities, locations, or other private details. In this work, we propose ImageProtector, a user-side method that proactively protects images before sharing by embedding a carefully crafted, nearly imperceptible perturbation that acts as a visual prompt injection attack on MLLMs. As a result, when an adversary analyzes a protected image with an MLLM, the MLLM is consistently induced to generate a refusal response such as “I’m sorry, I can’t help with that request.” We empirically demonstrate the effectiveness of ImageProtector across six MLLMs and four datasets. Additionally, we evaluate three potential countermeasures, Gaussian noise, DiffPure, and adversarial training, and show that while they partially mitigate the impact of ImageProtector, they simultaneously degrade model accuracy and/or efficiency. Our study focuses on the practically important setting of open-weight MLLMs and large-scale automated image analysis, and highlights both the promise and the limitations of perturbation-based privacy protection.
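[58] Skill-Conditioned Visual Geolocation for Vision-Language cs.CV | cs.AI
Chenjie Yang, Yutian Jiang, Chenyu Wu
TL;DR: This paper proposes GeoSkill, a training-free visual geolocation framework based on an evolving Skill-Graph. It distills expert trajectories into atomic skills, performs skill-guided reasoning, and uses an autonomous evolution mechanism to iteratively synthesize and prune skills from web data, addressing the lack of structured reasoning and autonomous self-evolution in existing vision-language models for geolocation.
Details
Motivation: Existing vision-language models rely on implicit parametric memory for geolocation, which is prone to outdated knowledge and hallucinated reasoning, and their inference lacks an outcome-based feedback loop for self-evolution.
Result: On the GeoRC benchmark, GeoSkill performs well in both geolocation accuracy and reasoning faithfulness, and it maintains superior generalization across multiple external datasets.
Insight: The novelty is an evolving Skill-Graph that enables structured geographic reasoning, together with an autonomous evolution mechanism that dynamically expands and corrects skills from web-scale data, improving the system's grasp of real-world geographic knowledge without any parameter updates.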
Abstract: Vision-language models (VLMs) have shown a promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods predominantly rely on implicit parametric memory, which often exploits outdated knowledge and generates hallucinated reasoning. Furthermore, current inference is a “one-off” process, lacking the feedback loops necessary for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework based on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model to perform direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes and prunes skills, effectively expanding the Skill-Graph and correcting geographic biases without any parameter updates. Experiments demonstrate that GeoSkill achieves promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC, while maintaining superior generalization across diverse external datasets. Furthermore, our autonomous evolution fosters the emergence of novel, verifiable skills, significantly enhancing the system’s cognition of real-world geographic knowledge beyond isolated case studies.
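[59] SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos cs.CV | cs.CL | cs.HC
Xiyang Huang, Jiawei Lin, Keying Wu, Jiaxin Huang, Kailai Yang
TL;DR: This paper introduces SiMing-Bench, the first benchmark for evaluating the ability of multimodal large language models (MLLMs) to judge procedural correctness from full-length clinical skill videos. The benchmark focuses on whether a model can track how continuous interactions update the procedural state and judge the correctness of later actions accordingly. The paper also introduces the SiMing-Score dataset of physician-annotated videos from real clinical skill examinations. Experiments show that current MLLMs generally agree only weakly with physician judgments on this task.
Details
Motivation: Current video benchmarks focus on event recognition, temporal ordering, and long-context recall, but overlook a harder capability required for expert procedural judgment: tracking how ongoing interactions update the procedural state and thereby determine the correctness of later actions. The paper aims to fill this gap.
Result: Across diverse open- and closed-source MLLMs, agreement with physician judgments is consistently weak. Performance on rubric-defined intermediate steps remains poor even when overall procedure-level correlation appears acceptable, which indicates that coarse global assessment substantially overestimates current models' procedural judgment ability.
Insight: The novelty is the first video benchmark focused on evaluating whether models can track continuous interactions to update the procedural state and judge procedural correctness. The core insight is that the bottleneck for current models is not merely fine-grained scoring or temporal localization but modeling how continuous interactions update the procedural state over time. This offers a new perspective and tool for evaluating and understanding MLLMs on complex procedural tasks.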
Abstract: Current video benchmarks for multimodal large language models (MLLMs) focus on event recognition, temporal ordering, and long-context recall, but overlook a harder capability required for expert procedural judgment: tracking how ongoing interactions update the procedural state and thereby determine the correctness of later actions. We introduce SiMing-Bench, the first benchmark for evaluating this capability from full-length clinical skill videos. It targets rubric-grounded process-level judgment of whether interaction-driven state updates preserve procedural correctness across an entire workflow. SiMing-Bench is instantiated with SiMing-Score, a physician-annotated dataset of real clinical skill examination videos spanning cardiopulmonary resuscitation, automated external defibrillator operation, and bag-mask ventilation, each paired with a standardized step-wise rubric and dual-expert labels. Across diverse open- and closed-source MLLMs, we observe consistently weak agreement with physician judgments. Moreover, weak performance on rubric-defined intermediate steps persists even when overall procedure-level correlation appears acceptable, suggesting that coarse global assessment substantially overestimates current models’ procedural judgment ability. Additional analyses with binary step judgment and step-aligned clips indicate that the bottleneck is not merely fine-grained scoring or temporal localization, but modeling how continuous interactions update procedural state over time.
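[60] Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting cs.CV
Tsuheng Hsu, Guiyu Liu, Juho Kannala, Janne Heikkilä
TL;DR: This paper proposes a dataset-level, object-centric supervision scheme for learning object representations in 3D Gaussian Splatting (3DGS). Building on a pre-trained slot attention-based Global Object Centric Learning (GOCL) module, the method learns a scene-agnostic object codebook that provides consistent, identity-anchored representations across views and scenes. Coupling the codebook with the module's unsupervised object masks allows direct supervision of the identity features of 3D Gaussians without additional mask pre-/post-processing or multi-view alignment.
Details
Motivation: Existing methods that use 2D masks from visual foundation models (VFMs) to supervise radiance fields for instance-level 3D segmentation rely on supervision signals that are not fundamentally object-centric, and they often need additional mask pre-/post-processing or specialized training and loss design to resolve mask identity conflicts across views. The learned 3D scene identity is scene-dependent, which limits generalization across scenes.
Result: The method introduces unsupervised object-centric learning (OCL) into 3DGS, yielding more structured representations and better generalization on downstream tasks such as robotic interaction, scene understanding, and cross-scene generalization.
Insight: The core contribution is a scene-agnostic object codebook, learned through a pre-trained GOCL module, that provides consistent identity representations across views and scenes. This allows direct supervision of the identity features of 3D Gaussians, avoids complex mask alignment, and enables object supervision and identification without per-scene fine-tuning or retraining, improving generality and generalization.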
Abstract: Recent works on 3D scene understanding leverage 2D masks from visual foundation models (VFMs) to supervise radiance fields, enabling instance-level 3D segmentation. However, the supervision signals from foundation models are not fundamentally object-centric and often require additional mask pre/post-processing or specialized training and loss design to resolve mask identity conflicts across views. The learned identity of the 3D scene is scene-dependent, limiting generalizability across scenes. Therefore, we propose a dataset-level, object-centric supervision scheme to learn object representations in 3D Gaussian Splatting (3DGS). Building on a pre-trained slot attention-based Global Object Centric Learning (GOCL) module, we learn a scene-agnostic object codebook that provides consistent, identity-anchored representations across views and scenes. By coupling the codebook with the module’s unsupervised object masks, we can directly supervise the identity features of 3D Gaussians without additional mask pre-/post-processing or explicit multi-view alignment. The learned scene-agnostic codebook enables object supervision and identification without per-scene fine-tuning or retraining. Our method thus introduces unsupervised object-centric learning (OCL) into 3DGS, yielding more structured representations and better generalization for downstream tasks such as robotic interaction, scene understanding, and cross-scene generalization.
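[61] Fine-Grained Action Segmentation for Renorrhaphy in Robot-Assisted Partial Nephrectomy cs.CV | cs.RO
Jiaheng Dai, Huanrong Liu, Tailai Zhou, Tongyu Jia, Qin Liu
TL;DR: This paper presents SIA-RAPN, a benchmark for fine-grained action segmentation of renorrhaphy in robot-assisted partial nephrectomy, and evaluates four temporal models built on I3D features on it.
Details
Motivation: To address frame-level recognition of suturing gestures during renorrhaphy that are visually similar, variable in duration, and severely class-imbalanced.
Result: Across the strongest reported values over five runs on the SIA-RAPN benchmark, DiffAct achieves the highest F1, frame-wise accuracy, edit score, and frame-wise mAP, while MS-TCN++ attains the highest balanced accuracy.
Insight: The work defines a dedicated clinical video benchmark for fine-grained action segmentation in a specific surgical procedure (RAPN) and systematically evaluates existing temporal models under challenges such as class imbalance.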
Abstract: Fine-grained action segmentation during renorrhaphy in robot-assisted partial nephrectomy requires frame-level recognition of visually similar suturing gestures with variable duration and substantial class imbalance. The SIA-RAPN benchmark defines this problem on 50 clinical videos acquired with the da Vinci Xi system and annotated with 12 frame-level labels. The benchmark compares four temporal models built on I3D features: MS-TCN++, AsFormer, TUT, and DiffAct. Evaluation uses balanced accuracy, edit score, segmental F1 at overlap thresholds of 10, 25, and 50, frame-wise accuracy, and frame-wise mean average precision. In addition to the primary evaluation across five released split configurations on SIA-RAPN, the benchmark reports cross-domain results on a separate single-port RAPN dataset. Across the strongest reported values over those five runs on the primary dataset, DiffAct achieves the highest F1, frame-wise accuracy, edit score, and frame mAP, while MS-TCN++ attains the highest balanced accuracy.
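Segmental F1 at an overlap threshold, one of the metrics listed in the abstract, scores predicted segments rather than individual frames. The sketch below follows the convention commonly used in the action segmentation literature: a predicted segment is a true positive if its IoU with an unmatched ground-truth segment of the same label reaches the threshold. It is an assumed, minimal implementation for illustration and not the benchmark's evaluation code. The function names and the toy sequences are hypothetical.

```python
def segments(labels):
    """Collapse frame-level labels into (label, start, end) runs; end is exclusive."""
    runs = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    return runs

def segmental_f1(pred, target, overlap):
    """Segmental F1 at an IoU threshold `overlap` in (0, 1]."""
    pred_segs, true_segs = segments(pred), segments(target)
    matched = [False] * len(true_segs)
    tp = fp = 0
    for label, ps, pe in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (tl, ts, te) in enumerate(true_segs):
            if tl != label:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            union = max(pe, te) - min(ps, ts)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        # Each ground-truth segment can be matched at most once.
        if best_j >= 0 and best_iou >= overlap and not matched[best_j]:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1
    fn = matched.count(False)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy example (made-up labels): the prediction is over-segmented.
target = [0] * 5 + [1] * 5
pred = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
f1_at_50 = segmental_f1(pred, target, 0.50)
f1_at_25 = segmental_f1(pred, target, 0.25)
```

Because every spurious short segment counts as a false positive, this metric penalizes over-segmentation that frame-wise accuracy barely registers. In the toy example only two of ten frames are wrong, yet the score at the 50% threshold falls to one third.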
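[62] Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence cs.CV | cs.MM | cs.SD
Junchao Liao, Zhenghao Zhang, Xiangyu Meng, Litao Li, Ziying Zhang
TL;DR: Tora3 is a trajectory-guided audio-video generation framework that improves physical coherence by using object trajectories as a shared kinematic prior. It designs a trajectory-aligned motion representation, a kinematic-audio alignment module driven by second-order kinematic states, and a hybrid flow matching scheme. The authors also curate the PAV dataset, and experiments show gains in motion realism, audio-visual synchronization, and overall generation quality.
Details
Motivation: Existing audio-video generation methods handle motion-sound relations poorly: they often produce visually unstable object motion and sounds that are only loosely aligned with salient motion or contact events, largely because they lack an explicit motion-aware structure shared by video and audio generation.
Result: In extensive experiments, Tora3 improves motion realism, motion-sound synchronization, and overall AV generation quality over strong open-source baselines.
Insight: The core contribution is using object trajectories as a shared kinematic prior that jointly guides video and audio generation, together with the corresponding trajectory-aligned representation, kinematic-audio alignment mechanism, and hybrid flow matching scheme for stronger physical coherence. Viewed objectively, the large-scale PAV dataset, which emphasizes motion-relevant patterns, is also a valuable resource for the field.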
Abstract: Audio-video (AV) generation has recently made strong progress in perceptual quality and multimodal coherence, yet generating content with plausible motion-sound relations remains challenging. Existing methods often produce object motions that are visually unstable and sounds that are only loosely aligned with salient motion or contact events, largely because they lack an explicit motion-aware structure shared by video and audio generation. We present Tora3, a trajectory-guided AV generation framework that improves physical coherence by using object trajectories as a shared kinematic prior. Rather than treating trajectories as a video-only control signal, Tora3 uses them to jointly guide visual motion and acoustic events. Specifically, we design a trajectory-aligned motion representation for video, a kinematic-audio alignment module driven by trajectory-derived second-order kinematic states, and a hybrid flow matching scheme that preserves trajectory fidelity in trajectory-conditioned regions while maintaining local coherence elsewhere. We further curate PAV, a large-scale AV dataset emphasizing motion-relevant patterns with automatically extracted motion annotations. Extensive experiments show that Tora3 improves motion realism, motion-sound synchronization, and overall AV generation quality over strong open-source baselines.
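One plausible reading of "trajectory-derived second-order kinematic states" is velocity and acceleration obtained from the trajectory by finite differences. The sketch below illustrates that reading only. The abstract does not say how Tora3 computes these states, so the function name `kinematic_states`, the 2D point format, and the fixed sampling interval `dt` are all assumptions.

```python
def kinematic_states(points, dt):
    """Velocity and acceleration along a 2D trajectory via forward differences.

    points: list of (x, y) positions sampled every dt seconds.
    Returns (velocities, accelerations) with len(points) - 1 and
    len(points) - 2 entries respectively.
    """
    vel = [((x1 - x0) / dt, (y1 - y0) / dt)
           for (x0, y0), (x1, y1) in zip(points, points[1:])]
    acc = [((vx1 - vx0) / dt, (vy1 - vy0) / dt)
           for (vx0, vy0), (vx1, vy1) in zip(vel, vel[1:])]
    return vel, acc

# Toy trajectory (made-up values): an object falls, bounces at y = 0, and rises.
trajectory = [(0.0, 4.0), (1.0, 2.0), (2.0, 0.0), (3.0, 2.0), (4.0, 4.0)]
vel, acc = kinematic_states(trajectory, dt=1.0)
```

In the toy trajectory the acceleration is zero except at the bounce, where the vertical velocity flips sign. A spike of this kind is the sort of signal that could mark a contact event for sound to align with, which may be why second-order states are useful for audio alignment.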
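[63] Learning Vision-Language-Action World Models for Autonomous Driving cs.CV | cs.AI
Guoqing Wang, Pin Tang, Xiangxuan Ren, Guodongfang Zhao, Bailan Feng
TL;DR: This paper proposes VLA-World, a model that combines Vision-Language-Action (VLA) with world models for autonomous driving. It generates a future frame under action guidance and then reasons over the self-generated frame to refine trajectory prediction, improving driving foresight and safety.
Details
Motivation: Existing VLA models lack explicit modeling of temporal dynamics and global consistency, which limits their foresight. World models can simulate future scenes but struggle to reason about or evaluate the futures they generate. This paper aims to unify predictive imagination with reflective reasoning to improve driving foresight.
Result: Supported by nuScenes-GR-20K, a generative reasoning dataset derived from nuScenes, VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks.
Insight: The novelty is combining the VLA framework with a world model, unifying prediction and reasoning through a loop of action-guided future frame generation and reflective reasoning over the generated frame. Viewed objectively, the three-stage training strategy (pretraining, supervised fine-tuning, and reinforcement learning) and the dedicated dataset construction are also worth borrowing.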
Abstract: Vision-Language-Action (VLA) models have recently achieved notable progress in end-to-end autonomous driving by integrating perception, reasoning, and control within a unified multimodal framework. However, they often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety. In contrast, world models can simulate plausible future scenes but generally struggle to reason about or evaluate the imagined future they generate. In this work, we present VLA-World, a simple yet effective VLA world model that unifies predictive imagination with reflective reasoning to improve driving foresight. VLA-World first uses an action-derived feasible trajectory to guide the generation of the next-frame image, capturing rich spatial and temporal cues that describe how the surrounding environment evolves. The model then reasons over this self-generated future imagined frame to refine the predicted trajectory, achieving higher performance and better interpretability. To support this pipeline, we curate nuScenes-GR-20K, a generative reasoning dataset derived from nuScenes, and employ a three-stage training strategy that includes pretraining, supervised fine-tuning, and reinforcement learning. Extensive experiments demonstrate that VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks. Project page: https://vlaworld.github.io
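[64] Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition cs.CV | cs.AI
Yuxi Zhou, Zhengbo Zhang, Jingyu Pan, Zhiyu Lin, Zhigang Tu
TL;DR: This paper proposes FDSM, a frequency-enhanced diffusion model for zero-shot skeleton action recognition. Using a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction, it addresses the spectral bias that causes diffusion models to oversmooth high-frequency dynamics, and so effectively recovers fine-grained motion details.
Details
Motivation: Zero-shot skeleton action recognition is hampered by the spectral bias of diffusion models, which oversmooths high-frequency dynamics and limits generalization to novel actions.
Result: State-of-the-art performance on the NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets.
Insight: The contributions are a frequency-aware mechanism that strengthens the diffusion model's capture of high-frequency detail and a curriculum-guided semantic alignment strategy. Both could be applied to other diffusion-based zero-shot learning tasks to improve detail recovery.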
Abstract: Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at https://github.com/yuzhi535/FDSM. Project homepage: https://yuzhi535.github.io/FDSM.github.io/
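[65] Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation cs.CV
Yutong Zhang, Jiaxin Chen, Honglin Chen, Kaiqi Zheng, Shengcai Liao
TL;DR: This paper proposes Masked Dual Path Distillation (MDPD), a memory-efficient transfer learning method that addresses the extra memory and time overhead existing methods incur at inference by adding a side network. It improves performance by mutually distilling the frozen backbone and the learnable side network during fine-tuning, then discards the side network at inference to speed it up without losing accuracy.
Details
Motivation: Existing memory-efficient transfer learning methods reduce trainable parameters and memory consumption during fine-tuning, but because they use a lightweight learnable side network they introduce additional memory and time overhead at inference, which contradicts the ultimate goal of efficient transfer learning.
Result: Extensive experiments on vision, language, and vision-and-language tasks show that the method accelerates inference by at least 25.2% while keeping parameter and memory consumption comparable, and that it markedly improves accuracy compared with SOTA approaches.
Insight: The novelty is a dual-path distillation framework that performs mutual knowledge distillation between the backbone and the side network during fine-tuning, plus a feature-based knowledge distillation method for multi-layer encoder structures, so that the side network is not needed at inference and both training efficiency and inference speed are preserved.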
Abstract: Memory-efficient transfer learning (METL) approaches have recently achieved promising performance in adapting pre-trained models to downstream tasks. They avoid applying gradient backpropagation in large backbones, thus significantly reducing the number of trainable parameters and the memory consumption during fine-tuning. However, since they typically employ a lightweight and learnable side network, these methods inevitably introduce additional memory and time overhead during inference, which contradicts the ultimate goal of efficient transfer learning. To address the above issue, we propose a novel approach dubbed Masked Dual Path Distillation (MDPD) to accelerate inference while retaining parameter and memory efficiency in fine-tuning with fading side networks. Specifically, MDPD develops a framework that enhances the performance by mutually distilling the frozen backbones and learnable side networks in fine-tuning, and discards the side network during inference without sacrificing accuracy. Moreover, we design a novel feature-based knowledge distillation method for the encoder structure with multiple layers. Extensive experiments on distinct backbones across vision/language-only and vision-and-language tasks demonstrate that our method not only accelerates inference by at least 25.2% while keeping parameter and memory consumption comparable, but also remarkably promotes the accuracy compared to SOTA approaches. The source code is available at https://github.com/Zhang-VKk/MDPD.
[66] Off-the-shelf Vision Models Benefit Image Manipulation Localization cs.CV | cs.MM | eess.IVPDF
Zhengxuan Zhang, Keji Song, Junmin Hu, Ao Luo, Yuezun Li
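TL;DR: This paper argues that image manipulation localization and general vision tasks are intrinsically connected, and that general semantic priors can benefit manipulation localization. Building on this view, the authors design a trainable adapter named ReVi that reuses off-the-shelf general-purpose vision models (e.g., image generation and segmentation networks) for manipulation localization, disentangling semantic redundancy from manipulation-specific information in those models and enhancing the latter. The method requires no model redesign or full retraining, only fine-tuning of the adapter, and experiments show its superiority.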
Details
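Motivation: Image manipulation localization and general vision tasks are studied separately because of their feature differences; the paper aims to explore and exploit general semantic priors to improve manipulation localization.
Result: Experimental results demonstrate the superiority of the method and show its potential for building scalable image manipulation localization frameworks.
Insight: The novelty is building a bridge between general vision tasks and manipulation localization, and proposing an adapter inspired by robust principal component analysis that efficiently uses the frozen parameters of off-the-shelf models. Only the lightweight adapter is fine-tuned to obtain the gains, which suggests a new route toward scalable manipulation localization systems.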
Abstract: Image manipulation localization (IML) and general vision tasks are typically treated as two separate research directions due to the fundamental differences between manipulation-specific and semantic features. In this paper, however, we bridge this gap by introducing a fresh perspective: these two directions are intrinsically connected, and general semantic priors can benefit IML. Building on this insight, we propose a novel trainable adapter (named ReVi) that repurposes existing off-the-shelf general-purpose vision models (e.g., image generation and segmentation networks) for IML. Inspired by robust principal component analysis, the adapter disentangles semantic redundancy from manipulation-specific information embedded in these models and selectively enhances the latter. Unlike existing IML methods that require extensive model redesign and full retraining, our method relies on the off-the-shelf vision models with frozen parameters and only fine-tunes the proposed adapter. The experimental results demonstrate the superiority of our method, showing the potential for scalable IML frameworks.
[67] Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch cs.CV | cs.ROPDF
Gabriele Mario Caddeo, Pasquale Marra, Lorenzo Natale
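TL;DR: This paper proposes a multimodal, physically grounded method for metric-scale amodal object reconstruction and pose estimation under severe hand occlusion. It uses physical interaction signals (proprioception provides the posed hand geometry, and multi-contact touch constrains where the object surface lies) to reduce ambiguity in occluded regions. A conditional flow-matching diffusion model is trained on vision, occluder/visibility masks, the hand latent representation, and tactile information, with physics-based objectives and differentiable decoder guidance, to produce metric-scale, physically consistent structure estimates.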
Details
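Motivation: Existing vision-only methods struggle with accurate 3D generative reconstruction and pose estimation under severe hand occlusion. The paper introduces physical interaction signals (proprioception and touch) as additional constraints to reduce the ambiguity caused by occlusion.
Result: In simulation, adding proprioception and touch substantially improves completion under occlusion compared to vision-only baselines and yields physically plausible reconstructions at correct real-world scale. The model was also deployed on a real humanoid robot with an end-effector different from those used during training.
Insight: The novelty is fusing physical interaction signals (proprioception and multi-contact touch) with vision for 3D generative reconstruction under occlusion, and enforcing physical consistency through physics-based objectives and differentiable guidance. Viewed objectively, the multimodal fusion and physical constraints improve robustness and accuracy in heavily occluded scenes, and the approach transfers well.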
Abstract: We propose a multimodal, physically grounded approach for metric-scale amodal object reconstruction and pose estimation under severe hand occlusion. Unlike prior occlusion-aware 3D generation methods that rely only on vision, we leverage physical interaction signals: proprioception provides the posed hand geometry, and multi-contact touch constrains where the object surface must lie, reducing ambiguity in occluded regions. We represent object structure as a pose-aware, camera-aligned signed distance field (SDF) and learn a compact latent space with a Structure-VAE. In this latent space, we train a conditional flow-matching diffusion model, pretraining on vision-only images and finetuning on occluded manipulation scenes while conditioning on visible RGB evidence, occluder/visibility masks, the hand latent representation, and tactile information. Crucially, we incorporate physics-based objectives and differentiable decoder-guidance during finetuning and inference to reduce hand–object interpenetration and to align the reconstructed surface with contact observations. Because our method produces a metric, physically consistent structure estimate, it integrates naturally into existing two-stage reconstruction pipelines, where a downstream module refines geometry and predicts appearance. Experiments in simulation show that adding proprioception and touch substantially improves completion under occlusion and yields physically plausible reconstructions at correct real-world scale compared to vision-only baselines; we further validate transfer by deploying the model on a real humanoid robot with an end-effector different from those used during training.
[68] FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval cs.CV | cs.LGPDF
François Gardères, Camille-Sovanneary Gauthier, Jean Ponce, Shizhe Chen
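TL;DR: This paper proposes FIRE-CIR, a model for composed image retrieval (CIR), particularly in the fashion domain. It automatically generates attribute-focused visual questions from the modification text and verifies the visual evidence in both the reference and candidate images, enabling fine-grained compositional reasoning that improves retrieval performance and interpretability.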
Details
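Motivation: Existing CIR methods based on vision-language models (VLMs) rely only on embedding similarity and lack explicit reasoning about what to preserve and what to change, which leads to poor performance and weak interpretability in fine-grained domains such as fashion.
Result: On the Fashion IQ benchmark, FIRE-CIR outperforms state-of-the-art (SOTA) methods in retrieval accuracy.
Insight: The core novelty is recasting composed retrieval as question-driven visual reasoning: a large-scale fashion visual question answering dataset is constructed automatically to train the system, and explicit reasoning is used to re-rank candidates, providing attribute-level interpretability.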
Abstract: Composed image retrieval (CIR) aims to retrieve a target image that depicts a reference image modified by a textual description. While recent vision-language models (VLMs) achieve promising CIR performance by embedding images and text into a shared space for retrieval, they often fail to reason about what to preserve and what to change. This limitation hinders interpretability and yields suboptimal results, particularly in fine-grained domains like fashion. In this paper, we introduce FIRE-CIR, a model that brings compositional reasoning and interpretability to fashion CIR. Instead of relying solely on embedding similarity, FIRE-CIR performs question-driven visual reasoning: it automatically generates attribute-focused visual questions derived from the modification text, and verifies the corresponding visual evidence in both reference and candidate images. To train such a reasoning system, we automatically construct a large-scale fashion-specific visual question answering dataset, containing questions requiring either single- or dual-image analysis. During retrieval, our model leverages this explicit reasoning to re-rank candidate results, filtering out images inconsistent with the intended modifications. Experimental results on the Fashion IQ benchmark show that FIRE-CIR outperforms state-of-the-art methods in retrieval accuracy. It also provides interpretable, attribute-level insights into retrieval decisions.
[69] Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching cs.CVPDF
Jiahao Li, Xinhong Chen, Zhengmin Jiang, Cheng Huang, Yung-Hui Li
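TL;DR: This paper proposes GREATEN, a framework that improves the synthetic-to-real zero-shot generalization of stereo matching models by introducing surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues. The framework has three core components: a gated contextual-geometric fusion module, a specular/transparent augmentation strategy, and sparse attention designs, which together address cross-domain shifts and the ill-posed ambiguities inherent in image textures.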
Details
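Motivation: Stereo matching models generalize poorly in the zero-shot setting from synthetic data to real scenes, mainly because of cross-domain shifts and the ambiguities inherent in image textures in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions.
Result: Trained only on synthetic data (such as SceneFlow), GREATEN-IGEV achieves strong zero-shot generalization on several real datasets: it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster dataset, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. It also runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference.
Insight: The novelty is using surface normals as a domain-invariant geometric prior to strengthen feature representations, along with an adaptive fusion module and an augmentation strategy for non-Lambertian regions. Viewed objectively, the sparse attention mechanism substantially reduces computational overhead while preserving global feature extraction, making it an efficient and robust architectural design.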
Abstract: Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization performance mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, sparse attention designs preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead, including Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attentions. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.
[70] Benchmarking CNN- and Transformer-Based Models for Surgical Instrument Segmentation in Robotic-Assisted Surgery cs.CV | nlin.PSPDF
Sara Ameli
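TL;DR: This paper benchmarks five deep learning models (UNet, UNet++, DeepLabV3, Attention UNet, and SegFormer) on the SAR-RARP50 dataset for multi-class semantic segmentation of surgical instruments in robotic-assisted surgery. It finds that the convolution-based DeepLabV3 performs comparably to the Transformer-based SegFormer, and that both handle complex surgical scenes effectively.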
Details
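Motivation: Accurate segmentation of surgical instruments in robotic-assisted surgery is critical for context-aware computer-assisted interventions such as tool tracking, workflow analysis, and autonomous decision-making. The study aims to provide a comprehensive comparison and practical insights for model selection in surgical AI applications.
Result: Experiments on the SAR-RARP50 dataset (real-world radical prostatectomy videos) show that convolutional models such as UNet and Attention UNet provide strong baseline performance, while DeepLabV3 achieves results comparable to SegFormer, demonstrating the effectiveness of atrous convolution and multi-scale context aggregation in capturing complex surgical scenes.
Insight: The contribution is a systematic comparison of convolutional and Transformer architectures for surgical instrument segmentation. Viewed objectively, the core insight is that on this dataset and task a carefully designed convolutional model (such as DeepLabV3 with atrous convolution) can match a Transformer model that emphasizes global context (such as SegFormer) in capturing local detail and complex scenes, which is a useful reference for practical model selection where compute cost must be weighed against performance.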
Abstract: Accurate segmentation of surgical instruments in robotic-assisted surgery is critical for enabling context-aware computer-assisted interventions, such as tool tracking, workflow analysis, and autonomous decision-making. In this study, we benchmark five deep learning architectures (UNet, UNet++, DeepLabV3, Attention UNet, and SegFormer) on the SAR-RARP50 dataset for multi-class semantic segmentation of surgical instruments in real-world radical prostatectomy videos. The models are trained with a compound loss function combining Cross Entropy and Dice loss to address class imbalance and capture fine object boundaries. Our experiments reveal that while convolutional models such as UNet and Attention UNet provide strong baseline performance, DeepLabV3 achieves results comparable to SegFormer, demonstrating the effectiveness of atrous convolution and multi-scale context aggregation in capturing complex surgical scenes. Transformer-based architectures like SegFormer further enhance global contextual understanding, leading to improved generalization across varying instrument appearances and surgical conditions. This work provides a comprehensive comparison and practical insights for selecting segmentation models in surgical AI applications, highlighting the trade-offs between convolutional and transformer-based approaches.
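The compound loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting or the exact Dice formulation, so the soft Dice variant, the `dice_weight` parameter, and all function names below are assumptions for illustration only, not the author's implementation.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-likelihood; probs[i][c] is the predicted
    probability of class c at pixel i, labels[i] is the true class."""
    eps = 1e-12  # avoids log(0)
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)

def dice_loss(probs, labels, num_classes):
    """1 minus the soft Dice score averaged over classes."""
    eps = 1e-6  # keeps the ratio defined for classes absent from the batch
    scores = []
    for c in range(num_classes):
        intersection = sum(p[c] for p, y in zip(probs, labels) if y == c)
        pred_mass = sum(p[c] for p in probs)
        true_mass = sum(1 for y in labels if y == c)
        scores.append((2 * intersection + eps) / (pred_mass + true_mass + eps))
    return 1.0 - sum(scores) / num_classes

def compound_loss(probs, labels, num_classes, dice_weight=1.0):
    """Cross Entropy plus weighted Dice (dice_weight is a hypothetical knob)."""
    return cross_entropy(probs, labels) + dice_weight * dice_loss(probs, labels, num_classes)

# Two pixels, two classes: a perfect prediction versus an uninformative one.
labels = [0, 1]
perfect = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
loss_perfect = compound_loss(perfect, labels, num_classes=2)
loss_uniform = compound_loss(uniform, labels, num_classes=2)
```

Cross Entropy scores each pixel independently, while the Dice term is computed per class over the whole image, which is why the combination is commonly used when small classes would otherwise be dominated by background pixels.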
[71] Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection cs.CVPDF
Yicheng Qiu, Keiji Yanai
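TL;DR: This paper proposes the Efficient Spatial-Temporal Focal (ESTF) Adapter for temporal action detection. It combines the proposed Temporal Boundary-aware State Space Model (TB-SSM) for temporal features with efficient processing of spatial features, targeting feature redundancy and degraded global dependency modeling in long video sequences.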
Details
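Motivation: Existing CNN and Transformer models suffer from feature redundancy and degraded global dependency modeling on long video sequences, which limits their scalability in real-world video analysis. State Space Models (SSMs), with linear long-term modeling and strong global temporal reasoning, offer a promising alternative for temporal modeling.
Result: Comprehensive quantitative analyses on multiple benchmarks, including comparisons with previous SSM-based and other structural methods, show that the proposed method significantly improves localization performance and robustness.
Insight: The novelty is combining the proposed temporal boundary-aware SSM (TB-SSM) with efficient spatial feature processing in the ESTF Adapter module, and integrating that module into pre-trained layers to strengthen spatio-temporal modeling of long videos.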
Abstract: Temporal human action detection aims to identify and localize action segments within untrimmed videos, serving as a pivotal task in video understanding. Despite the progress achieved by prior architectures like CNN and Transformer models, these continue to struggle with feature redundancy and degraded global dependency modeling capabilities when applied to long video sequences. These limitations severely constrain their scalability in real-world video analysis. State Space Models (SSMs) offer a promising alternative with linear long-term modeling and robust global temporal reasoning capabilities. Rethinking the application of SSMs in temporal modeling, this research constructs a novel framework for video human action detection. Specifically, we introduce the Efficient Spatial-Temporal Focal (ESTF) Adapter into the pre-trained layers. This module integrates the advantages of our proposed Temporal Boundary-aware SSM (TB-SSM) for temporal feature modeling with efficient processing of spatial features. We perform comprehensive and quantitative analyses across multiple benchmarks, comparing our proposed method against previous SSM-based and other structural methods. Extensive experiments demonstrate that our improved strategy significantly enhances both localization performance and robustness, validating the effectiveness of our proposed method.
[72] MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding cs.CV | cs.MAPDF
Henry Zheng, Chenyue Fang, Rui Huang, Siyuan Wei, Xiao Liu
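TL;DR: MAG-3D is a training-free multi-agent framework that uses off-the-shelf vision-language models for grounded reasoning in 3D scenes. Through dynamic collaboration among planning, grounding, and coding agents, it decomposes the task, localizes relevant objects and regions, and performs geometric reasoning and verification, enabling flexible, zero-shot answers to open-ended queries in complex 3D scenes.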
Details
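Motivation: Grounded reasoning in 3D scenes remains underexplored for vision-language models, and existing methods typically rely on in-domain fine-tuning or hand-crafted reasoning pipelines, which limits their flexibility and zero-shot generalization to new environments.
Result: The method achieves state-of-the-art performance on challenging benchmarks.
Insight: The novelty is a training-free multi-agent framework that dynamically coordinates three expert agents: a planning agent for task decomposition, a grounding agent for free-form 3D grounding and frame retrieval, and a coding agent for geometric reasoning and explicit verification through executable programs. Together they enable flexible, zero-shot grounded 3D reasoning.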
Abstract: Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible training-free 3D grounded reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.
[73] ELT: Elastic Looped Transformers for Visual Generation cs.CVPDF
Sahil Goyal, Swayam Agrawal, Gautham Govind Anil, Prateek Jain, Sujoy Paul
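TL;DR: This paper proposes Elastic Looped Transformers (ELT), a highly parameter-efficient visual generative model built on a recurrent transformer architecture that uses weight-shared looped blocks to cut parameter count sharply while maintaining generation quality. For effective training, the authors introduce Intra-Loop Self Distillation (ILSD), which keeps the model consistent across depth. A single training run yields a family of elastic models, enabling any-time inference with a dynamic trade-off between compute cost and generation quality.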
Details
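Motivation: Conventional generative models rely on deep stacks of unique Transformer layers and therefore have large parameter counts. The paper aims to design a parameter-efficient visual generative model that reduces parameters through looped weight sharing while preserving synthesis quality and allowing a flexible compute-quality trade-off.
Result: Under iso-inference-compute settings, ELT reduces parameter count by 4x and achieves a competitive FID of 2.0 on class-conditional ImageNet 256×256 and an FVD of 72.8 on class-conditional UCF-101, significantly shifting the efficiency frontier for visual synthesis.
Insight: The novelties are the looped Transformer architecture for parameter efficiency and the Intra-Loop Self Distillation (ILSD) training method for consistency across depth. Viewed objectively, the elastic model design and any-time inference capability give generative models a flexible efficiency-quality trade-off.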
Abstract: We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To effectively train these models for image and video generation, we propose the idea of Intra-Loop Self Distillation (ILSD), where student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) to ensure consistency across the model’s depth in a single training step. Our framework yields a family of elastic models from a single training run, enabling Any-Time inference capability with dynamic trade-offs between computational cost and generation quality, with the same parameter count. ELT significantly shifts the efficiency frontier for visual synthesis. With $4\times$ reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of $2.0$ on class-conditional ImageNet $256 \times 256$ and FVD of $72.8$ on class-conditional UCF-101.
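The idea of weight-shared loops with intermediate "student" outputs distilled toward the maximum-loop "teacher" output can be sketched in a few lines. This is a toy illustration, not the paper's model: `toy_block` stands in for a transformer block, and the mean-squared-error distillation term is an assumption, since the abstract does not state the loss form.

```python
def toy_block(x):
    """Hypothetical stand-in for one weight-shared transformer block."""
    return [0.5 * v + 1.0 for v in x]

def looped_forward(block, x, num_loops):
    """Apply the same block num_loops times, keeping every intermediate
    output so that inference can stop early (any-time inference)."""
    outputs = []
    for _ in range(num_loops):
        x = block(x)
        outputs.append(x)
    return outputs

def ilsd_loss(outputs, student_loops):
    """Distance between the output after student_loops iterations and the
    output after the maximum number of loops (the teacher). MSE is an
    assumed choice; a real implementation would typically also stop
    gradients through the teacher."""
    teacher = outputs[-1]
    student = outputs[student_loops - 1]
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)

outputs = looped_forward(toy_block, [0.0], num_loops=3)
loss_one_loop = ilsd_loss(outputs, student_loops=1)
loss_max_loops = ilsd_loss(outputs, student_loops=3)
```

Because every loop reuses the same parameters, running fewer loops at inference lowers compute without changing the parameter count, which matches the trade-off the abstract describes.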
[74] UniSemAlign: Text-Prototype Alignment with a Foundation Encoder for Semi-Supervised Histopathology Segmentation cs.CVPDF
Le-Van Thai, Tien Dat Nguyen, Hoai Nhan Pham, Lan Anh Dinh Thi, Duy-Dong Nguyen
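TL;DR: This paper proposes UniSemAlign, a dual-modal semantic alignment framework for semi-supervised semantic segmentation in computational pathology. Built on a pathology-pretrained Transformer encoder, it introduces complementary prototype-level and text-level alignment branches that inject explicit class-level structure into a shared embedding space, reducing class ambiguity and stabilizing pseudo-label refinement. The aligned representations are fused with visual predictions to produce more reliable supervision for unlabeled histopathology images.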
Details
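Motivation: Semi-supervised semantic segmentation in computational pathology is hampered by scarce pixel-level annotations and unreliable pseudo-label supervision. The paper addresses this by injecting explicit class-level structure to strengthen visual segmentation.
Result: Extensive experiments on the GlaS and CRAG datasets show that UniSemAlign substantially outperforms recent semi-supervised baselines under limited supervision. With only 10% labeled data, Dice improves by up to 2.6% on GlaS and up to 8.6% on CRAG; strong gains are also observed at 20% supervision.
Insight: The novelty is a dual-modal semantic alignment framework combining prototype-level and text-level alignment, which injects explicit semantic structure into pixel-wise learning to stabilize pseudo-label refinement. Viewed objectively, this multimodal alignment strategy offers a new form of structured guidance for semi-supervised segmentation, and its use of a pretrained foundation encoder with cross-modal alignment to strengthen representation learning is worth borrowing.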
Abstract: Semi-supervised semantic segmentation in computational pathology remains challenging due to scarce pixel-level annotations and unreliable pseudo-label supervision. We propose UniSemAlign, a dual-modal semantic alignment framework that enhances visual segmentation by injecting explicit class-level structure into pixel-wise learning. Built upon a pathology-pretrained Transformer encoder, UniSemAlign introduces complementary prototype-level and text-level alignment branches in a shared embedding space, providing structured guidance that reduces class ambiguity and stabilizes pseudo-label refinement. The aligned representations are fused with visual predictions to generate more reliable supervision for unlabeled histopathology images. The framework is trained end-to-end with supervised segmentation, cross-view consistency, and cross-modal alignment objectives. Extensive experiments on the GlaS and CRAG datasets demonstrate that UniSemAlign substantially outperforms recent semi-supervised baselines under limited supervision, achieving Dice improvements of up to 2.6% on GlaS and 8.6% on CRAG with only 10% labeled data, and strong improvements at 20% supervision. Code is available at: https://github.com/thailevann/UniSemAlign
[75] Vision Transformers for Preoperative CT-Based Prediction of Histopathologic Chemotherapy Response Score in High-Grade Serous Ovarian Carcinoma cs.CV | cs.AIPDF
Francesca Fati, Felipe Coutinho, Marika Reinius, Marina Rosanu, Gabriel Funingana
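TL;DR: This study proposes a 2.5D multimodal deep learning framework based on a pre-trained Vision Transformer for preoperative prediction of the histopathologic Chemotherapy Response Score (CRS) to neoadjuvant chemotherapy (NACT) in patients with high-grade serous ovarian carcinoma (HGSOC). The framework combines features from lesion-dense omental CT slices with clinical variables, aiming to give multidisciplinary teams (MDTs) an early, non-invasive prediction of treatment response to support decision-making.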
Details
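Motivation: HGSOC is highly heterogeneous and often diagnosed at an advanced stage. The CRS after NACT is a validated histopathological biomarker, but it is only available postoperatively. The study investigates whether pre-treatment CT imaging and clinical data can predict CRS preoperatively, as an investigational decision-support tool to help MDTs assess expected treatment response.
Result: On the internal test set (IEO, n=41), the multimodal model integrating imaging and clinical data achieved a ROC-AUC of 0.95, 95% accuracy, and 80% precision. On the external test set (OV04, n=70), performance dropped but still reached a ROC-AUC of 0.68, 67% accuracy, and 75% precision.
Insight: The novelty is a 2.5D multimodal framework that combines a pre-trained Vision Transformer encoder with an intermediate fusion module, fusing visual representations from CT with clinical variables to predict chemotherapy response preoperatively. Viewed objectively, the study shows the feasibility and potential of transformer-based deep learning for preoperative biomarker prediction from routine clinical data and CT imaging, in particular applying Vision Transformer architectures in medical image analysis for end-to-end feature learning and multimodal integration.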
Abstract: Purpose. High-grade serous ovarian carcinoma (HGSOC) is characterized by pronounced biological and spatial heterogeneity and is frequently diagnosed at an advanced stage. Neoadjuvant chemotherapy (NACT) followed by delayed primary surgery is commonly employed in patients unsuitable for primary cytoreduction. The Chemotherapy Response Score (CRS) is a validated histopathological biomarker of response to NACT, but it is only available postoperatively. In this study, we investigate whether pre-treatment computed tomography (CT) imaging and clinical data can be used to predict CRS as an investigational decision-support adjunct to inform multidisciplinary team (MDT) discussions regarding expected treatment response. Methods. We proposed a 2.5D multimodal deep learning framework that processes lesion-dense omental slices using a pre-trained Vision Transformer encoder and integrates the resulting visual representations with clinical variables through an intermediate fusion module to predict CRS. Results. Our multimodal model, integrating imaging and clinical data, achieved a ROC-AUC of 0.95 alongside 95% accuracy and 80% precision on the internal test cohort (IEO, n=41 patients). On the external test set (OV04, n=70 patients), it achieved a ROC-AUC of 0.68, alongside 67% accuracy and 75% precision. Conclusion. These preliminary results demonstrate the feasibility of transformer-based deep learning for preoperative prediction of CRS in HGSOC using routine clinical data and CT imaging. As an investigational, pre-treatment decision-support tool, this approach may assist MDT discussions by providing early, non-invasive estimates of treatment response.
[76] Visually-Guided Policy Optimization for Multimodal Reasoning cs.CV | cs.AI | cs.CLPDF
Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang
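TL;DR: This paper proposes Visually-Guided Policy Optimization (VGPO), a framework that addresses insufficient visual attention and temporal visual forgetting in vision-language models during reinforcement learning. VGPO strengthens visual cues with a visual attention compensation mechanism and applies a dual-grained advantage re-weighting strategy, achieving better visual activation and performance on mathematical multimodal reasoning and visual-dependent tasks.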
Details
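Motivation: Reinforcement learning with verifiable rewards (RLVR) has improved the reasoning ability of vision-language models, but these models remain text-dominated, which leads to insufficient visual faithfulness: attention activation on visual tokens is sparse, and temporal visual forgetting occurs along the reasoning steps.
Result: Extensive experiments on mathematical multimodal reasoning and visual-dependent tasks show that VGPO achieves better visual activation and superior performance.
Insight: The novelty is a visual attention compensation mechanism that localizes and amplifies visual cues to counteract visual forgetting, plus a dual-grained (intra-trajectory and inter-trajectory) advantage re-weighting strategy that reinforces visual focus. Viewed objectively, the method offers a systematic remedy for the core problem of underused visual information in VLMs.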
Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks.
[77] Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts cs.CV | cs.CLPDF
Farhad Nooralahzadeh, Omid Rohanian, Yi Zhang, Jonathan Fürst, Kurt Stockinger
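TL;DR: This paper investigates why vision-language models fail when visual and linguistic information conflict, and finds that the main problem is arbitration rather than a perceptual deficit. Using Multimodal Arbitration Crossover analysis and layer-by-layer Logit Lens probing, it shows that visual evidence is already effectively encoded in early layers but that the final decision is disrupted by prior knowledge. With activation patching and training-free activation steering, the study shows that early-layer interventions can improve visual grounding.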
Details
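Motivation: To identify the root cause of errors when vision-language models face conflicting visual and linguistic information (for example, seeing a blue banana but answering "yellow"), and to determine whether the failure lies in perception or in decision arbitration.
Result: Across ten VLMs of different sizes, visual attributes are linearly decodable from early layers (AUC > 0.86), with no significant difference in encoding strength between correct and incorrect samples. Full-sequence activation patching changes 60-84% of outputs, and early-layer activation steering, either linear or sparse-autoencoder-guided, improves visual grounding by up to +3.8%.
Insight: The novelty is identifying the Encoding-Grounding Dissociation and developing MAC analysis together with layer-wise causal interventions. Viewed objectively, the study emphasizes that the visual encoding ability of VLMs is already sufficient and that the core bottleneck is the arbitration of multimodal signals, which points to targeted interventions as a new direction for improving model reliability.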
Abstract: When a Vision-Language Model (VLM) sees a blue banana and answers “yellow”, is the problem of perception or arbitration? We explore the question in ten VLMs with various sizes and reveal an Encoding–Grounding Dissociation: models that fail to report what they see (and thus provide a wrong answer) still encode the visual evidence as strongly as models that provide the correct answer. Using Multimodal Arbitration Crossover (MAC) analysis with layer-by-layer Logit Lens probing, we track the competition between visual and prior signals across every layer of each model. We show that visual attributes can be linearly decodable from early layers (AUC > 0.86). The accuracy remains nearly identical for both successful and failed samples. However, the gap in the final-layer logit – not the strength of encoding – better predicts grounding outcomes with a correlation of . After having studied when VLMs base their answers on image clues rather than prior knowledge, we want to understand the causal relationships. We establish causality through full-sequence activation patching. The standard last-token interventions in LLM interpretability do not affect VLMs. In contrast, replacing the full token sequence at layers identified by MAC alters 60 to 84% of outputs. Partial-token decomposition shows that image tokens carry almost all of the causal impact, while text tokens have none. Scaling addresses the remaining architectural differences to achieve perfect retention. Moving from diagnosis to intervention, we show that training-free activation steering – both linear and sparse autoencoder-guided – in early layers can improve visual grounding by up to +3.8% with degrading performance in some setups. Overall, these findings lead to a clear conclusion: VLMs already see well, but the challenge is acting on what they see. Targeted interventions can help to bridge this gap.
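The layer-by-layer Logit Lens probing described above can be sketched as follows: each layer's hidden state is projected through the output (unembedding) matrix, and the logit of the image-supported answer is compared with the logit of the prior-supported answer. This is a simplified sketch under stated assumptions: all names and the toy numbers are hypothetical, the final layer normalization that a real Logit Lens applies before unembedding is omitted, and the crossover rule is only one plausible reading of "crossover", not the paper's MAC definition.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def logit_gap_per_layer(hidden_states, unembed, visual_id, prior_id):
    """For each layer, logit(visual answer) minus logit(prior answer),
    where unembed[token_id] is that token's output-embedding row."""
    return [dot(unembed[visual_id], h) - dot(unembed[prior_id], h)
            for h in hidden_states]

def first_crossover(gaps):
    """Index of the first layer whose gap has the opposite sign to the
    previous layer's gap, or None if the sign never flips."""
    for i in range(1, len(gaps)):
        if gaps[i - 1] * gaps[i] < 0:
            return i
    return None

# Toy example: token 0 is the visual answer ("blue"), token 1 the prior ("yellow").
unembed = [[1.0, 0.0], [0.0, 1.0]]
hidden_states = [[1.0, 0.0], [0.6, 0.4], [0.2, 0.8]]  # one vector per layer
gaps = logit_gap_per_layer(hidden_states, unembed, visual_id=0, prior_id=1)
crossover_layer = first_crossover(gaps)
```

In this toy run the visual answer leads in the first two layers and the prior overtakes it in the last one, mirroring the failure pattern the abstract describes: the evidence is encoded early, but the final-layer gap decides the output.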
[78] CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation cs.CVPDF
Haoyu Zhao, Zihao Zhang, Jiaxi Gu, Haoran Chen, Qingping Zheng
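TL;DR: This paper proposes CT-1, a Vision-Language-Camera model that transfers spatial reasoning knowledge to camera-controllable video generation. The model estimates camera trajectories accurately, learns complex trajectory distributions in the frequency domain with a wavelet-based regularization loss, and integrates the trajectories into a video diffusion model to achieve spatially aware camera control aligned with user intent.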
Details
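Motivation: Existing camera-controllable video generation methods either provide imprecise control from text prompts or depend on manually specified camera trajectory parameters, which limits their use in automated scenarios. The paper aims for more precise, automated camera control.
Result: Experiments show that the framework bridges the gap between spatial reasoning and video synthesis, producing faithful, high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.
Insight: The main contributions are: 1) a dedicated Vision-Language-Camera model, CT-1, for transferring spatial reasoning knowledge; 2) a wavelet-based regularization loss in the frequency domain for learning complex camera trajectory distributions; 3) a large-scale dataset, CT-200K, to support model training.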
Abstract: Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful and high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.
[79] VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning cs.CV | cs.AI | cs.CLPDF
Wenyi Xiao, Xinchi Xu, Leilei Gan
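TL;DR: This paper proposes VL-Calibration, a reinforcement learning framework for large vision-language models (LVLMs) that decouples visual confidence from reasoning confidence, addressing the hallucinations and incorrect responses that LVLMs often produce with high confidence during reasoning. The method estimates visual certainty by combining visual grounding, measured by KL-divergence under image perturbations, with internal certainty, measured by token entropy, and optimizes with token-level advantage reweighting. On 13 benchmarks it improves both calibration and visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.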
Details
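Motivation: Existing confidence calibration methods, developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer correctness. This is mismatched to LVLMs: an incorrect prediction may come from a perceptual failure or from a reasoning error given correct perception, a single confidence score conflates these sources, and visual uncertainty is often dominated by language priors.
Result: Experiments on 13 benchmarks show that VL-Calibration improves calibration while also boosting visual reasoning accuracy, and that it generalizes to out-of-distribution benchmarks across model scales and architectures.
Insight: The novelty is explicitly decoupling confidence into visual and reasoning confidence, introducing a visual certainty estimate that needs no ground-truth perception labels (combining KL-divergence-based visual grounding with token-entropy-based internal certainty), and optimizing with token-level advantage reweighting to suppress ungrounded hallucinations while preserving valid perception.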
Abstract: Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.
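The two ingredients of the intrinsic visual certainty estimate, KL-divergence under image perturbations and token entropy, are standard quantities and can be computed as below. The abstract does not say how the two are combined or which perturbations are used, so this sketch stops at computing each term; the distributions and function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete next-token distributions.
    A large value means the prediction changed a lot when the image was
    perturbed, i.e. the token depends on the visual input."""
    eps = 1e-12  # guards against division by zero in q
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def token_entropy(p):
    """Shannon entropy in nats; low entropy means the model is internally certain."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical next-token distributions over a 3-token vocabulary.
p_original = [0.8, 0.1, 0.1]      # conditioned on the original image
p_perturbed = [0.4, 0.3, 0.3]     # conditioned on a perturbed image
grounding = kl_divergence(p_original, p_perturbed)
certainty_gap = token_entropy(p_perturbed) - token_entropy(p_original)
```

A token whose distribution barely moves under perturbation (KL near zero) is being driven mostly by language priors, which is the case the abstract identifies as a source of ungrounded, overconfident answers.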
[80] VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images cs.CV | cs.AI | cs.CLPDF
Guanyu Zhou, Yida Yin, Wenhao Chai, Shengbang Tong, Xingyu Fu
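TL;DR: VisionFoundry is a task-aware synthetic data generation pipeline that takes only a task name as input. It uses large language models to generate questions, answers, and text-to-image prompts, synthesizes images with text-to-image models, and verifies consistency with a proprietary vision-language model, requiring no reference images or human annotation. The authors use it to build VisionFoundry-10K, a synthetic visual question answering dataset of 10k image-question-answer triples spanning 10 tasks. Models trained on this dataset improve substantially on visual perception benchmarks.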
Details
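Motivation: Vision-language models still struggle with visual perception tasks such as spatial understanding and viewpoint recognition, and natural image datasets provide limited supervision for low-level visual skills. The paper asks whether targeted synthetic supervision, generated from only a task keyword (such as Depth Order), can address these weaknesses.
Result: The approach yields +7% on MMVP and +10% on CV-Bench-3D while preserving broader capabilities and showing favorable scaling behavior as data size increases.
Insight: The novelty is a fully automated synthetic data generation pipeline that needs only a task name, with no human annotation or reference images. The results suggest that limited task-targeted supervision is an important contributor to the VLM bottleneck and that synthetic supervision is a promising path toward more systematic VLM training.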
Abstract: Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
[81] TinyNeRV: Compact Neural Video Representations via Capacity Scaling, Distillation, and Low-Precision Inference cs.CVPDF
Muhammad Hannan Akhtar, Ihab Amer, Tamer Shanableh
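TL;DR: This paper systematically studies compact Neural Representations for Videos (NeRV) architectures for efficient deployment. It introduces two lightweight configurations, NeRV-T and NeRV-T+, and uses capacity scaling, knowledge distillation, and low-precision inference to evaluate reconstruction quality, computational complexity, and decoding throughput across multiple video datasets, probing the practical limits of deploying NeRV models in resource-constrained environments.
Details
Motivation: Existing work on neural video representations focuses on moderate- or high-capacity models, leaving the behavior of the extremely compact configurations needed for constrained environments underexplored. The paper sets out to study how to design efficient tiny NeRV architectures for practical deployment.
Result: Carefully designed tiny NeRV variants achieve favorable quality-efficiency trade-offs across multiple video datasets while substantially reducing parameter count, computational cost, and memory requirements, offering guidance for deploying NeRV-style models in resource-constrained and real-time environments.
Insight: Contributions include two lightweight NeRV configurations for a systematic capacity-scaling analysis, knowledge distillation with frequency-aware focal supervision to improve reconstruction fidelity in low-capacity networks, and a study of post-training quantization and quantization-aware training that examines how robust compact models are under reduced numerical precision.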
Abstract: Implicit neural video representations encode entire video sequences within the parameters of a neural network and enable constant-time frame reconstruction. Recent work on Neural Representations for Videos (NeRV) has demonstrated competitive reconstruction performance while avoiding the sequential decoding process of conventional video codecs. However, most existing studies focus on moderate- or high-capacity models, leaving the behavior of extremely compact configurations required for constrained environments insufficiently explored. This paper presents a systematic study of tiny NeRV architectures designed for efficient deployment. Two lightweight configurations, NeRV-T and NeRV-T+, are introduced and evaluated across multiple video datasets in order to analyze how aggressive capacity reduction affects reconstruction quality, computational complexity, and decoding throughput. Beyond architectural scaling, the work investigates strategies for improving the performance of compact models without increasing inference cost. Knowledge distillation with frequency-aware focal supervision is explored to enhance reconstruction fidelity in low-capacity networks. In addition, the impact of low-precision inference is examined through both post-training quantization and quantization-aware training to study the robustness of tiny models under reduced numerical precision. Experimental results demonstrate that carefully designed tiny NeRV variants can achieve favorable quality-efficiency trade-offs while substantially reducing parameter count, computational cost, and memory requirements. These findings provide insight into the practical limits of compact neural video representations and offer guidance for deploying NeRV-style models in resource-constrained and real-time environments. The official implementation is available at https://github.com/HannanAkhtar/TinyNeRV-Implementation.
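To make the post-training quantization step concrete, here is a minimal sketch of symmetric uniform quantization of a weight vector. It is a generic illustration and not the TinyNeRV implementation: the function names, the per-tensor scale, and the 8-bit default are all assumptions chosen for the example.

```python
from typing import List, Tuple


def quantize_symmetric(weights: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Map floats to signed integers in [-qmax, qmax] with one shared scale.

    Hypothetical helper for illustration; real pipelines often use
    per-channel scales and calibration data.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Fall back to scale 1.0 for an all-zero (or empty) tensor.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from integers and the scale."""
    return [q * scale for q in quantized]


weights = [-1.0, 0.0, 0.5, 1.0]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, which is why very small networks, with few weights to absorb the error, are an interesting stress test for reduced precision.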
[82] FashionStylist: An Expert Knowledge-enhanced Multimodal Dataset for Fashion Understanding cs.CV | cs.IRPDF
Kaidong Feng, Zhuoxuan Huang, Huizhong Guo, Yuting Jin, Xinyu Chen
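TL;DR: This paper presents FashionStylist, a multimodal dataset annotated by fashion experts to support holistic, expert-level fashion understanding. It provides professional annotations at both the item and outfit levels and supports three core tasks: outfit-to-item grounding, outfit completion, and outfit evaluation. It targets the fragmentation, task specificity, and lack of expert reasoning support in existing datasets.
Details
Motivation: Existing fashion datasets typically focus on item attributes, outfit co-occurrence, or weak textual supervision. They are fragmented and task-specific, and cannot support expert-level reasoning about outfit style, occasion, compatibility, and overall rationale, so a unified, expert-annotated benchmark is needed to advance holistic fashion understanding.
Result: Experiments show that FashionStylist serves both as a unified benchmark for multiple fashion tasks and as an effective training resource that improves grounding, completion, and outfit-level semantic evaluation in MLLM-based fashion systems.
Insight: The main contribution is a unified dataset, built through a dedicated fashion-expert annotation pipeline, that provides professional annotations at multiple levels (item and outfit), together with three representative tasks covering realistic complex scenarios (such as layering and accessories) and expert-level assessment (style, season, occasion, overall coherence), filling gaps left by existing datasets.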
Abstract: Fashion understanding requires both visual perception and expert-level reasoning about style, occasion, compatibility, and outfit rationale. However, existing fashion datasets remain fragmented and task-specific, often focusing on item attributes, outfit co-occurrence, or weak textual supervision, and thus provide limited support for holistic outfit understanding. In this paper, we introduce FashionStylist, an expert-annotated benchmark for holistic and expert-level fashion understanding. Constructed through a dedicated fashion-expert annotation pipeline, FashionStylist provides professionally grounded annotations at both the item and outfit levels. It supports three representative tasks: outfit-to-item grounding, outfit completion, and outfit evaluation. These tasks cover realistic item recovery from complex outfits with layering and accessories, compatibility-aware composition beyond co-occurrence matching, and expert-level assessment of style, season, occasion, and overall coherence. Experimental results show that FashionStylist serves not only as a unified benchmark for multiple fashion tasks, but also as an effective training resource for improving grounding, completion, and outfit-level semantic evaluation in MLLM-based fashion systems.
[83] Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization cs.CV | cs.AIPDF
Yuqin Lan, Gen Li, Yuanze Hu, Weihao Shen, Zhaoxin Fan
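TL;DR: This paper proposes Mosaic, a multimodal jailbreak attack against closed-source vision-language models that uses multi-view ensemble optimization to alleviate surrogate dependency under heterogeneous surrogate-target settings. It combines three core modules (text-side transformation, multi-view image optimization, and surrogate ensemble guidance) and achieves state-of-the-art Attack Success Rate and Average Toxicity on safety benchmarks.
Details
Motivation: Existing gradient-based adversarial optimization methods are usually optimized and evaluated under homogeneous open-source surrogate-target settings, so their effectiveness against heterogeneous commercial closed-source VLMs is unclear. The authors observe a consistent performance gap between homogeneous and heterogeneous settings (surrogate dependency) and propose Mosaic to reduce over-reliance on any single surrogate model and visual view.
Result: On safety benchmarks, Mosaic achieves state-of-the-art Attack Success Rate and Average Toxicity against commercial closed-source VLMs.
Insight: Contributions: 1) identifying the surrogate dependency phenomenon and designing a multi-view ensemble optimization framework; 2) a text-side transformation module that perturbs refusal-sensitive lexical patterns; 3) multi-view image optimization that avoids overfitting to a single visual view; 4) surrogate ensemble guidance that aggregates optimization signals from multiple surrogate VLMs to reduce surrogate-specific bias. Viewed objectively, the method extends adversarial attacks from homogeneous settings to the more practical heterogeneous closed-source scenario, which is valuable for real-world safety evaluation.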
Abstract: Vision-Language Models (VLMs) are powerful but remain vulnerable to multimodal jailbreak attacks. Existing attacks mainly rely on either explicit visual prompt attacks or gradient-based adversarial optimization. While the former is easier to detect, the latter produces subtle perturbations that are less perceptible, but is usually optimized and evaluated under homogeneous open-source surrogate-target settings, leaving its effectiveness on commercial closed-source VLMs under heterogeneous settings unclear. To examine this issue, we study different surrogate-target settings and observe a consistent gap between homogeneous and heterogeneous settings, a phenomenon we term surrogate dependency. Motivated by this finding, we propose Mosaic, a Multi-view ensemble optimization framework for multimodal jailbreak against closed-source VLMs, which alleviates surrogate dependency under heterogeneous surrogate-target settings by reducing over-reliance on any single surrogate model and visual view. Specifically, Mosaic incorporates three core components: a Text-Side Transformation module, which perturbs refusal-sensitive lexical patterns; a Multi-View Image Optimization module, which updates perturbations under diverse cropped views to avoid overfitting to a single visual view; and a Surrogate Ensemble Guidance module, which aggregates optimization signals from multiple surrogate VLMs to reduce surrogate-specific bias. Extensive experiments on safety benchmarks demonstrate that Mosaic achieves state-of-the-art Attack Success Rate and Average Toxicity against commercial closed-source VLMs.
[84] GeRM: A Generative Rendering Model From Physically Realistic to Photorealistic cs.CVPDF
Jiayuan Lu, Rengan Xie, Xuancheng Jin, Zhizhen Wu, Qi Ye
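TL;DR: This paper proposes GeRM, a generative rendering model that aims to bridge the gap between Physically-Based Rendering (PBR) and Photorealistic Rendering (PRR). It integrates physical attributes such as G-buffers with text prompts and uses a learned distribution transfer vector field (DTV Field) to enable controllable image generation from physical realism to perceptual realism.
Details
Motivation: Address the gap between PBR and PRR (the P2P gap): PBR depends on realistic digital models of the real world that are hard to obtain, while implicit generative models sacrifice controllability and geometric consistency. The paper aims to unify the two and achieve controllable photorealistic rendering.
Result: The paper builds the P2P-50K dataset and proposes a multi-condition ControlNet to learn the DTV Field, which synthesizes PBR images and progressively transitions them into PRR images. The abstract does not report specific quantitative benchmark results or SOTA comparisons.
Insight: Contributions include posing the P2P gap problem for the first time and building the expert-guided paired dataset P2P-50K, and proposing GeRM, which unifies PBR and PRR through a distribution transfer vector field and supports multi-condition controllable generation from physical attributes and text prompts, letting users move continuously between physical fidelity and perceptual photorealism.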
Abstract: For decades, Physically-Based Rendering (PBR) has been the foundation of synthesizing photorealistic images, and is therefore sometimes loosely referred to as Photorealistic Rendering (PRR). While PBR is indeed a mathematical simulation of light transport that guarantees physical reality, photorealism additionally relies on realistic digital models of the geometry and appearance of the real world, leaving a barely explored gap from PBR to PRR (P2P). Consequently, the path toward photorealism faces a critical dilemma: explicit simulation of PRR is encumbered by unattainable realistic digital models of the real world, while implicit generation models sacrifice controllability and geometric consistency. Based on this insight, this paper presents the problem, data, and approach of mitigating the P2P gap, followed by the first multi-modal generative rendering model, dubbed GeRM, to unify PBR and PRR. GeRM integrates physical attributes such as G-buffers with text prompts and progressive incremental injection to generate controllable photorealistic images, allowing users to fluidly navigate the continuum between strict physical fidelity and perceptual photorealism. Technically, we model the transition between PBR and PRR images as a distribution transfer and aim to learn a distribution transfer vector field (DTV Field) to guide this process. To define the learning objective, we first leverage a multi-agent VLM framework to construct an expert-guided pairwise P2P transfer dataset, named P2P-50K, where each paired sample in the dataset corresponds to a transfer vector in the DTV Field. Subsequently, we propose a multi-condition ControlNet to learn the DTV Field, which synthesizes PBR images and progressively transitions them into PRR images, guided by G-buffers, text prompts, and cues for enhanced regions.
[85] VAGNet: Vision-based accident anticipation with global features cs.CVPDF
Vipooshan Vipulananthan, Charith D. Chitraranjan
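TL;DR: This paper proposes VAGNet, a deep neural network that anticipates traffic accidents from dashcam video. The method uses global features of traffic scenes without explicit object-level feature extraction, which lowers computational complexity. The network combines transformer and graph modules and uses the vision foundation model VideoMAE-V2 for global feature extraction. Experiments on four benchmark datasets show that it outperforms existing methods in average precision and mean time-to-accident while being more computationally efficient.
Details
Motivation: Traffic accidents are a leading cause of death and injury worldwide, so anticipating hazardous situations in advance is essential. Existing methods typically extract features from every detected object, which is computationally expensive and hard to run in real time. The paper aims to develop an accident anticipation method based on global features that reduces computational cost and improves real-time performance.
Result: Experiments on the four benchmark datasets DAD, DoTA, DADA, and Nexar show that VAGNet outperforms existing methods on average precision and mean time-to-accident while being more computationally efficient, giving a better balance of performance and efficiency.
Insight: The contribution is to drop conventional object-level feature extraction and instead predict accidents from global scene features extracted by a vision foundation model, combined with transformer and graph neural network modules. This keeps accuracy high while substantially improving computational efficiency, offering a feasible approach for real-time driver assistance systems.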
Abstract: Traffic accidents are a leading cause of fatalities and injuries across the globe. Therefore, the ability to anticipate hazardous situations in advance is essential. Automated accident anticipation enables timely intervention through driver alerts and collision avoidance maneuvers, forming a key component of advanced driver assistance systems. In autonomous driving, such predictive capabilities support proactive safety behaviors, such as initiating defensive driving and human takeover when required. Using dashcam video as input offers a cost-effective solution, but it is challenging due to the complexity of real-world driving scenes. Accident anticipation systems need to operate in real-time. However, current methods involve extracting features from each detected object, which is computationally intensive. We propose VAGNet, a deep neural network that learns to predict accidents from dash-cam video using global features of traffic scenes without requiring explicit object-level features. The network consists of transformer and graph modules, and we use the vision foundation model VideoMAE-V2 for global feature extraction. Experiments on four benchmark datasets (DAD, DoTA, DADA, and Nexar) show that our method anticipates accidents with higher average precision and mean time-to-accident while being computationally more efficient compared to existing methods.
[86] Structure-Aware Fine-Grained Gaussian Splatting for Expressive Avatar Reconstruction cs.CVPDF
Yuze Su, Hongsong Wang, Jie Gui, Liang Wang
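TL;DR: This paper proposes Structure-aware Fine-grained Gaussian Splatting (SFGS), a new method for reconstructing expressive and coherent full-body 3D human avatars from a monocular video sequence. It uses a spatial triplane and a time-aware hexplane to capture dynamic features across consecutive frames, and adds a structure-aware Gaussian module and a residual refinement module based on fine-grained hand reconstruction to improve pose and texture expression, especially details such as hand deformations. The method needs only single-stage training and outperforms state-of-the-art baselines in both quantitative and qualitative evaluations.
Details
Motivation: Reconstructing photorealistic, topology-aware human avatars from monocular video remains a significant challenge. Existing 3D human avatar modeling methods capture body motion effectively but often fail to model fine details such as hand movements and facial expressions accurately.
Result: The method outperforms state-of-the-art baselines in both quantitative and qualitative evaluations, producing high-fidelity avatars with natural motion and fine details.
Insight: Main contributions: 1) combining a spatial triplane with a time-aware hexplane to encode dynamic features; 2) a structure-aware Gaussian module that captures pose-dependent details in a spatially coherent manner; 3) a residual refinement module based on fine-grained hand reconstruction to better model hand deformations. Viewed objectively, pairing dynamic feature encoding with refinement modules dedicated to key body parts (such as the hands) may improve local detail modeling while preserving overall coherence.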
Abstract: Reconstructing photorealistic and topology-aware human avatars from monocular videos remains a significant challenge in the fields of computer vision and graphics. While existing 3D human avatar modeling approaches can effectively capture body motion, they often fail to accurately model fine details such as hand movements and facial expressions. To address this, we propose Structure-aware Fine-grained Gaussian Splatting (SFGS), a novel method for reconstructing expressive and coherent full-body 3D human avatars from a monocular video sequence. SFGS uses both a spatial-only triplane and a time-aware hexplane to capture dynamic features across consecutive frames. A structure-aware Gaussian module is designed to capture pose-dependent details in a spatially coherent manner and improve pose and texture expression. To better model hand deformations, we also propose a residual refinement module based on fine-grained hand reconstruction. Our method requires only single-stage training and outperforms state-of-the-art baselines in both quantitative and qualitative evaluations, generating high-fidelity avatars with natural motion and fine details. The code is available on GitHub: https://github.com/Su245811YZ/SFGS
[87] From Frames to Events: Rethinking Evaluation in Human-Centric Video Anomaly Detection cs.CVPDF
Narges Rashvand, Shanle Yao, Armin Danesh Pazho, Babak Rahimi Ardabili, Hamed Tabkhi
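TL;DR: This paper points out the limitations of conventional frame-level evaluation in pose-based video anomaly detection (VAD) and argues for an event-centric evaluation perspective. The authors first analyze the event structure of several VAD benchmark datasets, then propose two temporal event localization strategies: a score-refinement pipeline based on hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end dual-branch model that directly produces event-level detections. Finally, they establish the first event-based evaluation standard for VAD by adopting Temporal Action Localization metrics such as tIoU-based event matching and multi-threshold F1 evaluation. Experiments show that the performance of existing state-of-the-art models drops sharply under event-level evaluation, highlighting a large gap between frame-level and event-level evaluation.
Details
Motivation: Conventional frame-level evaluation treats video as a collection of isolated frames, which does not match how anomalies occur and are handled in the real world, namely as continuous events. In surveillance systems, what matters is the reliable detection, localization, and reporting of coherent anomalous events (contiguous temporal segments with an identifiable onset and duration), not the flagging of individual frames. Frame-level metrics cannot make this distinction, so they systematically overestimate model performance in real deployments that require actionable event-level alerts.
Result: On the NWPUC dataset, all state-of-the-art (SoTA) models exceed 52% frame-level AUC-ROC, yet their event-level localization precision falls below 10% even at the lowest threshold of tIoU=0.2, and the average event-level F1 across all thresholds is only 0.11. This quantifies the large performance gap between frame-level and event-level evaluation.
Insight: The core contribution is shifting the evaluation perspective in VAD from frames to events and establishing the first event-based VAD evaluation standard. This exposes serious limitations of existing frame-level evaluation and underscores the importance of models that handle continuous anomalous events directly. The two proposed event localization strategies (the score-refinement pipeline and the end-to-end dual-branch model) offer workable technical directions for future research. Viewed objectively, the work pushes the VAD evaluation paradigm closer to practical application needs and is methodologically significant.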
Abstract: Pose-based Video Anomaly Detection (VAD) has gained significant attention for its privacy-preserving nature and robustness to environmental variations. However, traditional frame-level evaluations treat video as a collection of isolated frames, fundamentally misaligned with how anomalies manifest and are acted upon in the real world. In operational surveillance systems, what matters is not the flagging of individual frames, but the reliable detection, localization, and reporting of a coherent anomalous event, a contiguous temporal episode with an identifiable onset and duration. Frame-level metrics are blind to this distinction, and as a result, they systematically overestimate model performance for any deployment that requires actionable, event-level alerts. In this work, we propose a shift toward an event-centric perspective in VAD. We first audit widely used VAD benchmarks, including SHT[19], CHAD[6], NWPUC[4], and HuVAD[25], to characterize their event structure. We then introduce two strategies for temporal event localization: a score-refinement pipeline with hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end Dual-Branch Model that directly generates event-level detections. Finally, we establish the first event-based evaluation standard for VAD by adapting Temporal Action Localization metrics, including tIoU-based event matching and multi-threshold F1 evaluation. Our results quantify a substantial performance gap: while all SoTA models achieve frame-level AUC-ROC exceeding 52% on the NWPUC[4], their event-level localization precision falls below 10% even at a minimal tIoU=0.2, with an average event-level F1 of only 0.11 across all thresholds. The code base for this work is available at https://github.com/TeCSAR-UNCC/EventCentric-VAD.
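The event-level metric described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' evaluation code: events are half-open frame intervals, matching is greedy in prediction order (Temporal Action Localization benchmarks usually sort predictions by confidence first), and all function names are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[int, int]  # (start_frame, end_frame), end exclusive


def binary_to_events(flags: List[int]) -> List[Event]:
    """Group runs of consecutive 1s in a per-frame decision into events."""
    events: List[Event] = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(flags)))
    return events


def tiou(a: Event, b: Event) -> float:
    """Temporal intersection-over-union of two intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def event_f1(pred: List[Event], gt: List[Event], threshold: float) -> float:
    """F1 where a prediction counts as a true positive if it matches a
    not-yet-matched ground-truth event with tIoU >= threshold."""
    matched = set()
    tp = 0
    for p in pred:
        best_j, best = -1, 0.0
        for j, g in enumerate(gt):
            if j in matched:
                continue
            value = tiou(p, g)
            if value > best:
                best_j, best = j, value
        if best_j >= 0 and best >= threshold:
            matched.add(best_j)
            tp += 1
    fp = len(pred) - tp
    fn = len(gt) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


pred_events = binary_to_events([0, 1, 1, 1, 0, 0, 1, 1, 0, 0])
gt_events = [(1, 5), (7, 10)]
f1_loose = event_f1(pred_events, gt_events, 0.2)
f1_strict = event_f1(pred_events, gt_events, 0.5)
```

In this toy case the second predicted event overlaps its ground truth with a tIoU of only 0.25, so it counts at a threshold of 0.2 but not at 0.5. A multi-threshold score would average `event_f1` over a set of such thresholds.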
[88] EpiAgent: An Agent-Centric System for Ancient Inscription Restoration cs.CVPDF
Shipeng Zhu, Ang Chen, Na Nie, Pengfei Fang, Min-Ling Zhang
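TL;DR: This paper proposes EpiAgent, an agent-centric system for restoring ancient inscriptions. Inspired by the workflow of human epigraphers, it frames restoration as a hierarchical planning problem in which an LLM-based central planner coordinates multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement, yielding a more flexible and adaptive restoration process than conventional methods.
Details
Motivation: Ancient inscriptions are cultural heritage that has long suffered environmental and human-induced damage, and restoring their visual and textual integrity is a major challenge in digital heritage preservation. Existing AI-based methods usually rely on rigid pipelines that generalize poorly to complex, heterogeneous real-world degradations.
Result: On real-world degraded inscriptions, EpiAgent achieves better restoration quality and stronger generalization than existing methods.
Insight: The contribution is to formalize restoration as a hierarchical planning problem and adopt an Observe-Conceive-Execute-Reevaluate paradigm for agent collaboration. At its core, an LLM acts as the central planner that coordinates multimodal expert agents, giving a flexible, iterative restoration process similar to that of human experts and going beyond conventional single-pass methods.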
Abstract: Ancient inscriptions, as repositories of cultural memory, have suffered from centuries of environmental and human-induced degradation. Restoring their intertwined visual and textual integrity poses one of the most demanding challenges in digital heritage preservation. However, existing AI-based approaches often rely on rigid pipelines, struggling to generalize across such complex and heterogeneous real-world degradations. Inspired by the skill-coordinated workflow of human epigraphers, we propose EpiAgent, an agent-centric system that formulates inscription restoration as a hierarchical planning problem. Following an Observe-Conceive-Execute-Reevaluate paradigm, an LLM-based central planner orchestrates collaboration among multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement. This agent-centric coordination enables a flexible and adaptive restoration process beyond conventional single-pass methods. Across real-world degraded inscriptions, EpiAgent achieves superior restoration quality and stronger generalization compared to existing methods. Our work marks an important step toward expert-level agent-driven restoration of cultural heritage. The code is available at https://github.com/blackprotoss/EpiAgent.
[89] SynFlow: Scaling Up LiDAR Scene Flow Estimation with Synthetic Data cs.CVPDF
Qingwen Zhang, Xiaomeng Zhu, Chenhan Jiang, Patric Jensfelt
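TL;DR: This paper presents SynFlow, a data generation pipeline for large-scale synthetic LiDAR scene flow data, aimed at the scarcity of dense, high-quality motion annotations in the real world. With the synthesized SynFlow-4k dataset of about 940k frames, the work shows that models trained only on synthetic data generalize well zero-shot to multiple real-world benchmarks, and that fine-tuning with a small amount of real annotation uses labels efficiently.
Details
Motivation: Reliable 3D dynamic perception needs models that can anticipate motion beyond predefined categories, but progress is held back by the scarcity of dense, high-quality motion annotations. Existing unsupervised methods struggle to close the performance gap by scaling unlabeled data because of noisy proxy signals, so the paper turns to learning robust real-world motion priors from scalable simulation.
Result: In the zero-shot setting, models trained only on the synthetic SynFlow-4k data rival in-domain supervised baselines on nuScenes and outperform state-of-the-art methods on TruckScenes by 31.8%. In addition, fine-tuning with only 5% of real-world labels surpasses models trained from scratch on all available labels.
Insight: The core contribution is a motion-oriented synthetic data generation strategy that prioritizes diverse kinematic patterns over sensor-specific realism, producing large-scale, domain-invariant scene flow prior data. This offers a route to general 3D motion estimation that does not depend on large amounts of real annotation, and demonstrates the potential of synthetic data for label efficiency and strong generalization.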
Abstract: Reliable 3D dynamic perception requires models that can anticipate motion beyond predefined categories, yet progress is hindered by the scarcity of dense, high-quality motion annotations. While self-supervision on unlabeled real data offers a path forward, empirical evidence suggests that scaling unlabeled data fails to close the performance gap due to noisy proxy signals. In this paper, we propose a shift in paradigm: learning robust real-world motion priors entirely from scalable simulation. We introduce SynFlow, a data generation pipeline that generates a large-scale synthetic dataset specifically designed for LiDAR scene flow. Unlike prior works that prioritize sensor-specific realism, SynFlow employs a motion-oriented strategy to synthesize diverse kinematic patterns across 4,000 sequences (~940k frames), termed SynFlow-4k. This represents a 34x scale-up in annotated volume over existing real-world benchmarks. Our experiments demonstrate that SynFlow-4k provides a highly domain-invariant motion prior. In a zero-shot regime, models trained exclusively on our synthetic data generalize across multiple real-world benchmarks, rivaling in-domain supervised baselines on nuScenes and outperforming state-of-the-art methods on TruckScenes by 31.8%. Furthermore, SynFlow-4k serves as a label-efficient foundation: fine-tuning with only 5% of real-world labels surpasses models trained from scratch on the full available budget. We open-source the pipeline and dataset to facilitate research in generalizable 3D motion estimation. More details can be found at https://kin-zhang.github.io/SynFlow.
[90] PhysInOne: Visual Physics Learning and Reasoning in One Suite cs.CV | cs.AI | cs.LG | cs.ROPDF
Siyuan Zhou, Hejun Wang, Hu Cheng, Jinxi Li, Dongsheng Wang
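TL;DR: PhysInOne is a large-scale synthetic dataset that addresses the severe shortage of physically grounded training data for AI systems. It contains 2 million videos covering 153,810 dynamic 3D scenes and 71 basic physical phenomena, provides comprehensive ground-truth annotations, and is validated on physics-aware video generation, future frame prediction, physical property estimation, and motion transfer.
Details
Motivation: Existing datasets usually contain only hundreds or thousands of samples, which is not enough for AI systems to learn complex physical dynamics, so a larger dataset with more comprehensive annotations is needed.
Result: Fine-tuning foundation models on PhysInOne significantly improves physical plausibility, while also exposing critical gaps in how existing models handle complex physical dynamics and estimate intrinsic properties. The dataset sets a new benchmark for physics-grounded world models in generation, simulation, and embodied AI.
Insight: The contribution is the first million-scale synthetic physics dataset, featuring multi-object interactions and complex backgrounds, with multimodal annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions, providing a unified and comprehensive evaluation platform for physical reasoning research.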
Abstract: We present PhysInOne, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multi-object interactions against complex backgrounds, with comprehensive ground-truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne’s efficacy across four emerging applications: physics-aware video generation, long-/short-term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine-tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics-grounded world models in generation, simulation, and embodied AI.
[91] Do Vision Language Models Need to Process Image Tokens? cs.CVPDF
Sambit Ghosh, R. Venkatesh Babu, Chirag Agarwal
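TL;DR: This paper systematically studies the functional role of image-token processing in vision language models (VLMs). It finds that visual representations quickly converge to a bounded-complexity regime in early layers, while textual representations keep being restructured across depth. The necessity of visual processing turns out to be task-dependent, and deep visual processing is not universally required, which challenges the current paradigm of multimodal LLM architectures.
Details
Motivation: Examine whether sustained processing of dense image tokens in VLMs is necessary, and whether visual representations evolve meaningfully from early to deep layers, in order to judge whether the computational overhead is justified.
Result: Experiments show that the entropy, intrinsic dimensionality, and trajectory curvature of visual representations stabilize in early layers, while textual representations keep changing. Visual truncation analysis shows that single-token prediction is relatively robust to truncated visual depth, but multi-token generation requires sustained access to visual representations.
Insight: The contribution is revealing that visual representations converge rapidly in VLMs, which challenges the assumption that deep visual processing is universally necessary and provides a basis for designing more efficient multimodal architectures.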
Abstract: Vision Language Models (VLMs) have achieved remarkable success by integrating visual encoders with large language models (LLMs). While VLMs process dense image tokens across deep transformer stacks (incurring substantial computational overhead), it remains fundamentally unclear whether sustained image-token processing is necessary for their performance or whether visual representations meaningfully evolve from early to later layers. In this work, we systematically investigate the functional role of image tokens in VLMs and show that visual representations rapidly converge to a bounded-complexity regime, i.e., their entropy stabilizes, intrinsic dimensionality compresses, and trajectory curvature approaches a near-constant profile. In contrast, textual representations continue to undergo substantial restructuring across depth. Once stabilized, visual representations become largely interchangeable between layers, indicating limited additional transformation in deeper stages. Further, depth-wise visual truncation reveals that the necessity of visual processing is task-dependent, where single-token predictions remain comparatively robust to truncated visual depth, but multi-token generation requires sustained access to visual representations. Under deterministic decoding, reducing visual depth perturbs intermediate reasoning trajectories more strongly than final outputs, suggesting that image tokens influence the structure of reasoning more than the ultimate conclusions. Collectively, these findings question the assumption that deeper visual processing is uniformly essential in VLMs, challenging the current paradigm of multimodal LLM architectures.
[92] Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories cs.CV | cs.AI | cs.LGPDF
Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng
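TL;DR: This paper proposes "Rays as Pixels", a video diffusion model that jointly learns the distribution of videos and camera trajectories. By representing cameras as dense ray pixels and denoising them jointly with video frames through a decoupled self-cross attention mechanism, a single model handles three tasks: predicting camera trajectories from video, jointly generating video and camera trajectory from input images, and generating video along a target trajectory.
Details
Motivation: Computer vision and graphics have traditionally treated recovering camera parameters from images and rendering scenes from novel viewpoints as separate tasks. This separation breaks down when image coverage is sparse or poses are ambiguous, because each task needs what the other produces. The paper aims to bridge this divide with a unified model.
Result: The paper reports results on pose estimation and camera-controlled video generation, and evaluates the model with a closed-loop self-consistency test showing that its forward predictions (generation) and inverse predictions (estimation) agree. Trajectory prediction needs far fewer denoising steps than video generation.
Insight: The core contribution is a diffusion framework that jointly learns the distribution of videos and camera trajectories, representing camera parameters as "ray pixels" processed together with video frames. Its decoupled self-cross attention mechanism and closed-loop self-consistency evaluation are worth borrowing, and it offers a new approach to 3D reconstruction and generation from sparse views.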
Abstract: Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task needs what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. We represent each camera as dense ray pixels (raxels) and denoise them jointly with video frames through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, jointly generating video and camera trajectory from input images, and generating video from input images along a target camera trajectory. Because the model can both predict trajectories from a video and generate views conditioned on its own predictions, we evaluate it through a closed-loop self-consistency test, demonstrating that its forward and inverse predictions agree. Notably, trajectory prediction requires far fewer denoising steps than video generation; even a few denoising steps suffice for self-consistency. We report results on pose estimation and camera-controlled video generation.
[93] Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement cs.CVPDF
Zhengxian Yang, Shengqi Wang, Shi Pan, Hongshuai Li, Haoxiang Wang
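TL;DR: This paper introduces Immersive Volumetric Video, a new volumetric media format intended to give virtual reality large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To build it, the authors create the multi-view, multi-modal dataset ImViD and develop a dynamic light field reconstruction framework based on a Gaussian spatio-temporal representation, along with the first method for reconstructing sound fields from multi-view audiovisual data, forming a unified pipeline for immersive volumetric video production.
Details
Motivation: Fully immersive 6-DoF audiovisual interactive experiences are currently achieved mainly through computer-generated content, while building them directly from real-world captured video remains underexplored. The paper addresses this gap by exploring how to construct immersive volumetric video from real captured data.
Result: In extensive benchmarks and immersive VR experiments, the proposed pipeline produces high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. The ImViD dataset offers richer spatial, temporal, and multimodal coverage than existing benchmarks.
Insight: Contributions: 1) defining the new Immersive Volumetric Video format; 2) proposing a space-oriented capture philosophy and building a corresponding multimodal dataset; 3) developing a Gaussian-based dynamic light field reconstruction framework with flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision; 4) proposing the first method for sound field reconstruction from multi-view audiovisual data. Together these form a unified end-to-end production pipeline.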
Abstract: Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. While such experiences can be achieved through computer-generated content, constructing them directly from real-world captured videos remains largely unexplored. We introduce Immersive Volumetric Videos, a new volumetric media format designed to provide large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To support IVV construction, we present ImViD, a multi-view, multi-modal dataset built upon a space-oriented capture philosophy. Our custom capture rig enables synchronized multi-view video-audio acquisition during motion, facilitating efficient capture of complex indoor and outdoor scenes with rich foreground–background interactions and challenging dynamics. The dataset provides 5K-resolution videos at 60 FPS with durations of 1-5 minutes, offering richer spatial, temporal, and multimodal coverage than existing benchmarks. Leveraging this dataset, we develop a dynamic light field reconstruction framework built upon a Gaussian-based spatio-temporal representation, incorporating flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision for robust and accurate modeling of complex motion. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data. Together, these components form a unified pipeline for immersive volumetric video production. Extensive benchmarks and immersive VR experiments demonstrate that our pipeline generates high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. This work provides both a foundational definition and a practical construction methodology for immersive volumetric videos.
[94] Incremental Semantics-Aided Meshing from LiDAR-Inertial Odometry and RGB Direct Label Transfer cs.CV | cs.ROPDF
Muhammad Affan, Ville Lehtola, George Vosselman
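TL;DR: This paper proposes a modular, incremental RGB+LiDAR pipeline for reconstructing high-quality meshes from LiDAR-inertial scans in large, complex indoor environments. Using scan frame-based direct label transfer, a vision foundation model labels RGB frames, the semantic labels are incrementally projected and fused onto the LiDAR-inertial odometry map, and the final mesh is produced by semantics-aware Truncated Signed Distance Function fusion followed by marching cubes.
Details
Motivation: Address the holes, over-smoothing, and spurious boundary surfaces that point cloud sparsity, geometric drift, and fixed fusion parameters cause in mesh reconstruction of large, complex indoor environments such as cultural buildings.
Result: Quantitative geometric evaluation on the Oxford Spires dataset shows that the method outperforms the state-of-the-art geometric baselines ImMesh and Voxblox in geometric reconstruction quality. Results on the NTU VIRAL dataset are analyzed qualitatively.
Insight: The contribution is a frame-level semantics-aided fusion strategy that preserves the geometric fidelity of LiDAR while using rich visual semantics to resolve ambiguities at reconstruction boundaries caused by point cloud sparsity and geometric drift, producing semantically labelled high-quality meshes suited to XR and digital modeling.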
Abstract: Geometric high-fidelity mesh reconstruction from LiDAR-inertial scans remains challenging in large, complex indoor environments – such as cultural buildings – where point cloud sparsity, geometric drift, and fixed fusion parameters produce holes, over-smoothing, and spurious surfaces at structural boundaries. We propose a modular, incremental RGB+LiDAR pipeline that generates incremental semantics-aided high-quality meshes from indoor scans through scan frame-based direct label transfer. A vision foundation model labels each incoming RGB frame; labels are incrementally projected and fused onto a LiDAR-inertial odometry map; and an incremental semantics-aware Truncated Signed Distance Function (TSDF) fusion step produces the final mesh via marching cubes. This frame-level fusion strategy preserves the geometric fidelity of LiDAR while leveraging rich visual semantics to resolve geometric ambiguities at reconstruction boundaries caused by LiDAR point-cloud sparsity and geometric drift. We demonstrate that semantic guidance improves geometric reconstruction quality; quantitative evaluation is therefore performed using geometric metrics on the Oxford Spires dataset, while results from the NTU VIRAL dataset are analyzed qualitatively. The proposed method outperforms state-of-the-art geometric baselines ImMesh and Voxblox, demonstrating the benefit of semantics-aided fusion for geometric mesh quality. The resulting semantically labelled meshes are of value when reconstructing Universal Scene Description (USD) assets, offering a path from indoor LiDAR scanning to XR and digital modeling.
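For readers unfamiliar with TSDF fusion, the sketch below shows the standard weighted running-average update for one voxel, extended with a simple per-voxel label vote. The label vote is an assumption made for illustration only; the abstract does not say how semantics enter the fusion step, and the `Voxel`, `integrate`, and `majority_label` names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Voxel:
    tsdf: float = 1.0      # normalized truncated signed distance in [-1, 1]
    weight: float = 0.0    # accumulated observation weight
    label_votes: Dict[str, int] = field(default_factory=dict)


def integrate(voxel: Voxel, signed_dist: float, trunc: float,
              label: Optional[str] = None, obs_weight: float = 1.0) -> None:
    """Fuse one observation into the voxel with a weighted running average."""
    # Truncate and normalize the signed distance to [-1, 1].
    d = max(-1.0, min(1.0, signed_dist / trunc))
    new_weight = voxel.weight + obs_weight
    voxel.tsdf = (voxel.tsdf * voxel.weight + d * obs_weight) / new_weight
    voxel.weight = new_weight
    # Assumed extension: count semantic labels transferred from RGB frames.
    if label is not None:
        voxel.label_votes[label] = voxel.label_votes.get(label, 0) + 1


def majority_label(voxel: Voxel) -> Optional[str]:
    """Return the most frequently observed label, or None if unlabeled."""
    if not voxel.label_votes:
        return None
    return max(voxel.label_votes, key=voxel.label_votes.get)


v = Voxel()
integrate(v, 0.05, trunc=0.1, label="wall")
integrate(v, -0.05, trunc=0.1, label="wall")
integrate(v, 0.30, trunc=0.1, label="floor")
```

A semantics-aware variant could, for example, vary `trunc` or `obs_weight` by label near structural boundaries. That is speculation about one possible design and is not a description of the paper's method.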
[95] VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning cs.CV | cs.AIPDF
Yucheng Shen, Jiulong Wu, Jizhou Huang, Dawei Yin, Lingyong Yan
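TL;DR: This paper proposes VISOR, a single-agent framework for visual retrieval-augmented generation that targets multi-step reasoning over complex visual queries. Through a structured evidence space, a visual action evaluation and correction mechanism, and dynamic trajectory management, VISOR addresses the bottlenecks of sparse visual evidence and long-horizon search drift in existing methods.
Details
Motivation: Existing agentic visual retrieval-augmented generation systems face two key bottlenecks: visual evidence sparsity (key evidence is scattered across pages and processed in isolation, which hinders cross-page reasoning, and fine-grained intra-image evidence requires precise visual actions whose misuse degrades retrieval quality) and search drift in long horizons (visual tokens accumulated across pages dilute context and cause cognitive overload, leading the agent away from its search objective).
Result: Extensive experiments on ViDoSeek, SlideVQA, and MMLongBench show that VISOR achieves state-of-the-art performance on long-horizon visual reasoning tasks with superior efficiency.
Insight: Contributions: 1) a structured evidence space for progressive cross-page reasoning; 2) a visual action evaluation and correction mechanism for managing visual actions; 3) a dynamic trajectory strategy combining a sliding window with intent injection to mitigate search drift; 4) a reinforcement learning pipeline based on Group Relative Policy Optimization, with state masking and credit assignment, designed for dynamic context reconstruction. Together these mechanisms address visual evidence integration and context management in long-sequence reasoning.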
Abstract: Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi-step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval. However, existing agentic VRAG faces two critical bottlenecks. (1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross-page reasoning; moreover, fine-grained intra-image evidence often requires precise visual actions, whose misuse degrades retrieval quality; (2) Search Drift in Long Horizons: the accumulation of visual tokens across retrieved pages dilutes context and causes cognitive overload, leading agents to deviate from their search objective. To address these challenges, we propose VISOR (Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning), a unified single-agent framework. VISOR features a structured Evidence Space for progressive cross-page reasoning, coupled with a Visual Action Evaluation and Correction mechanism to manage visual actions. Additionally, we introduce a Dynamic Trajectory with Sliding Window and Intent Injection to mitigate search drift. These mechanisms anchor the evidence space while discarding earlier raw interactions, preventing context from being overwhelmed by visual tokens. We train VISOR using a Group Relative Policy Optimization-based Reinforcement Learning (GRPO-based RL) pipeline with state masking and credit assignment tailored for dynamic context reconstruction. Extensive experiments on ViDoSeek, SlideVQA, and MMLongBench demonstrate that VISOR achieves state-of-the-art performance with superior efficiency for long-horizon visual reasoning tasks.
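One way to picture the sliding-window trajectory with intent injection is the toy context builder below. It is a loose sketch assuming plain-text steps and a fixed window size; the `Trajectory` class, its fields, and the string prefixes are hypothetical and are not taken from VISOR.

```python
from collections import deque
from typing import Deque, List


class Trajectory:
    """Keeps the search intent and distilled evidence permanently,
    but only the most recent raw interactions."""

    def __init__(self, intent: str, window: int = 2) -> None:
        self.intent = intent
        self.evidence: List[str] = []
        # deque with maxlen silently drops the oldest raw step when full.
        self.recent: Deque[str] = deque(maxlen=window)

    def step(self, raw_interaction: str, new_evidence: str = "") -> None:
        """Record one agent step and any evidence distilled from it."""
        self.recent.append(raw_interaction)
        if new_evidence:
            self.evidence.append(new_evidence)

    def context(self) -> List[str]:
        """Rebuild the prompt context: intent first, then evidence, then window."""
        lines = [f"INTENT: {self.intent}"]
        lines += [f"EVIDENCE: {e}" for e in self.evidence]
        lines += [f"STEP: {s}" for s in self.recent]
        return lines


t = Trajectory("find Q3 revenue", window=2)
t.step("search page 1", "p1: revenue table")
t.step("crop p1")
t.step("search page 7", "p7: Q3 figure located")
ctx = t.context()
```

Re-stating the intent at the top of every rebuilt context is one simple reading of "intent injection". Older raw steps fall out of the window while the evidence they produced persists.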
[96] RIRF: Reasoning Image Restoration Framework cs.CVPDF
Wending Yan, Rongkai Zhang, Kaihua Tang, Yu Cheng, Qiankun Liu
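TL;DR: This paper proposes Reason and Restore (R&R), a new universal image restoration framework that integrates structured chain-of-thought reasoning into the restoration pipeline. An explicit reasoner (a fine-tuned Qwen3-VL) diagnoses degradation types, quantifies degradation severity, infers key degradation-related factors, and describes scene semantics, giving the restorer interpretable, fine-grained priors. The quantified severity is also used as a reinforcement learning signal to guide the restorer, tightly coupling semantic diagnostic reasoning with pixel-level restoration. R&R achieves state-of-the-art performance on multiple universal image restoration benchmarks.
Details
Motivation: Existing universal image restoration methods focus mainly on pixel reconstruction and lack explicit diagnostic reasoning about degradation composition, severity, and scene semantics before restoration. This paper introduces structured reasoning into the restoration pipeline to provide interpretable priors and improve restoration quality.
Result: Extensive experiments on multiple universal image restoration benchmarks show that R&R achieves state-of-the-art performance.
Insight: The main novelty is the explicit integration of structured chain-of-thought reasoning into the image restoration pipeline: a dedicated reasoner produces interpretable diagnostic priors, and the quantified degradation severity serves as a reinforcement learning signal that guides restoration. This tightly couples semantic-level reasoning with a low-level vision task (pixel restoration), unlike existing multimodal LLM-based agentic systems that usually decouple the two, and it improves performance while offering interpretability.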
Abstract: Universal image restoration (UIR) aims to recover clean images from diverse and unknown degradations using a unified model. Existing UIR methods primarily focus on pixel reconstruction and often lack explicit diagnostic reasoning over degradation composition, severity, and scene semantics prior to restoration. We propose Reason and Restore (R&R), a novel framework that integrates structured Chain-of-Thought (CoT) reasoning into the image restoration pipeline. R&R introduces an explicit reasoner, implemented by fine-tuning Qwen3-VL, to diagnose degradation types, quantify degradation severity, infer key degradation-related factors, and describe relevant scene and object semantics. The resulting structured reasoning provides interpretable and fine-grained diagnostic priors for the restorer. To further improve restoration quality, the quantified degradation severity produced by the reasoner is leveraged as reinforcement learning (RL) signals to guide and strengthen the restorer. Unlike existing multimodal LLM-based agentic systems that decouple reasoning from low-level vision tasks, R&R tightly couples semantic diagnostic reasoning with pixel-level restoration in a unified framework. Extensive experiments across diverse UIR benchmarks demonstrate that R&R achieves state-of-the-art performance while offering unique interpretability into the restoration process.
[97] Envisioning the Future, One Step at a Time cs.CV | cs.AI | cs.LGPDF
Stefan Andreas Baumann, Jannik Wiese, Tommaso Martorella, Mahdi M. Kalayeh, Björn Ommer
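TL;DR: This paper proposes an autoregressive diffusion model over sparse point trajectories for predicting open-set future scene dynamics from a single image. By reasoning over sparse trajectories step by step, it models long-horizon, multi-modal motion and can efficiently generate thousands of diverse future hypotheses while preserving physical plausibility and long-range coherence.
Details
Motivation: Most existing methods rely on dense video or latent-space prediction, spending much of their capacity on dense appearance rather than on the sparse point trajectories in the scene. This makes large-scale exploration of future hypotheses costly and limits performance on long-horizon, multi-modal motion prediction.
Result: On the proposed open-set motion prediction benchmark OWM, the method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed.
Insight: The key idea is to formulate future scene dynamics prediction as step-wise inference over sparse point trajectories. This dynamics-centric representation, with an autoregressive diffusion model that handles the growth of uncertainty, enables efficient and scalable open-set future prediction.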
Abstract: Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence. We further introduce OWM, a benchmark for open-set motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-set future prediction both scalable and practical. Project page: http://compvis.github.io/myriad.
[98] Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise cs.CV | cs.AIPDF
Zibin Geng, Xuefeng Jiang, Jia Li, Zheng Li, Tian Wen
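TL;DR: This paper proposes VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. It uses cross-modal attention to inject visual semantics back into the prompt representations, and a conditional modulation mechanism to adaptively control the injection strength. This yields a more robust balance between text-side semantic priors and image-side instance evidence, suppressing noise-induced disturbances and reducing memorization of mislabeled samples.
Details
Motivation: Prompt learning is a parameter-efficient way to adapt vision-language models, but its robustness under label noise is under-studied. Visual content carries richer, more reliable semantic information and stays more robust under label noise, whereas the prompt itself is highly susceptible to it.
Result: Extensive experiments under synthetic and real-world label noise show that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness.
Insight: The core innovation is stabilizing prompt learning by reversely injecting visual semantics through cross-modal attention, together with a lightweight conditional modulation mechanism that adaptively adjusts the injection strength. This anchors prompt learning to stable instance-level visual evidence and reduces the influence of noisy supervision.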
Abstract: Prompt learning is a parameter-efficient approach for vision-language models, yet its robustness under label noise is under-investigated. Visual content contains richer and more reliable semantic information, which remains more robust under label noise. However, the prompt itself is highly susceptible to label noise. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we exploit a cross-modal attention mechanism to reversely inject visual semantics into prompt representations. This enables the prompt tokens to selectively aggregate visual information relevant to the current sample, thereby improving robustness by anchoring prompt learning to stable instance-level visual evidence and reducing the influence of noisy supervision. To address the instability caused by injecting visual information in the same way for all samples, despite differences in the quality of their visual cues, we further introduce a lightweight conditional modulation mechanism to adaptively control the strength of visual information injection, which strikes a more robust balance between text-side semantic priors and image-side instance evidence. The proposed framework effectively suppresses noise-induced disturbances, reduces instability in prompt updates, and alleviates memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small number of additional trainable parameters. Extensive experiments under synthetic and real-world label noise demonstrate that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness. Our code is publicly available at https://github.com/gezbww/Vis_Prompt.
[99] EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks cs.CVPDF
Lulin Liu, Dayou Li, Yiqing Liang, Sicong Jiang, Hitesh Vijay
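TL;DR: This paper proposes EgoTL, a framework for capturing and annotating egocentric think-aloud chains for long-horizon tasks. It targets the noise that arises in long-horizon spatial instruction following because existing vision-language model (VLM) pipelines lack accurate human action labels, chain-of-thought, and spatial annotations.
Details
Motivation: In long-horizon household task planning, existing VLMs lack accurate annotations and spatial grounding for minute-long daily tasks, so reasoning chains and world-model synthesis hallucinate objects, skip steps, or get physical attributes wrong.
Result: Using the dataset built with EgoTL, VLMs and world models are benchmarked on six dimensions across more than 100 daily household tasks. Foundation models still fall short as egocentric assistants or open-world simulators. Fine-tuning them on the EgoTL training split improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.
Insight: The novelty is a "say-before-act" think-aloud capture pipeline that combines word-level timestamps, metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags. It provides high-quality, multi-level annotations for long-horizon egocentric tasks, improving the planning and spatial reasoning of foundation models.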
Abstract: Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations; these errors are amplified during long-horizon spatial instruction following. These issues stem from insufficient coverage of minute-long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we are able to benchmark VLMs and World Models on six task dimensions from three layers and long-horizon generation over minute-long sequences across over 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we finetune foundation models with human CoT aligned with metric labels on the training split of EgoTL, which improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.
[100] Tango: Taming Visual Signals for Efficient Video Large Language Models cs.CVPDF
Shukang Yin, Sirui Zhao, Hanchao Wang, Baozhi Jia, Xianquan Wang
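TL;DR: This paper proposes Tango, a framework that improves the efficiency of Video Large Language Models by refining token pruning. It combines diversity-driven attention-based selection with a Spatio-temporal Rotary Position Embedding to make better use of visual signals, retaining close to the original performance and accelerating inference while keeping only 10% of the video tokens.
Details
Motivation: Existing token-pruning methods for Video LLMs have two key limitations: they do not fully account for the spatially multi-modal, long-tailed attention distribution, and similarity-based clustering tends to produce fragmented clusters that distort the pooled representations.
Result: Across various Video LLMs and video understanding benchmarks, Tango preserves 98.9% of the original performance on LLaVA-OV while retaining only 10% of the video tokens, with a 1.88x inference speedup.
Insight: The contributions are a diversity-driven strategy that improves attention-based selection and a Spatio-temporal Rotary Position Embedding (ST-RoPE) that preserves geometric structure via locality priors, compressing visual tokens more efficiently without significant information loss.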
Abstract: Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and similarity-based clustering. Our study reveals two critical limitations in existing methods: (1) conventional top-k selection strategies fail to fully account for the attention distribution, which is often spatially multi-modal and long-tailed in magnitude; and (2) direct similarity-based clustering frequently generates fragmented clusters, resulting in distorted representations after pooling. To address these bottlenecks, we propose Tango, a novel framework designed to optimize the utilization of visual signals. Tango integrates a diversity-driven strategy to enhance attention-based token selection, and introduces Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure via locality priors. Comprehensive experiments across various Video LLMs and video understanding benchmarks demonstrate the effectiveness and generalizability of our approach. Notably, when retaining only 10% of the video tokens, Tango preserves 98.9% of the original performance on LLaVA-OV while delivering a 1.88x inference speedup.
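For context, the conventional top-k attention-based selection that the abstract names as the starting point can be sketched as below. This shows only the baseline being critiqued, not Tango's diversity-driven strategy or ST-RoPE; the function name, the `keep_ratio` parameter, and the scores are hypothetical.

```python
def topk_token_indices(attention_scores, keep_ratio=0.10):
    """Keep the highest-attention tokens and return their indices in original order."""
    n_keep = max(1, int(len(attention_scores) * keep_ratio))
    ranked = sorted(
        range(len(attention_scores)),
        key=lambda i: attention_scores[i],
        reverse=True,
    )
    # Re-sort the survivors so the kept tokens preserve their spatio-temporal order.
    return sorted(ranked[:n_keep])


# Hypothetical attention mass received by ten visual tokens.
scores = [0.01, 0.30, 0.02, 0.25, 0.03, 0.02, 0.20, 0.01, 0.15, 0.01]
kept = topk_token_indices(scores, keep_ratio=0.3)  # -> [1, 3, 6]
```

Because this rule looks only at magnitude, several selected tokens can come from one high-attention region while other regions are dropped entirely, which is the kind of failure on multi-modal, long-tailed attention distributions that the abstract describes.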
eess.IV [Back]
[101] MedFormer-UR: Uncertainty-Routed Transformer for Medical Image Classification eess.IV | cs.AI | cs.CV | cs.LGPDF
Mohammed Maaz Sibhai, Abedalrhman Alkhateeb, Saad B. Ahmed
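TL;DR: This paper proposes MedFormer-UR, an uncertainty-routed Transformer for medical image classification. It integrates prototype-based learning with uncertainty-guided routing and uses a Dirichlet distribution for per-token evidential uncertainty, so that ambiguity in predictions can be quantified and localized in real time. Tests across four medical imaging modalities (mammography, ultrasound, MRI, and histopathology) show significantly better calibration, with expected calibration error (ECE) reduced by up to 35%, and improved selective prediction.
Details
Motivation: Current medical Vision Transformers suffer from overconfident predictions and a lack of transparency in clinical use, especially on noisy, imbalanced clinical data. Safe clinical integration requires reliable uncertainty quantification.
Result: Across four medical imaging modalities (mammography, ultrasound, MRI, and histopathology), the method reduces expected calibration error (ECE) by up to 35% and improves selective prediction, even when accuracy gains are modest.
Insight: The novelty is that uncertainty is not only an output but an active participant in training, filtering out unreliable feature updates to improve robustness; class-specific prototypes keep the embedding space structured, supporting decisions based on visual similarity. Viewed objectively, the work combines evidential deep learning and prototype learning with Transformers, offering a more reliable and interpretable uncertainty quantification framework for medical image analysis.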
Abstract: To ensure safe clinical integration, deep learning models must provide more than just high accuracy; they require dependable uncertainty quantification. While current Medical Vision Transformers perform well, they frequently struggle with overconfident predictions and a lack of transparency, issues that are magnified by the noisy and imbalanced nature of clinical data. To address this, we enhance the modified Medical Transformer (MedFormer), which incorporates prototype-based learning and uncertainty-guided routing. By utilizing a Dirichlet distribution for per-token evidential uncertainty, our framework can quantify and localize ambiguity in real time. This uncertainty is not just an output but an active participant in the training process, filtering out unreliable feature updates. Furthermore, the use of class-specific prototypes ensures the embedding space remains structured, allowing for decisions based on visual similarity. Testing across four modalities (mammography, ultrasound, MRI, and histopathology) confirms that our approach significantly enhances model calibration, reducing expected calibration error (ECE) by up to 35%, and improves selective prediction, even when accuracy gains are modest.
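A Dirichlet-based evidential uncertainty of the kind the abstract mentions is commonly computed as in the sketch below. It assumes the standard evidential deep learning formulation (non-negative evidence per class, concentration = evidence + 1, uncertainty = number of classes / total concentration); the abstract does not state which exact variant MedFormer-UR uses, and the function name and evidence values are hypothetical.

```python
def dirichlet_uncertainty(evidence):
    """Return expected class probabilities and a scalar uncertainty for one token.

    `evidence` is a list of non-negative per-class evidence values.
    """
    alpha = [e + 1.0 for e in evidence]      # Dirichlet concentration parameters
    strength = sum(alpha)                    # total concentration S
    probs = [a / strength for a in alpha]    # expected probability of each class
    uncertainty = len(alpha) / strength      # K / S, in (0, 1]
    return probs, uncertainty


# A token with strong evidence for class 0 versus a token with almost no evidence.
confident_probs, confident_u = dirichlet_uncertainty([18.0, 0.0])
ambiguous_probs, ambiguous_u = dirichlet_uncertainty([0.5, 0.5])
```

With no evidence at all the uncertainty reaches 1, and it shrinks as evidence accumulates, so a threshold on this value is one simple way to decide which token updates to trust.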
[102] AMO-ENE: Attention-based Multi-Omics Fusion Model for Outcome Prediction in Extra Nodal Extension and HPV-associated Oropharyngeal Cancer eess.IV | cs.CVPDF
Gautier Hénique, William Le, Gabriel Dayan, Coralie Brodeur, Kristoff Nelson
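TL;DR: This paper proposes AMO-ENE, an attention-based multi-omics fusion model for predicting extranodal extension (ENE) status and treatment outcomes in human papillomavirus (HPV)-associated oropharyngeal cancer (OPC). It is a fully automated end-to-end pipeline combining radiotherapy planning CT images with clinical data: a hierarchical 3D semi-supervised segmentation model first detects and delineates nodal ENE, radiomics and deep features are then extracted for ENE grading, and finally nodal features are fused with primary tumor features in a multimodal attention framework for outcome prediction.
Details
Motivation: ENE is an important prognostic factor in HPV-positive OPC, but its clinical integration is hindered by inconsistent segmentation, low contrast at the periphery of lymph nodes on CT, and laborious manual annotation. This work aims to overcome these limitations with automated ENE assessment and outcome prediction.
Result: Validated on an internal cohort of 397 HPV-positive OPC patients, the model performs well on 2-year outcome prediction: AUCs of 88.2%, 79.2%, and 78.1% for metastatic recurrence, overall survival, and disease-free survival, with concordance indices (C-index) of 83.3%, 71.3%, and 70.0%, respectively, surpassing baseline models and indicating feasibility for clinical decision making.
Insight: The novelty is a fully automated, end-to-end multimodal prediction pipeline that combines semi-supervised lymph node segmentation, radiomics and deep feature extraction, and attention-based multi-omics (imaging and clinical) fusion, giving a dynamic, integrated solution for ENE assessment and cancer outcome prediction.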
Abstract: Extranodal extension (ENE) is an emerging prognostic factor in human papillomavirus (HPV)-associated oropharyngeal cancer (OPC), although it is currently omitted as a clinical staging criterion. Recent works have advocated for the inclusion of imaging-detected ENE (iENE) as a prognostic marker in HPV-positive OPC staging. However, several practical limitations continue to hinder its clinical integration, including inconsistencies in segmentation, low contrast in the periphery of metastatic lymph nodes on CT imaging, and laborious manual annotations. To address these limitations, we propose a fully automated end-to-end pipeline that uses computed tomography (CT) images with clinical data to assess the status of nodal ENE and predict treatment outcomes. Our approach includes a hierarchical 3D semi-supervised segmentation model designed to detect and delineate relevant iENE from radiotherapy planning CT scans. From these segmentations, a set of radiomics and deep features are extracted to train an iENE grading classifier. The predicted ENE status is then evaluated for its prognostic value and compared with existing staging criteria. Furthermore, we integrate these nodal features with primary tumor characteristics in a multimodal, attention-based outcome prediction model, providing a dynamic framework for outcome prediction. Our method is validated in an internal cohort of 397 HPV-positive OPC patients treated with radiation therapy or chemoradiotherapy between 2009 and 2020. For outcome prediction at the 2-year mark, our pipeline surpassed baseline models with 88.2% (4.8) in AUC for metastatic recurrence, 79.2% (7.4) for overall survival, and 78.1% (8.6) for disease-free survival. We also obtain a concordance index of 83.3% (6.5) for metastatic recurrence, 71.3% (8.9) for overall survival, and 70.0% (8.1) for disease-free survival, making it feasible for clinical decision making.
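The concordance index reported above measures how often a model ranks patients' risks in the same order as their observed outcomes. The sketch below is a plain implementation of Harrell's C-index under right censoring, offered as an illustration only; the paper's exact evaluation code is not described in the abstract, and the patient data here is made up.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    times  : follow-up time per patient
    events : 1 if the event was observed, 0 if the patient was censored
    risks  : predicted risk score (higher means earlier expected event)
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only if the earlier time is an observed event.
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    return concordant / comparable


# Four hypothetical patients; the second one is censored at t=10.
c_index = concordance_index(
    times=[5, 10, 15, 20],
    events=[1, 0, 1, 1],
    risks=[0.9, 0.4, 0.05, 0.1],
)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported 70.0% to 83.3% sit between the two.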
[103] Multi-task Just Recognizable Difference for Video Coding for Machines: Database, Model, and Coding Application eess.IV | cs.CV | cs.MMPDF
Junqi Liu, Yun Zhang, Xiaoxia Huang, Long Xu, Weisi Lin
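TL;DR: This paper addresses Just Recognizable Difference (JRD) modeling for Video Coding for Machines (VCM), proposing a Multi-Task JRD (MT-JRD) dataset and an Attribute-assisted MT-JRD (AMT-JRD) prediction model to improve prediction accuracy and coding efficiency in multi-task scenarios.
Details
Motivation: Existing JRD modeling applies only to single-task scenarios, limiting its potential to improve efficiency in video coding for machines. This paper targets JRD prediction for multi-task machine vision (object detection, instance segmentation, and keypoint detection).
Result: The AMT-JRD model achieves a mean absolute error of 3.781 and an error variance of 5.332 across the three tasks, improving on the state-of-the-art single-task prediction model by 6.7% and 6.3%, respectively. In the coding application, AMT-JRD-based VCM improves average BD-mAP by 3.861% and 7.886% over the VVC and JPEG baselines, respectively.
Insight: The contributions are: 1) the first MT-JRD dataset supporting multiple tasks (object detection, instance segmentation, keypoint detection); 2) the AMT-JRD model, which learns jointly across tasks via the GFEM and SFEM modules; 3) the AFFM module, which feeds object attributes (such as size and location) into JRD prediction as prior knowledge, compensating for reliance on image features alone and better modeling the perceptual mechanisms of machine vision.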
Abstract: Just Recognizable Difference (JRD) boosts coding efficiency for machine vision through visibility threshold modeling, but is currently limited to a single-task scenario. To address this issue, we propose a Multi-Task JRD (MT-JRD) dataset and an Attribute-assisted MT-JRD (AMT-JRD) model for Video Coding for Machines (VCM), enhancing both prediction accuracy and coding efficiency. First, we construct a dataset comprising 27,264 JRD annotations from machines, supporting three representative tasks including object detection, instance segmentation, and keypoint detection. Secondly, we propose the AMT-JRD prediction model, which integrates Generalized Feature Extraction Module (GFEM) and Specialized Feature Extraction Module (SFEM) to facilitate joint learning across multiple tasks. Thirdly, we innovatively incorporate object attribute information into object-wise JRD prediction through the Attribute Feature Fusion Module (AFFM), which introduces prior knowledge about object size and location. This design effectively compensates for the limitations of relying solely on image features and enhances the model’s capacity to represent the perceptual mechanisms of machine vision. Finally, we apply the AMT-JRD model to VCM, where the accurately predicted JRDs are applied to reduce the coding bit rate while preserving accuracy across multiple machine vision tasks. Extensive experimental results demonstrate that AMT-JRD achieves precise and robust multi-task prediction with a mean absolute error of 3.781 and error variance of 5.332 across three tasks, outperforming the state-of-the-art single-task prediction model by 6.7% and 6.3%, respectively. Coding experiments further reveal that compared to the baseline VVC and JPEG, the AMT-JRD-based VCM improves an average of 3.861% and 7.886% Bjontegaard Delta-mean Average Precision (BD-mAP), respectively.
[104] DSVTLA: Deep Swin Vision Transformer-Based Transfer Learning Architecture for Multi-Type Cancer Histopathological Cancer Image Classification eess.IV | cs.CVPDF
Muazzem Hussain Khan, Tasdid Hasnain, Md. Jamil khan, Ruhul Amin, Md. Shamim Reza
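TL;DR: This study proposes a deep Swin Vision Transformer-based transfer learning architecture (DSVTLA) for robust multi-type cancer histopathological image classification. The framework combines a hierarchical Swin Transformer with ResNet50-based convolutional feature extraction to capture both long-range contextual dependencies and fine-grained local morphological patterns. Extensive experiments on a multi-cancer dataset covering breast cancer, oral cancer, lung and colon cancer, kidney cancer, and acute lymphocytic leukemia (ALL) assess robustness under heterogeneous clinical imaging conditions. The architecture performs strongly across tasks, reaching 100% test accuracy on the lung-colon cancer and segmented leukemia datasets and 99.23% accuracy on breast cancer classification, and is reported as accurate, stable, and interpretable.
Details
Motivation: Traditional methods for multi-type cancer histopathological image classification struggle to capture long-range contextual dependencies and fine local morphological features at the same time. The goal is a robust, accurate AI-assisted diagnostic system.
Result: On a comprehensive multi-cancer dataset, the model is benchmarked against SOTA models including DenseNet121, DenseNet201, InceptionV3, ResNet50, EfficientNetB3, multiple ViT variants, and Swin Transformer, all trained and validated with a unified pipeline. The proposed architecture reaches 100% test accuracy on the lung-colon cancer and segmented leukemia datasets and 99.23% on breast cancer classification, with near-perfect precision, F1 score, and recall, indicating highly stable performance across cancer types.
Insight: The novelty is combining a hierarchical Swin Transformer with ResNet50 convolutional feature extraction to model global context and local detail together. Viewed objectively, the study provides a unified transfer learning framework that, through balanced data preprocessing, transfer learning, and fine-tuning, achieves SOTA performance on multi-cancer classification and offers a useful benchmark and comparative assessment for designing reliable AI-assisted pathological diagnosis and clinical decision-making.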
Abstract: In this study, we propose a deep Swin Vision Transformer-based transfer learning architecture for robust multi-cancer histopathological image classification. The proposed framework integrates a hierarchical Swin Transformer with ResNet50-based convolutional feature extraction, enabling the model to capture both long-range contextual dependencies and fine-grained local morphological patterns within histopathological images. To validate the proposed architecture, extensive experiments were conducted on a comprehensive multi-cancer dataset covering Breast Cancer, Oral Cancer, Lung and Colon Cancer, Kidney Cancer, and Acute Lymphocytic Leukemia (ALL); both original and segmented images were analyzed to assess model robustness across heterogeneous clinical imaging conditions. Our approach is benchmarked against several state-of-the-art CNN and transfer learning models, including DenseNet121, DenseNet201, InceptionV3, ResNet50, EfficientNetB3, multiple ViT variants, and Swin Transformer models. All models were trained and validated using a unified pipeline incorporating balanced data preprocessing, transfer learning, and fine-tuning strategies. The experimental results show that the proposed architecture consistently achieved superior performance, reaching 100% test accuracy on the lung-colon cancer and segmented leukemia datasets and up to 99.23% accuracy for breast cancer classification. The model also achieved near-perfect precision, F1 score, and recall, indicating highly stable scores across diverse cancer types. Overall, the proposed model establishes a highly accurate, interpretable, and robust multi-cancer classification system, sets a strong benchmark for future research, and provides a unified comparative assessment useful for designing reliable AI-assisted histopathological diagnosis and clinical decision-making.
cs.AI [Back]
[105] SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions cs.AI | cs.CL | cs.LGPDF
Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel
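TL;DR: This paper proposes SUPERNOVA, a data curation framework for Reinforcement Learning with Verifiable Rewards (RLVR) aimed at improving large language models on general reasoning tasks such as causal inference and temporal understanding. It starts from instruction-tuning datasets with expert-annotated ground truth and generates high-quality RLVR training data through systematic data design choices (source task selection, task mixing strategies, and synthetic interventions). Models trained on SUPERNOVA outperform strong baselines such as Qwen3.5 on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro.
Details
Motivation: RLVR has brought significant progress on reasoning in formal domains such as mathematics and code, but LLMs still struggle with general reasoning tasks that require causal inference, temporal understanding, and similar capabilities. The fundamental obstacle to extending RLVR to general reasoning is the lack of high-quality, verifiable training data spanning diverse reasoning skills.
Result: On challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro, models trained on SUPERNOVA outperform strong baselines such as Qwen3.5. On BBEH in particular, relative improvements reach up to 52.8% across model sizes.
Insight: The paper's claimed novelty is the SUPERNOVA data curation framework, whose core insight is that the rich reasoning patterns in instruction-tuning datasets can be systematically adapted for RLVR training. Viewed objectively, the paper uses 100+ controlled RL experiments to analyze how three data design factors (source task selection, task mixing strategies, and synthetic interventions) affect downstream reasoning performance. It finds that selecting source tasks by their performance on individual target tasks beats selecting by overall average performance, giving practical guidance for using human-annotated resources to extend RLVR to general reasoning.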
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.
[106] From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI cs.AI | cs.CLPDF
Hongyin Zhu, Jinming Liang, Mengjun Hou, Ruifan Tang, Xianbin Zhu
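TL;DR: This paper proposes LOM-action, a system that turns business events into auditable enterprise AI decisions through event-driven ontology simulation. It uses a dual-mode architecture (skill mode and reasoning mode), ensures that decisions are derived entirely from the scenario-valid graph produced by simulation, and generates a complete audit log.
Details
Motivation: Existing LLM-based agent systems share an architectural flaw: they generate answers from an unrestricted knowledge space without first simulating how the business scenario reshapes that space, so decisions are fluent but ungrounded and cannot be audited.
Result: LOM-action achieves 93.82% accuracy and 98.74% tool-chain F1 against the frontier baselines Doubao-1.8 and DeepSeek-V3.2. The baselines reach 80% accuracy but only 24-36% F1, exposing the "illusive accuracy" phenomenon.
Insight: The novelty is an event-driven ontology simulation architecture: business events trigger conditions encoded in the enterprise ontology, which drive deterministic graph mutations in an isolated sandbox, producing a scenario-valid simulation graph as the sole source of decisions. The core insight is that ontology-governed, event-driven simulation, not model scale, is the architectural prerequisite for trustworthy enterprise decision intelligence.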
Abstract: Existing LLM-based agent systems share a common architectural failure: they answer from the unrestricted knowledge space without first simulating how active business scenarios reshape that space for the event at hand, producing decisions that are fluent but ungrounded and that carry no audit trail. We present LOM-action, which equips enterprise AI with event-driven ontology simulation: business events trigger scenario conditions encoded in the enterprise ontology (EO), which drive deterministic graph mutations in an isolated sandbox, evolving a working copy of the subgraph into the scenario-valid simulation graph $G_{\text{sim}}$; all decisions are derived exclusively from this evolved graph. The core pipeline is event → simulation → decision, realized through a dual-mode architecture with a skill mode and a reasoning mode. Every decision produces a fully traceable audit log. LOM-action achieves 93.82% accuracy and 98.74% tool-chain F1 against frontier baselines Doubao-1.8 and DeepSeek-V3.2, which reach only 24–36% F1 despite 80% accuracy, exposing the illusive accuracy phenomenon. The four-fold F1 advantage confirms that ontology-governed, event-driven simulation, not model scale, is the architectural prerequisite for trustworthy enterprise decision intelligence.
[107] Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym cs.AI | cs.CLPDF
Lars Benedikt Kaesberg, Tianyu Yang, Niklas Bauer, Terry Ruas, Jan Philip Wahle
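TL;DR: This paper introduces Spatial-Gym, a Gymnasium environment for evaluating models on spatial reasoning, focused on pathfinding in 2D-grid puzzles. Comparing eight models in three settings (one-shot, step-by-step, and step-by-step with backtracking), it finds that the best model, GPT-OSS 120B, reaches a solve rate of only 16.0%, far below the human baseline (98.0%). The study reports key findings on models' failure to scale reasoning effort, the negative effect of visual input, and the advantage of extended chain-of-thought reasoning.
Details
Motivation: Existing benchmarks usually require a model to produce a complete solution in one shot, unlike humans, who solve problems step by step in interactive environments. An environment that isolates spatial constraint reasoning and supports step-by-step evaluation is needed to measure model capabilities more accurately.
Result: On 500 episodes, the best model, GPT-OSS 120B, achieves a 16.0% solve rate, 82 percentage points below the human baseline. The step-by-step format helps weaker models (up to +5.4%) but hurts stronger models (up to -5.6%). Backtracking improves episode completion but raises solve rate only for weaker models.
Insight: Spatial-Gym frames spatial reasoning as a sequential decision problem and supports step-by-step and backtracking evaluation. The study finds that models fail to scale reasoning effort with difficulty, that visual input substantially lowers solve rate, and that extended chain-of-thought reasoning retains a 3-5x accuracy advantage even in the step-by-step setting, providing a framework for improving spatial reasoning through reinforcement learning.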
Abstract: Spatial reasoning is central to navigation and robotics, yet measuring model capabilities on these tasks remains difficult. Existing benchmarks evaluate models in a one-shot setting, requiring full solution generation in a single response, unlike humans, who work in interactive environments step-by-step. We introduce Spatial-Gym, a Gymnasium environment that isolates spatial constraint reasoning by testing pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking. We evaluate eight models in three settings (one-shot, step-by-step, step-by-step with backtracking) against human, random, and A* baselines on 500 episodes. The best model, GPT-OSS 120B, achieves a solve rate of 16.0%, 82 points below the human baseline (98.0%). The step-by-step format helps weaker models (up to +5.4%) by removing formatting errors, but hurts stronger models (up to -5.6%) by constraining global planning. Backtracking improves episode completion, but increases solve rate only for weaker models; stronger models rarely backtrack and do not benefit from it. Our experiments have three key findings: (1) models fail to scale reasoning effort with difficulty, (2) giving vision models images of the spatial environment reduces solve rate by 73%, and (3) extended chain-of-thought reasoning retains a 3-5x accuracy advantage over standard inference even in the step-by-step setting. Spatial-Gym enables diagnosis of model limitations and provides a framework for improving spatial reasoning through reinforcement learning.
cs.IR [Back]
[108] Pretrain-then-Adapt: Uncertainty-Aware Test-Time Adaptation for Text-based Person Search cs.IR | cs.CVPDF
Jiahao Zhang, Shaofei Huang, Yaxiong Wang, Zhedong Zheng
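TL;DR: This paper proposes a new Pretrain-then-Adapt paradigm that performs offline test-time adaptation on unlabeled test data, addressing data scarcity and high annotation cost in text-based person search. It introduces an uncertainty-aware test-time adaptation framework that estimates uncertainty from bidirectional retrieval disagreement, mitigating domain shift without target-domain supervision.
Details
Motivation: Text-based person search is inherently limited by data scarcity, and existing methods rely on a Pretrain-then-Finetune paradigm that needs large amounts of labeled target-domain data, which is impractical in real deployments. This paper aims to remove the dependence on large-scale target-domain supervision and adapt the model dynamically using only unlabeled test data.
Result: UATTA is validated on four benchmarks (CUHK-PEDES, ICFG-PEDES, RSTPReid, and PAB), with consistent improvements on both CLIP-based and XVLM-based frameworks. It outperforms existing offline test-time adaptation strategies and sets a new benchmark for label-efficient, deployable person search systems.
Insight: The novelty is an uncertainty-aware test-time adaptation framework that estimates uncertainty through a bidirectional retrieval disagreement mechanism and uses it to drive label-free model recalibration, effectively mitigating domain shift. It offers a practical deployment option for cross-modal retrieval in data-scarce settings and reduces annotation dependence.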
Abstract: Text-based person search faces inherent limitations due to data scarcity, driven by stringent privacy constraints and the high cost of manual annotation. To mitigate this, existing methods usually rely on a Pretrain-then-Finetune paradigm, where models are first pretrained on synthetic person-caption data to establish cross-modal alignment, followed by fine-tuning on labeled real-world datasets. However, this paradigm lacks practicality in real-world deployment scenarios, where large-scale annotated target-domain data is typically inaccessible. In this work, we propose a new Pretrain-then-Adapt paradigm that eliminates reliance on extensive target-domain supervision through an offline test-time adaptation manner, enabling dynamic model adaptation using only unlabeled test data with minimal post-train time cost. To mitigate overconfidence with false positives of previous entropy-based test-time adaptation, we propose an Uncertainty-Aware Test-Time Adaptation (UATTA) framework, which introduces a bidirectional retrieval disagreement mechanism to estimate uncertainty, i.e., low uncertainty is assigned when an image-text pair ranks highly in both image-to-text and text-to-image retrieval, indicating high alignment; otherwise, high uncertainty is detected. This indicator drives offline test-time model recalibration without labels, effectively mitigating domain shift. We validate UATTA on four benchmarks, i.e., CUHK-PEDES, ICFG-PEDES, RSTPReid, and PAB, showing consistent improvements across both CLIP-based (one-stage) and XVLM-based (two-stage) frameworks. Ablation studies confirm that UATTA outperforms existing offline test-time adaptation strategies, establishing a new benchmark for label-efficient, deployable person search systems. Our code is available at https://github.com/nkuzjh/UATTA.
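The bidirectional retrieval disagreement idea (low uncertainty only when a pair ranks highly in both retrieval directions) can be sketched as a mutual top-k check. This is a simplified reading of the abstract, not the UATTA code from the linked repository; the function name, the hard top-k rule, and the similarity values are assumptions.

```python
def mutual_topk_pairs(sim, k=1):
    """Return (image, text) pairs that rank in the top-k in both retrieval directions.

    sim[i][j] is the similarity between image i and text j.
    """
    n_img, n_txt = len(sim), len(sim[0])
    image_to_text = []
    for i in range(n_img):
        ranked = sorted(range(n_txt), key=lambda j: sim[i][j], reverse=True)
        image_to_text.append(set(ranked[:k]))
    text_to_image = []
    for j in range(n_txt):
        ranked = sorted(range(n_img), key=lambda i: sim[i][j], reverse=True)
        text_to_image.append(set(ranked[:k]))
    # Agreement in both directions is treated as low uncertainty.
    return [
        (i, j)
        for i in range(n_img)
        for j in sorted(image_to_text[i])
        if i in text_to_image[j]
    ]


# Hypothetical 3x3 image-text similarity matrix.
sim = [
    [0.90, 0.20, 0.10],
    [0.30, 0.80, 0.70],
    [0.20, 0.85, 0.60],
]
low_uncertainty_pairs = mutual_topk_pairs(sim, k=1)  # -> [(0, 0), (2, 1)]
```

Here image 1 retrieves text 1, but text 1 retrieves image 2, so that pair is flagged as uncertain even though its similarity (0.80) is high, which is the overconfident false-positive case that plain entropy-based adaptation would trust.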
cs.GR [Back]
[109] MeshOn: Intersection-Free Mesh-to-Mesh Composition cs.GR | cs.CVPDF
Hyunwoo Kim, Itai Lang, Hadar Averbuch-Elor, Silvia Sellán, Rana Hanocka
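TL;DR: MeshOn is a method for physically and semantically realistic composition of two input meshes. A multi-step optimization framework fits the meshes onto each other without intersections: Vision-to-Language Models provide the initial alignment, the configuration is optimized with geometric losses and a physics-inspired barrier loss, and a diffusion prior assists the final deformation of the object.
Details
Motivation: In digital art workflows, composing an accessory mesh with a base mesh in a physically and semantically realistic way requires avoiding surface intersections.
Result: The method successfully fits accessories of various materials over a range of target regions, and shows robustness and accuracy compared with generative approaches and traditional registration algorithms.
Insight: The contributions are a structured initial alignment using Vision-to-Language Models, a multi-step optimization framework combining geometric losses with a physics-inspired barrier loss, and a diffusion-prior-assisted final deformation, designed to fit directly into existing digital artist workflows.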
Abstract: We propose MeshOn, a method that finds physically and semantically realistic compositions of two input meshes. Given an accessory, a base mesh with a user-defined target region, and optional text strings for both meshes, MeshOn uses a multi-step optimization framework to realistically fit the meshes onto each other while preventing intersections. We initialize the shapes’ rigid configuration via a structured alignment scheme using Vision-to-Language Models, which we then optimize using a combination of attractive geometric losses, and a physics-inspired barrier loss that prevents surface intersections. We then obtain a final deformation of the object, assisted by a diffusion prior. Our method successfully fits accessories of various materials over a breadth of target regions, and is designed to fit directly into existing digital artist workflows. We demonstrate the robustness and accuracy of our pipeline by comparing it with generative approaches and traditional registration algorithms.
cs.LG [Back]
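[110] Robust Reasoning Benchmark cs.LG | cs.AI | cs.CL
Pavel Golikov, Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey
TL;DR: The paper proposes an evaluation pipeline of 14 perturbation techniques for testing the robustness of large language models on mathematical reasoning. Frontier models prove robust, but open-weight reasoning models suffer catastrophic degradation under perturbation (average accuracy drops of up to 55%). An experiment that forces models to solve several problems sequentially within a single context window indicates that intermediate reasoning steps persistently pollute the attention mechanism.
Details
Motivation: Large language models perform well on standard mathematical benchmarks, yet their reasoning is heavily overfit to textual formatting, so their true reasoning robustness needs systematic evaluation.
Result: On a benchmark built from the AIME 2024 dataset and evaluated with 8 SOTA models, frontier commercial models remain resilient, while open-weight models show average accuracy drops of up to 55% under perturbation. In the sequential-solving task, open-weight models from 7B to 120B parameters and Claude Opus 4.6 both lose accuracy on later problems.
Insight: The contributions are a systematic perturbation-based evaluation framework and the observation of attention pollution by intermediate reasoning. The authors argue that future reasoning architectures must integrate explicit context resets within the Chain-of-Thought, which raises fundamental open questions about the optimal granularity of atomic reasoning tasks.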
Abstract: While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their underlying reasoning processes remain highly overfit to standard textual formatting. We propose a perturbation pipeline consisting of 14 techniques to evaluate the robustness of LLM reasoning. We apply this pipeline to the AIME 2024 dataset and evaluate 8 state-of-the-art models on the resulting benchmark. While frontier models exhibit resilience, open-weight reasoning models suffer catastrophic collapses (up to 55% average accuracy drops across perturbations and up to 100% on some), exposing structural fragility. To further disentangle mechanical parsing failures from downstream reasoning failures, we strictly isolate the models’ working memory capacity by forcing models to solve multiple unperturbed mathematical problems sequentially within a single context window. Our results indicate that open-weight models ranging from 7B to 120B parameters and Claude Opus 4.6 exhibit accuracy decay on subsequent problems. This degradation demonstrates that intermediate reasoning steps permanently pollute standard dense attention mechanisms. We argue that to achieve reliable reasoning, future reasoning architectures must integrate explicit contextual resets within a model’s own Chain-of-Thought, leading to fundamental open questions regarding the optimal granularity of atomic reasoning tasks.
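[111] Skip-Connected Policy Optimization for Implicit Advantage cs.LG | cs.CL
Fengwei Teng, Jinyi Bai, Xinhao Yao, Demi Ruohan Wang, Jiahao Zhao
TL;DR: The paper proposes Skip-Connected Policy Optimization (SKPO) to address the high variance and sign inconsistency that Monte Carlo estimation introduces when fine-grained dense rewards are used in reinforcement learning with verifiable rewards (RLVR). SKPO decomposes reasoning into an upstream and a downstream phase, and a skip connection lets the model use helpful upstream reasoning while bypassing flawed reasoning through direct access to the original problem.
Details
Motivation: Under practical sampling budgets, Monte Carlo advantage estimates for early reasoning tokens are high-variance and sign-inconsistent, so dense rewards end up underperforming outcome-only GRPO. A more stable and effective optimization strategy is needed.
Result: Experiments on Qwen2.5-Math-7B and Llama-3.2-3B show relative gains of 3.91% and 6.17%, respectively, over the strongest baselines across mathematical benchmarks and out-of-domain tasks including general reasoning and code generation.
Insight: The contribution is a two-phase decomposition with a skip connection: the upstream phase uses single-stream optimization with dense rewards from downstream sampling, while the downstream phase keeps group-relative optimization. This design yields an implicit advantage: trajectories have higher-quality intermediate steps even when matched for final-answer correctness.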
Abstract: Group Relative Policy Optimization (GRPO) has proven effective in RLVR by using outcome-based rewards. While fine-grained dense rewards can theoretically improve performance, we reveal that under practical sampling budgets, Monte Carlo estimation yields high-variance and sign-inconsistent advantages for early reasoning tokens, paradoxically underperforming outcome-only GRPO. We propose Skip-Connected Policy Optimization (SKPO), which decomposes reasoning into upstream and downstream phases: upstream receives dense rewards from downstream Monte Carlo sampling with single-stream optimization; downstream maintains group-relative optimization, where a skip connection concatenates the upstream segment with the original problem, enabling the model to leverage helpful upstream reasoning while preserving the freedom to bypass flawed reasoning through direct problem access. Experiments demonstrate relative gains of 3.91% and 6.17% over the strongest baselines on Qwen2.5-Math-7B and Llama-3.2-3B respectively across mathematical benchmarks and out-of-domain tasks including general reasoning and code generation. Further analysis reveals an implicit advantage: SKPO generates trajectories with higher intermediate-step quality even when matched for final correctness.
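For context, the outcome-only GRPO baseline mentioned in the abstract scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows the commonly used mean-and-standard-deviation normalization only; it is not SKPO, and the use of the population standard deviation and the `eps` constant are assumptions (implementations differ on both).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one group of sampled responses.

    rewards: one scalar outcome reward per response to the same prompt.
    eps: small constant (assumed) that avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses, two correct (reward 1) and two incorrect (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every response gets the same reward, the group carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because every token in a response shares that response's advantage, this baseline says nothing about which intermediate steps helped, which is the gap dense rewards try to close.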
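[112] Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition cs.LG | cs.AI | cs.CL
Tiejin Chen, Huaiyuan Yao, Jia Chen, Evangelos E. Papalexakis, Hua Wei
TL;DR: The paper proposes MATU, a framework for quantifying the uncertainty of LLM-based multi-agent systems. It represents multi-round reasoning trajectories as embedding matrices, organizes them into a higher-order tensor, and applies tensor decomposition to disentangle and quantify distinct sources of uncertainty, yielding a reliability measure that generalizes across agent structures.
Details
Motivation: Existing uncertainty quantification methods are typically designed for single-turn outputs and cannot handle the complexities that communication dynamics and role dependencies introduce in multi-agent systems, in particular cascading uncertainty in multi-step reasoning, variability of inter-agent communication paths, and diversity of communication topologies.
Result: Experiments show that MATU effectively estimates holistic and robust uncertainty across diverse tasks and communication topologies, providing a comprehensive reliability assessment.
Insight: The contribution is to model entire reasoning trajectories as a tensor and decompose it to disentangle the different sources of uncertainty in a multi-agent system, which offers a new perspective on understanding and quantifying the reliability of complex interacting systems.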
Abstract: While Large Language Model-based Multi-Agent Systems (MAS) consistently outperform single-agent systems on complex tasks, their intricate interactions introduce critical reliability challenges arising from communication dynamics and role dependencies. Existing Uncertainty Quantification methods, typically designed for single-turn outputs, fail to address the unique complexities of the MAS. Specifically, these methods struggle with three distinct challenges: the cascading uncertainty in multi-step reasoning, the variability of inter-agent communication paths, and the diversity of communication topologies. To bridge this gap, we introduce MATU, a novel framework that quantifies uncertainty through tensor decomposition. MATU moves beyond analyzing final text outputs by representing entire reasoning trajectories as embedding matrices and organizing multiple execution runs into a higher-order tensor. By applying tensor decomposition, we disentangle and quantify distinct sources of uncertainty, offering a comprehensive reliability measure that is generalizable across different agent structures. We provide comprehensive experiments to show that MATU effectively estimates holistic and robust uncertainty across diverse tasks and communication topologies.
[113] $p1$: Better Prompt Optimization with Fewer Prompts cs.LG | cs.CL
Zhaolin Gao, Yu Wang, Bo Liu, Thorsten Joachims
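TL;DR: The paper studies why the effectiveness of prompt optimization varies across tasks and proposes $p1$, a user prompt filtering method that improves optimization by selecting a small subset of user prompts with high variance across candidate system prompts.
Details
Motivation: The goal is to explain why prompt optimization works on some tasks and fails on others and, building on a decomposition of reward variance, to address the optimization difficulty caused by generation stochasticity and by heterogeneity among user prompts.
Result: Experiments on reasoning benchmarks show that $p1$ substantially improves prompt optimization over training on the full dataset and outperforms strong baselines such as GEPA. For example, a system prompt trained on only two prompts from AIME 24 generalizes well to other reasoning benchmarks.
Insight: The contribution is to decompose reward variance into variance among responses and variance among system prompts, which identifies the condition under which optimization succeeds, and to propose a simple and effective variance-based filter for user prompts that reduces the number of prompts required and makes optimization more efficient.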
Abstract: Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely across tasks. We study what makes a task amenable to prompt optimization. We show that the reward variance across different system prompts can be decomposed into two components: variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization succeeds when variance among system prompts is sufficiently large, but fails when variance among responses dominates the variance of the system prompts. Surprisingly, we further show that scaling to more user prompts can hurt optimization by reducing variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts. Motivated by this insight, we propose $p1$, a simple user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts. This subset of user prompts allows one to distinguish a good system prompt from a bad one, making system optimization easier. Experiments on reasoning benchmarks show that $p1$ substantially improves prompt optimization over training on the full dataset and outperforms strong baselines such as GEPA. Notably, training on only two prompts from AIME 24 yields a system prompt that generalizes well to other reasoning benchmarks.
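The variance decomposition and the filtering step described in the abstract can be sketched as follows. This is one plausible reading, not the authors' code: the reward layout `rewards[s][k]`, the use of population variances with an equal number of responses per system prompt (under which the law of total variance holds exactly), and the `keep` parameter are all assumptions for illustration.

```python
from statistics import mean, pvariance


def decompose_variance(rewards):
    """Split reward variance for one user prompt into two components.

    rewards[s][k]: reward of the k-th sampled response under candidate system prompt s.
    Returns (variance among responses, variance among system prompts).
    """
    per_prompt_means = [mean(r) for r in rewards]
    among_responses = mean([pvariance(r) for r in rewards])  # generation stochasticity
    among_system_prompts = pvariance(per_prompt_means)       # system prompt quality gap
    return among_responses, among_system_prompts


def select_user_prompts(rewards_by_user_prompt, keep=1):
    """Keep the user prompts whose rewards vary most across system prompts."""
    scored = []
    for name, rewards in rewards_by_user_prompt.items():
        _, among_system_prompts = decompose_variance(rewards)
        scored.append((among_system_prompts, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:keep]]


# Made-up rewards: 2 candidate system prompts x 4 sampled responses per user prompt.
rewards_by_user_prompt = {
    "q_easy": [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]],
    "q_noisy": [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]],
    "q_discriminative": [[1.0, 1.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
}
selected = select_user_prompts(rewards_by_user_prompt, keep=1)
within, between = decompose_variance(rewards_by_user_prompt["q_discriminative"])
```

Only `q_discriminative` separates the two system prompts. `q_easy` has no variance at all, and all of the variance in `q_noisy` comes from sampling, so neither helps tell a good system prompt from a bad one.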
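[114] Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs cs.LG | cs.AI | cs.CL | cs.CV
Jinqi Luo, Jinyu Yang, Tal Neiman, Lei Fan, Bing Yin
TL;DR: The paper proposes Dictionary-Aligned Concept Control (DACO), a framework for improving the safety of multimodal large language models (MLLMs). It builds a large dictionary of 15,000 multimodal concepts (the DACO-400K dataset) and uses a Sparse Autoencoder (SAE) to intervene on model activations at a fine granularity, defending against malicious queries at inference time while preserving general capabilities.
Details
Motivation: Existing approaches such as prompt engineering, response classification, and finetuning are often ineffective against evolving malicious query patterns, computationally expensive, or inflexible. Existing activation steering methods typically handle only a narrow set of safety concepts or struggle to adjust specific concepts without affecting others.
Result: Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM-SafetyBench, JailBreakV) show that DACO significantly improves MLLM safety while maintaining general-purpose capabilities.
Insight: The core contribution is a large-scale, curated multimodal concept dictionary combined with a sparse autoencoder, which enables fine-grained, concept-level steering of internal activations and offers a flexible and effective approach to safety intervention.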
Abstract: Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli and summarizing their activations into concept directions. We name the dataset DACO-400K. Second, we show that the curated dictionary can be used to intervene on activations via sparse coding. Third, we propose a new steering approach that uses our dictionary to initialize the training of an SAE and automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM-SafetyBench, JailBreakV) show that DACO significantly improves MLLM safety while maintaining general-purpose capabilities.
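[115] Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective cs.LG | cs.AI | cs.CL
Tokio Kajitsuka, Ukyo Honda, Sho Takase
TL;DR: The paper revisits the capacity gap in chain-of-thought (CoT) distillation and argues that prior work may have overestimated its impact because of unsuitable experimental settings. The authors find that CoT distillation often leaves the student below its pre-distillation baseline, and they propose a more realistic evaluation protocol. Under that protocol, capacity gap effects do not consistently dominate across tasks and settings, especially when candidate teachers differ substantially in performance.
Details
Motivation: Prior studies of the capacity gap in CoT distillation (distillation failing when teacher and student capabilities are mismatched) may be biased by their experimental settings. The paper re-evaluates the problem from a practical perspective to offer more reliable guidance.
Result: Under the proposed evaluation protocol, the capacity gap effect does not always dominate distillation outcomes. Its impact varies with task and setting, and it is especially weak when teacher performance differs widely.
Insight: The paper shows that performance degradation from CoT distillation is often overlooked because only post-distillation models are compared, and it proposes a more realistic evaluation method. More broadly, the study underlines how much experimental design shapes conclusions and gives practical guidance for choosing teacher-student pairs.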
Abstract: Chain-of-thought (CoT) distillation transfers reasoning behaviors from a strong teacher to a smaller student, but prior work reports a capacity gap: distillation may fail when the teacher-student capability mismatch is large. We revisit the capacity gap from a practical perspective by re-examining commonly used experimental settings. Notably, we find that CoT distillation often degrades performance compared to the student’s pre-distillation baseline, an issue obscured when only post-distillation comparisons are reported. We therefore propose a more realistic evaluation protocol and find that the impact of capacity gap effects does not consistently dominate across tasks and settings, especially when candidate teachers differ substantially in performance. Our results offer practical guidance for selecting teacher-student pairs in CoT distillation.
cs.RO
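[116] Multimodal Anomaly Detection for Human-Robot Interaction cs.RO | cs.CV
Guilherme Ribeiro, Iordanis Antypas, Leonardo Bizzaro, João Bimbo, Nuno Cruz Garcia
TL;DR: The paper proposes MADRI, a multimodal anomaly detection framework for human-robot interaction (HRI). It first transforms video streams into semantic feature vectors and then performs reconstruction-based anomaly detection on those features, augmented with the robot's internal sensor readings and a Scene Graph so that both external visual anomalies and internal failures are captured.
Details
Motivation: Safe and reliable HRI requires timely detection of unexpected events that could lead to system failures or unsafe behaviour. Reconstruction-based models have been widely explored in HRI, but approaches that operate directly on feature vectors remain largely unexplored.
Result: On a custom dataset of a pick-and-place robotic task, reconstruction on vision-based feature vectors alone already detects anomalies effectively, and incorporating the other modalities further improves detection performance.
Insight: The contribution is to perform reconstruction-based detection on semantic feature vectors extracted from video and to integrate multimodal data (visual features, internal sensors, Scene Graph) for robustness, which offers a multimodal feature-reconstruction approach to anomaly detection in HRI.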
Abstract: Ensuring safety and reliability in human-robot interaction (HRI) requires the timely detection of unexpected events that could lead to system failures or unsafe behaviours. Anomaly detection thus plays a critical role in enabling robots to recognize and respond to deviations from normal operation during collaborative tasks. While reconstruction models have been actively explored in HRI, approaches that operate directly on feature vectors remain largely unexplored. In this work, we propose MADRI, a framework that first transforms video streams into semantically meaningful feature vectors before performing reconstruction-based anomaly detection. Additionally, we augment these visual feature vectors with the robot’s internal sensors’ readings and a Scene Graph, enabling the model to capture both external anomalies in the visual environment and internal failures within the robot itself. To evaluate our approach, we collected a custom dataset consisting of a simple pick-and-place robotic task under normal and anomalous conditions. Experimental results demonstrate that reconstruction on vision-based feature vectors alone is effective for detecting anomalies, while incorporating other modalities further improves detection performance, highlighting the benefits of multimodal feature reconstruction for robust anomaly detection in human-robot collaboration.
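Reconstruction-based anomaly detection, as described in the abstract, typically scores a sample by how poorly a model trained on normal data reconstructs it. The sketch below covers only that scoring step and is not MADRI itself: the reconstructions are taken as given (a hypothetical trained model would produce them), and the mean-squared error and the "mean plus k standard deviations" threshold are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def reconstruction_error(features, reconstruction):
    """Mean squared error between a feature vector and its reconstruction."""
    return mean([(x - y) ** 2 for x, y in zip(features, reconstruction)])


def fit_threshold(normal_errors, k=3.0):
    """Calibrate a threshold from errors observed on normal (non-anomalous) data."""
    return mean(normal_errors) + k * pstdev(normal_errors)


def is_anomalous(features, reconstruction, threshold):
    """Flag a sample whose reconstruction error exceeds the calibrated threshold."""
    return reconstruction_error(features, reconstruction) > threshold


# Made-up reconstruction errors collected on normal executions of a task.
threshold = fit_threshold([0.010, 0.012, 0.009, 0.011])

# A frame the model reconstructs well, and one it reconstructs badly.
normal_flag = is_anomalous([0.5, 0.2, 0.1], [0.45, 0.25, 0.15], threshold)
anomaly_flag = is_anomalous([0.5, 0.2, 0.9], [0.5, 0.2, 0.1], threshold)
```

In a multimodal setting, the feature vector would presumably concatenate visual, sensor, and scene-graph features before scoring, but how MADRI fuses them is not specified in the abstract.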
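[117] VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis cs.RO | cs.CV
Xiaolei Lang, Yang Wang, Yukun Zhou, Chaojun Ni, Kerui Li
TL;DR: The paper proposes VAG, a flow-matching-based dual-stream framework that jointly generates video and action sequences, addressing the lack of paired action trajectories that limits existing world models for robot policy learning.
Details
Motivation: Collecting large-scale human teleoperation data is expensive. Existing world models either provide no action trajectories or align video and action poorly, and two-stage generation pipelines are inefficient and accumulate errors.
Result: In both simulated and real-world settings, VAG produces aligned video-action pairs with competitive prediction quality, supports executable trajectory replay, and provides useful synthetic pretraining data that improves downstream policy generalization.
Insight: The contribution is to synchronize denoising across the two streams and to use an adaptive 3D pooling mechanism that passes global video context to the action branch, which improves cross-modal consistency during generation and enables efficient embodied data synthesis.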
Abstract: Recent advances in robot foundation models trained on large-scale human teleoperation data have enabled robots to perform increasingly complex real-world tasks. However, scaling these systems remains difficult because collecting task-specific demonstrations is expensive and labor-intensive. Synthetic data, especially generated videos, offer a promising direction, but existing World Models (WMs) are not directly suitable for policy learning since they do not provide paired action trajectories. World-Action (WA) models partially address this by predicting actions with visual outputs, yet often lack strong video-action alignment, while two-stage pipelines that generate video first and then infer actions introduce inefficiency and error accumulation. To address these limitations, we propose VAG, a unified flow-matching-based dual-stream framework that jointly generates video and action under visual and language conditioning. By synchronizing denoising in both branches and using an adaptive 3D pooling mechanism to transfer compact global video context to the action branch, VAG improves cross-modal consistency during generation. Across both simulated and real-world settings, VAG produces aligned video-action pairs with competitive prediction quality, supports executable trajectory replay, and provides useful synthetic pretraining data that improves downstream policy generalization, indicating its potential as a practical world-action model for embodied data synthesis.
cs.MM
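[118] 2D or 3D: Who Governs Salience in VLA Models? – Tri-Stage Token Pruning Framework with Modality Salience Awareness cs.MM | cs.CV | cs.RO
Zihao Zheng, Sicheng Tian, Zhihao Mao, Lingyue Zhang, Chenyue Li
TL;DR: Multi-visual-modal Vision-Language-Action (MVLA) models take more input tokens because of modality expansion, which slows inference. The paper proposes a tri-stage token pruning framework that analyzes how the salience of 2D and 3D modalities differs and changes across processing stages in order to select tokens and prune efficiently. Experiments show up to a 2.55x inference speedup with minimal accuracy loss and only 5.8% overhead.
Details
Motivation: MVLA models improve spatial perception by adding a 3D modality, but the extra input tokens increase the need for acceleration. Existing token pruning schemes are designed for 2D-only VLA models and ignore salience differences between 2D and 3D modalities, so a modality-salience-aware pruning method is needed.
Result: The proposed framework achieves up to a 2.55x inference speedup with minimal accuracy loss, at an additional computational overhead of only 5.8%.
Insight: The core contribution is a tri-stage analysis that captures the discrepancy and dynamics of 2D/3D modality salience along the MVLA processing pipeline, together with a matching tri-stage token pruning framework. The broader idea for token-level optimization of multimodal models is to prune adaptively according to each modality's contribution at each processing stage, rather than applying one uniform strategy to all tokens or modalities.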
Abstract: Vision-Language-Action (VLA) models have emerged as the mainstream of embodied intelligence. Recent VLA models have expanded their input modalities from 2D-only to 2D+3D paradigms, forming multi-visual-modal VLA (MVLA) models. Despite achieving improved spatial perception, MVLA faces a greater acceleration demand due to the increased number of input tokens caused by modal expansion. Token pruning is an effective optimization method tailored to MVLA models. However, existing token pruning schemes are designed for 2D-only VLA models, ignoring 2D/3D modality salience differences. In this paper, we follow the application process of multi-modal data in MVLA models and develop a tri-stage analysis to capture the discrepancy and dynamics of 2D/3D modality salience. Based on these, we propose a corresponding tri-stage token pruning framework for MVLA models to achieve optimal 2D/3D token selection and efficient pruning. Experiments show that our framework achieves up to a 2.55x inference speedup with minimal accuracy loss, while only costing 5.8% overhead. Our code is coming soon.
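[119] Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation cs.MM | cs.CV
Lingfeng Huang, Huizhong Guo, Tianjun Wei, Yingpeng Du, Zhu Sun
TL;DR: The paper proposes FixATE, which aligns a vision-language model's (VLM's) visual attention with user-specific gaze fixation patterns to improve the fidelity of user simulators for recommender system evaluation. It probes the VLM's internal visual attention with interpretability operators and learns personalized soft prompts that steer the model toward each user's fixation pattern, so that the simulator more faithfully reproduces how users perceive and click in recommendation interfaces.
Details
Motivation: Existing LLM-based user simulators perceive recommendations through text or structured metadata and ignore the personalized gaze patterns of real users browsing visual interfaces. As a result, they cannot accurately reproduce how users visually perceive recommendation layouts and make decisions, which calls for a simulation method that aligns with users' visual attention.
Result: Experiments on a real-world eye-tracking dataset collected in a carousel-based recommendation setting show that FixATE consistently improves both attention alignment and click prediction accuracy across three interpretability-based probing operators and two architecturally distinct VLM backbones.
Insight: The paper presents itself as the first to bring personalized gaze fixation patterns into attention tuning for vision-language models. Interpretability operators make the model's internal attention comparable with human fixation, and personalized soft prompts tune the model to "see like the user", which offers a path toward simulators that are more faithful to real visual behaviour.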
Abstract: Large language model (LLM) agents are increasingly deployed as scalable user simulators for recommender system evaluation. Yet existing simulators perceive recommendations through text or structured metadata rather than the visual interfaces real users browse, a critical gap, since attention over recommendation layouts is both visually driven and highly personalized. We investigate whether aligning a vision-language model’s (VLM’s) visual attention with user-specific gaze patterns can improve simulation fidelity. Analysis of a real-world eye-tracking dataset collected in a carousel-based recommendation setting reveals that users exhibit stable individual gaze patterns strongly predictive of click behavior. Building on this finding, we propose Fixation-Aligned Tuning for user Emulation (FixATE). Our approach first probes the VLM’s internal visual attention via interpretability operators to obtain a slot-level relevance distribution comparable with human fixation, and then learns personalized soft prompts to steer the model’s attention toward each user’s characteristic fixation pattern. Experiments across three interpretability-based probing operators and two architecturally distinct VLM backbones demonstrate consistent improvements in both attention alignment and click prediction accuracy. These results suggest that making the model “see like the user” is a viable path toward simulators that more faithfully reproduce how users perceive and act in recommendation interfaces.
cs.NE
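[120] Ge$^\text{2}$mS-T: Multi-Dimensional Grouping for Ultra-High Energy Efficiency in Spiking Transformer cs.NE | cs.AI | cs.CV
Zecheng Hao, Shenghao Xie, Kang Chen, Wenxuan Liu, Zhaofei Yu
TL;DR: The paper proposes Ge$^\text{2}$mS-T, a spiking vision Transformer architecture that introduces grouped computation along the temporal, spatial, and network structure dimensions. It aims to optimize training and inference metrics of spiking neural networks together, addressing the difficulty existing methods have in balancing memory, accuracy, and energy consumption.
Details
Motivation: Existing training paradigms for spiking vision Transformers, such as ANN-SNN Conversion and Spatial-Temporal Backpropagation, have inherent limitations that prevent concurrent optimization of memory overhead, learning capability, and energy budget, which leaves them deficient in training and inference metrics.
Result: Experiments indicate superior performance with ultra-high energy efficiency on challenging benchmarks, but the abstract does not name the benchmarks or give quantitative comparisons with SOTA.
Insight: The main contributions are: 1) the Grouped-Exponential-Coding-based IF (ExpG-IF) model, which enables lossless conversion with constant training overhead; 2) Group-wise Spiking Self-Attention (GW-SSA), which reduces computational complexity through multi-scale token grouping and multiplication-free operations; 3) what the authors describe as the first systematic multi-dimensional grouped computation framework for jointly resolving the triad of memory, learning, and energy in S-ViTs.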
Abstract: Spiking Neural Networks (SNNs) offer superior energy efficiency over Artificial Neural Networks (ANNs). However, they encounter significant deficiencies in training and inference metrics when applied to Spiking Vision Transformers (S-ViTs). Existing paradigms including ANN-SNN Conversion and Spatial-Temporal Backpropagation (STBP) suffer from inherent limitations, precluding concurrent optimization of memory, accuracy and energy consumption. To address these issues, we propose Ge$^\text{2}$mS-T, a novel architecture implementing grouped computation across temporal, spatial and network structure dimensions. Specifically, we introduce the Grouped-Exponential-Coding-based IF (ExpG-IF) model, enabling lossless conversion with constant training overhead and precise regulation for spike patterns. Additionally, we develop Group-wise Spiking Self-Attention (GW-SSA) to reduce computational complexity via multi-scale token grouping and multiplication-free operations within a hybrid attention-convolution framework. Experiments confirm that our method can achieve superior performance with ultra-high energy efficiency on challenging benchmarks. To the best of our knowledge, this is the first work to systematically establish multi-dimensional grouped computation for resolving the triad of memory overhead, learning capability and energy budget in S-ViTs.
cs.CR
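[121] CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion cs.CR | cs.AI | cs.CV | cs.LG
Akshit Jindal, Saket Anand, Chetan Arora, Vikram Goyal
TL;DR: The paper proposes CLIP-Inspector (CI), a backdoor detection method for prompt-tuned CLIP models. Using white-box access to the model and a pool of unlabeled OOD images, it reconstructs possible trigger patterns for each class to decide whether the model is backdoored, and the reconstructed trigger can be used to repair the model through fine-tuning.
Details
Motivation: In MLaaS settings, a malicious provider can implant a backdoor in a prompt-tuned CLIP model. Existing methods cannot detect backdoors that affect only the prompts and leave the encoders untouched, so a model-level verification scheme is needed.
Result: Experiments across ten datasets and four backdoor attacks show that CI reconstructs effective triggers in a single epoch using only 1,000 OOD images, reaching 94% detection accuracy (47 of 50 models) and a markedly higher AUROC than the baselines (0.973 vs 0.495/0.687).
Insight: The contribution is backdoor detection tailored to prompt-tuned CLIP models: OOD trigger inversion provides model-level verification, and fine-tuning with the reconstructed trigger repairs the backdoor, giving a detection-and-repair framework for safe deployment.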
Abstract: Organisations with limited data and computational resources increasingly outsource model training to Machine Learning as a Service (MLaaS) providers, who adapt vision-language models (VLMs) such as CLIP to downstream tasks via prompt tuning rather than training from scratch. This semi-honest setting creates a security risk where a malicious provider can follow the prompt-tuning protocol yet implant a backdoor, forcing triggered inputs to be classified into an attacker-chosen class, even for out-of-distribution (OOD) data. Such backdoors leave encoders untouched, making them undetectable to existing methods that focus on encoder corruption. Other data-level methods that sanitize data before training or during inference also fail to answer the critical question, “Is the delivered model backdoored or not?” To address this model-level verification problem, we introduce CLIP-Inspector (CI), a backdoor detection method designed for prompt-tuned CLIP models. Assuming white-box access to the delivered model and a pool of unlabeled OOD images, CI reconstructs possible triggers for each class to determine if the model exhibits backdoor behaviour or not. Additionally, we demonstrate that using CI’s reconstructed trigger for fine-tuning on correctly labeled triggered inputs enables us to re-align the model and reduce backdoor effectiveness. Through extensive experiments across ten datasets and four backdoor attacks, we demonstrate that CI can reconstruct effective triggers in a single epoch using only 1,000 OOD images, achieving a 94% detection accuracy (47/50 models). Compared to adapted trigger-inversion baselines, CI yields a markedly higher AUROC score (0.973 vs 0.495/0.687), thus enabling the vetting and post-hoc repair of prompt-tuned CLIP models to ensure safe deployment.