Table of Contents

cs.CL [Back]

[1] H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models cs.CL | cs.AI | cs.LGPDF

Cutter Dawes, Aryan Sharma, Angelos Ioannis Lagos, Shivam Raval

TL;DR: 本文提出H-probes,一种线性探针方法,用于从语言模型的潜在表示中提取层次结构(深度和成对距离)。在合成树遍历任务中,该方法能稳健地识别包含层次结构的子空间,并通过消融实验证明这些子空间是低维的、对任务性能具有因果重要性,且具有良好的泛化能力。在真实世界数学推理等任务中也发现了类似但较弱的层次结构,表明模型不仅在句法和概念层面,还在更深层次的抽象(如推理过程本身)中表示层次。

Details

Motivation: 大型语言模型在需要层次推理的任务中表现出色,但对其如何在几何上表示此类思维所需的潜在结构缺乏分析。本文旨在开发一种方法来提取和分析这些潜在表示中的层次结构。

Result: 在合成树遍历任务中,H-probes能稳健地提取层次结构;消融实验表明,包含层次结构的子空间是低维的、对高任务性能具有因果重要性,且能在域内和域外泛化。在真实世界数学推理任务中也发现了类似但较弱的层次结构。

Insight: 创新点在于提出H-probes这一线性探针工具,系统地提取和分析语言模型潜在表示中的层次结构,并实证证明了模型在深层抽象(如推理过程)中也能表示层次,为理解模型内部表示提供了新视角。

Abstract: Representing and navigating hierarchy is a fundamental primitive of reasoning. Large language models have demonstrated proficiency in a wide variety of tasks requiring hierarchical reasoning, but there exists limited analysis on how the models geometrically represent the necessary latent constructions for such thinking. To this end, we develop \textit{H-probes}, a collection of linear probes that extract hierarchical structure, specifically depth and pairwise distance, from latent representations. In synthetic tree traversal tasks, the H-probes robustly find the subspaces containing hierarchical structure necessary to complete the tasks; furthermore, in comprehensive ablation experiments, we show that these hierarchy-containing subspaces are low-dimensional, causally important for high task performance, and generalize within- and out-of-domain. Furthermore, we find analogous, though weaker, hierarchical structure in real-world hierarchical contexts such as mathematical reasoning traces. These results demonstrate that models represent hierarchy not only at the level of syntax and concepts, but at deeper levels of abstraction – including the reasoning process itself.


[2] DIAGRAMS: A Review Framework for Reasoning-Level Attribution in Diagram QA cs.CL | cs.AI | cs.CVPDF

Anirudh Iyengar Kaniyar Narayana Iyengar, Tampu Ravi Kumar, Manan Suri, Raviteja Bommireddy, Dinesh Manocha

TL;DR: 本文提出了DIAGRAMS框架,一个用于图表问答(Diagram QA)中推理级归因的轻量级审查框架,旨在将每个问答对与推导答案所需的所有视觉区域关联起来,而不仅仅是包含最终答案的区域。该框架通过内部元模式和数据集适配器,将界面逻辑与特定数据集的JSON结构解耦,支持证据选择、候选区域生成以及人工验证和精炼。

Details

Motivation: 图表问答需要推理级归因,即链接问答对到所有用于推导答案的视觉区域,而不仅仅是最终答案区域。现有标注工具通常与特定数据集格式紧密耦合,创建此类结构化证据耗时且不灵活,因此需要一种轻量级、通用的审查框架来简化这一过程。

Result: 在六个Diagram QA数据集上,模型建议的证据与审查者最终选择相比,实现了85.39%的精确率和75.30%的召回率(微平均)。这些结果表明,该审查优先框架减少了手动区域创建的工作量,同时与最终推理级归因保持高度一致。

Insight: 创新点在于提出了一个轻量级、模式驱动的审查框架,通过解耦界面逻辑与数据集特定结构,支持灵活的推理级归因标注。从客观角度看,该框架可推广到多种图表类型,有助于数据集审计、基于证据的监督创建和评估,提升了图表问答任务的可解释性和效率。

Abstract: Diagram question answering (Diagram QA) requires reasoning-level attribution that links each question-answer pair to all visual regions needed to derive the answer, rather than only the region containing the final response. Creating such structured evidence across diagrams, charts, maps, circuits, and infographics is time-consuming, and existing annotation tools tightly couple their interfaces to dataset-specific formats. We present DIAGRAMS, a lightweight, schema-driven review framework that decouples interface logic from dataset-specific JSON structures through an internal meta-schema and dataset adapters. Given an image and QA pair with optional candidate regions, the system performs QA-conditioned evidence selection and proposes the regions required for reasoning. When QA pairs or candidate regions are missing, it generates them and supports human verification and refinement. Across six Diagram QA datasets, model-suggested evidence achieves 85.39% precision and 75.30% recall against reviewer-final selections (micro-averaged). These results indicate that the review-first framework reduces manual region creation while maintaining high agreement with final reasoning-level attributions. We release a public demo and installable package to support dataset auditing, grounded supervision creation, and grounded evaluation.


[3] CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine cs.CL | cs.AI | cs.LGPDF

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Goa

TL;DR: 本文提出了CLEAR框架,用于评估医学大语言模型在决策空间呈现、模糊性和不确定性影响下的可靠性,发现现有评估方法存在三个主要局限:增加合理答案选项数量会降低模型识别正确答案和拒绝错误答案的能力;将弃权选项从’以上都不是’改为’我不知道’会加剧模型的不谨慎行为;模型规模增大会加剧’谦逊赤字’问题。

Details

Motivation: 现有医学大语言模型评估依赖简化的考试式基准,未能反映真实医疗场景的模糊性,需要系统评估决策空间呈现、模糊性和不确定性对模型推理的影响。

Result: 在三个基准测试上评估17个LLM,发现增加合理答案选项使模型识别正确答案的准确率下降,包含’我不知道’选项反而增加错误答案选择,且模型规模越大’谦逊赤字’问题越严重。

Insight: 创新点在于提出CLEAR评估框架,系统扰动答案选项数量、弃权选项形式和语义框架,揭示了标准医学基准的局限性,并形式化定义了’谦逊赤字’指标,表明单纯扩大模型规模无法解决可靠性问题。

Abstract: Medical large language model (LLM) evaluations rely on simplified, exam-style benchmarks that rarely reflect the ambiguity of real-world medical inquiries. We introduce the CLinical Evaluation of Ambiguity and Reliability (CLEAR) framework, which assesses how decision-space presentation, ambiguity, and uncertainty affect LLMs’ reasoning on medical benchmarks. CLEAR systematically perturbs (1) the number of plausible answer options, (2) the presence of a ground truth or abstention option, and (3) the semantic framing of answer options. Applying CLEAR on three benchmarks evaluated across 17 LLMs reveals three notable limitations of existing evaluation methods. First, increasing the number of plausible answers degrades a model’s ability to identify the correct answer and abstain against incorrect ones. Second, this lack of caution intensifies as the framing of abstention shifts from assertive rejection like “None of the Above” to uncertainty admission like “I don’t know” (IDK). Notably, just including IDK in the answer space increases incorrect answer selections. Lastly, we formalize the performance gap between identifying the correct answer and abstaining from incorrect ones as the humility deficit, which worsens with model scale. Our findings reveal limitations in standard medical benchmarks and underscore that scaling alone does not resolve LLM reliability issues.


[4] A Systematic Exploration of Text Decomposition and Budget Distribution in Differentially Private Text Obfuscation cs.CLPDF

Stephen Meisenbacher, Angelo Kleinert, Florian Matthes

TL;DR: 本文系统评估了差分隐私文本混淆中的文本分解和隐私预算分配技术,探讨了不同文本分块方法与ε预算分配策略的组合效果,揭示了设计选择对结果的重要影响,并证明了通过优化混淆程序最大化经验权衡的可行性。

Details

Motivation: 解决在差分隐私文本混淆中,如何将整体隐私预算合理分配到文本组件,以实现有效的文档级隐私保护,并探索不同文本分解与预算分配方法对混淆效果的影响。

Result: 实验表明,即使在可比隐私预算下,不同方法选择会导致显著不同的结果,为优化差分隐私混淆程序提供了实证依据。

Insight: 创新点在于系统性地结合文本分解与预算分配策略,揭示了设计选择的关键作用;可借鉴之处在于通过方法组合优化隐私-效用权衡,为实际应用提供指导。

Abstract: The goal of differentially private text obfuscation is to obfuscate, or “perturb”, input texts with Differential Privacy (DP) guarantees, such that the private output texts are quantifiably indistinguishable from the originals. While perturbation at the word level is intuitive, meaningful text privatization happens on complete documents. Recent research has laid the groundwork for reasoning about privacy budget distribution, namely, how an overall $\varepsilon$ budget can be sensibly distributed among the component pieces of a text. We perform a systematic evaluation of multiple text decomposition and budget distribution techniques in the context of DP text obfuscation, testing how different methods for chunking texts can be combined with techniques for allocating $\varepsilon$ to these chunks. Our experiments reveal that such design choices are very important, as even with comparable privacy budgets, significantly different results can occur based on which methods are chosen. In this, we provide credible evidence of the feasibility of maximizing empirical trade-offs by optimizing DP obfuscation procedures.


[5] Teaching LLMs Brazilian Healthcare: Injecting Knowledge from Official Clinical Guidelines cs.CLPDF

Hugo Abonizio, Filipe Rocha Lopes, Roberto Lotufo, Rodrigo Nogueira

TL;DR: 该论文针对巴西统一医疗系统(SUS)的官方临床指南,通过知识注入方法,将Qwen2.5-14B-Instruct模型适配到巴西临床领域。研究从178份官方指南生成约7000万tokens的合成数据,并采用持续预训练和GRPO强化学习进行优化。同时,论文提出了HealthBench-BR和PCDT-QA两个基准测试,用于评估模型在巴西葡萄牙语临床知识上的表现。最终模型在仅140亿参数下,性能超越了GPT-5.2、Claude Sonnet 4.6等更大模型。

Details

Motivation: 当前大语言模型(LLMs)在巴西官方临床指南特定知识上表现不佳,且缺乏基于巴西葡萄牙语协议的临床知识召回基准测试。论文旨在填补这一空白,提升LLMs在巴西医疗领域的专业能力。

Result: 在提出的HealthBench-BR基准(1780个平衡的真/假临床断言)上达到83.9%的准确率,在PCDT-QA基准(890个开放式临床问题,由LLM法官评分)上达到85.4%的准确率。该结果超越了GPT-5.2、Claude Sonnet 4.6、Gemini 3.1 Pro以及Google AI Overview的基于网络的RAG系统,达到了SOTA水平。

Insight: 创新点包括:1)从官方指南生成多种格式(重述、维基风格文章、问答对)的合成数据以注入领域知识;2)采用持续预训练结合GRPO强化学习进行优化;3)创建了首个针对巴西葡萄牙语临床知识的基准测试HealthBench-BR和PCDT-QA。客观分析认为,生成器模型的多样性和强化学习的应用是性能提升的关键,为特定语言和领域的LLMs适配提供了可复现的方法论。

Abstract: Brazil’s Unified Health System (SUS) relies on official clinical guidelines that define diagnostic criteria, treatments, dosages, and monitoring procedures for over 200 million citizens. Yet current LLMs perform poorly on this guideline-specific knowledge, and no benchmark evaluates clinical recall grounded in Brazilian Portuguese protocols. We address this gap by adapting Qwen2.5-14B-Instruct to the Brazilian clinical domain. From 178 official guidelines (~5.4M tokens), we generate ~70M tokens of synthetic data in three formats – rephrases, wiki-style articles, and question-answer pairs – using four generator LLMs. We then apply continual pre-training followed by Group Relative Policy Optimization (GRPO). We introduce HealthBench-BR, with 1,780 balanced true/false clinical assertions, and PCDT-QA, with 890 open-ended clinical questions scored by an LLM judge. Our best model achieves 83.9% on HealthBench-BR and 85.4% on PCDT-QA, outperforming GPT-5.2, Claude Sonnet 4.6, Gemini 3.1 Pro, and Google AI Overview’s web-grounded RAG despite having only 14B parameters. Ablations show that generator diversity and reinforcement learning are critical to these gains. We release all datasets, benchmarks, and model weights to support reproducible clinical NLP research for Brazilian Portuguese. Code, data, and model weights are available at https://github.com/hugoabonizio/clinical-protocols-br


[6] Beyond Semantic Relevance: Counterfactual Risk Minimization for Robust Retrieval-Augmented Generation cs.CL | cs.IRPDF

Peiyang Liu, Qiang Yan, Ziqiang Cui, Di Liang, Xi Wang

TL;DR: 本文提出了CoRM-RAG框架,旨在解决标准检索增强生成(RAG)系统因过度依赖语义相关性而在用户查询存在认知偏差(如错误前提或确认偏误)时,反而检索出强化幻觉的谄媚证据的问题。该框架通过反事实风险最小化,将检索目标从相似性对齐转向决策安全性。

Details

Motivation: 标准RAG系统假设语义相关性等同于信息效用,但在现实的决策场景中,用户查询常带有认知偏差,导致最大化相关性反而会检索出强化错误(幻觉)的证据,即存在“相关性-鲁棒性鸿沟”。

Result: 在决策基准测试上的广泛实验表明,CoRM-RAG在对抗性设置下显著优于强大的密集检索器和基于LLM的重排序器,并能通过可靠的鲁棒性评分实现有效的风险感知弃权。

Insight: 核心创新在于利用因果干预思想,通过“认知扰动协议”在训练中模拟用户偏见,并蒸馏成一个轻量级的“证据评判器”模块,该模块学习识别具有足够证据强度的文档,以在对抗性查询扰动下引导模型走向正确。这为RAG系统从追求相关性转向保障决策安全提供了新思路。

Abstract: Standard Retrieval-Augmented Generation (RAG) systems predominantly rely on semantic relevance as a proxy for utility. However, this assumption collapses in realistic decision-making scenarios where user queries are laden with cognitive biases, such as false premises or confirmation bias. In such cases, maximizing relevance paradoxically promotes the retrieval of sycophantic evidence that reinforces hallucinations, a critical failure we term the ``Relevance-Robustness Gap’’. To bridge this gap, we propose CoRM-RAG (Counterfactual Risk Minimization for RAG), a framework that aligns retrieval with decision safety rather than mere similarity. Grounded in causal intervention, we introduce a Cognitive Perturbation Protocol to simulate user biases during training, which is then distilled into a lightweight Evidence Critic. This scoring module learns to identify documents that possess sufficient evidential strength to steer the model toward correctness despite adversarial query perturbations. Extensive experiments on decision-making benchmarks demonstrate that CoRM-RAG significantly outperforms strong dense retrievers and LLM-based rerankers in adversarial settings, while enabling effective risk-aware abstention through reliable robustness scoring. Our code is available at https://github.com/PeiYangLiu/CoRM-RAG.git.


[7] OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice cs.CLPDF

Rongyang Wang, Shuang Zhou, Jiashuo Wang, Wenya Xie, Xiaoxia Che

TL;DR: 本文提出了OralMLLM-Bench,一个用于评估多模态大语言模型在牙科实践中认知能力的综合基准。该基准涵盖根尖片、全景片和侧位头影测量片三种关键影像模态,定义了感知、理解、预测和决策四个认知类别,包含27个基于公共数据集构建的临床任务,并提供了手动标注和3820份临床医生评估。研究评估了包括GPT-5.2和GLM-4.6在内的六个前沿MLLM,揭示了MLLM与临床医生之间的性能差距,并分析了模型的优势、局限性和失败模式。

Details

Motivation: 多模态大语言模型在牙科影像分析中展现出潜力,但其是否具备进行影像分析所需的多层次认知能力尚不明确。因此,需要建立一个基准来系统评估MLLM在牙科影像分析中的认知能力,以促进开发符合临床认知、安全要求和工作流程复杂性的下一代人工智能系统。

Result: 在OralMLLM-Bench上评估了六个前沿MLLM(包括GPT-5.2和GLM-4.6)。结果表明,MLLM与临床医生在牙科实践中的表现存在差距。研究详细描绘了模型的优势和局限性,并分析了其失败模式。

Insight: 创新点在于构建了一个专门针对牙科影像分析、涵盖多种影像模态和多层次认知过程的综合评估基准。该基准将认知过程细化为感知、理解、预测和决策四个类别,并基于真实临床任务和医生评估,为评估和提升MLLM在专业领域的认知对齐能力提供了系统性的方法和数据资源。

Abstract: Multimodal large language models (MLLMs) have emerged as a promising paradigm for dental image analysis. However, their ability to capture the multi-level cognitive processes required for radiographic analysis remains unclear. Here, we present a comprehensive benchmark to evaluate the cognitive capabilities of MLLMs in dental radiographic analysis. It spans three critical imaging modalities, i.e., periapical, panoramic, and lateral cephalometric radiographs, and defines four cognitive categories: perception, comprehension, prediction, and decision-making. The benchmark comprises 27 clinically grounded tasks derived from public datasets, with manually curated annotations and 3,820 clinician assessments for evaluation. Six frontier MLLMs, including GPT-5.2 and GLM-4.6, are evaluated. We demonstrate the performance gap between MLLMs and clinicians in dental practice, delineate model strengths and limitations, characterize failure patterns, and provide recommendations for improvement. This data resource will facilitate the development of next-generation artificial intelligence systems aligned with clinical cognition, safety requirements, and workflow complexity in dental practice.


[8] A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis cs.CLPDF

Muhammad Arslan Manzoor, Dilshod Azizov, Daniil Orel, Umer Siddique, Zain Muhammad Mujahid

TL;DR: 本文提出了一个多视角媒体画像分析套件,通过构建大规模数据集MBFC-2025和多视角表示(包括Alexa图、超链接图、LLM生成图、文章和维基百科描述),系统评估了嵌入视角和融合策略,并在ACL-2020数据集上实现了SOTA结果,在MBFC-2025上建立了强基准。

Details

Motivation: 解决新闻媒体政治偏见和事实性自动检测领域缺乏统一资源、全面评估和系统分析的问题,特别是在标签稀疏和数据集多样性的挑战下,填补了关于有效方法、失败原因及经验性发现的空白。

Result: 在ACL-2020数据集上取得了最先进(SOTA)的结果,并在MBFC-2025上建立了强基准,通过广泛实验验证了方法的有效性。

Insight: 创新点包括引入大规模标签集MBFC-2025、构建多视角表示、系统评估嵌入和融合策略(如基于强化学习的融合变体),以及提供经验驱动的分析,为媒体画像研究提供了统一框架和可复现基准。

Abstract: News outlets shape public opinion at a scale that makes automated detection of political bias and factuality essential. However, the field still lacks unified resources, comprehensive evaluations across diverse approaches, and systematic analyses of the representations and fusion strategies that matter most, especially under label sparsity and dataset diversity. In addition, there is little empirical work reporting broad, observation-driven findings about what consistently works, what fails, and why. We address these gaps through four main contributions. First, we introduce MBFC-2025, a large-scale label set covering approximately 2,600 outlets from Media Bias/Fact Check (MBFC). Second, we construct multiview representations for ACL-2020 (Panayotov et al., 2022), which includes around 900 outlets, as well as for MBFC-2025. These representations span Alexa graphs, hyperlink graphs, LLM-derived graphs, articles, and Wikipedia descriptions. Third, we provide a systematic evaluation and analysis of embedding views and fusion strategies, including a reinforcement learning-based fusion variant. Fourth, we conduct extensive experiments that achieve state-of-the-art results on ACL-2020 and establish strong benchmarks on MBFC-2025.


[9] MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate cs.CL | cs.AI | cs.LGPDF

Jianze Wang, Ying Liu, Jinlong Chen, Xuchun Hu, Qilong Zhang

TL;DR: 本文提出MAD-OPD(多智能体辩论驱动的在线策略蒸馏),通过将传统的单一教师模型扩展为一个由多个教师组成的审议集体,来打破在线策略蒸馏(OPD)中存在的单一教师能力上限问题。该方法让教师集体对学生的在线策略状态进行辩论,产生一种涌现的集体智能来提供细粒度的监督,并根据辩论后各教师的置信度加权其贡献。同时,论文还引入了在线策略智能体蒸馏(OPAD)以将OPD扩展到智能体任务中,并提出了任务自适应的散度选择原则。

Details

Motivation: 现有在线策略蒸馏方法受限于单一教师的能力上限:当教师出错时,学生会继承错误。此外,OPD在智能体任务中探索不足,其中单步错误会在长轨迹中累积并导致训练不稳定。

Result: 在六种师生模型配置(Qwen3和Qwen3.5系列,学生模型1.7B-14B,教师模型8B-32B)和五个智能体与代码生成基准测试上,MAD-OPD在所有配置中均排名第一。在14B+8B→4B的配置下,相比更强的单一教师OPD方法,其在智能体任务上的平均性能提升了2.4%,在代码生成任务上提升了3.7%。

Insight: 核心创新在于用多智能体辩论机制取代单一教师,通过辩论产生集体智能监督信号,从而突破单一教师的能力瓶颈。此外,为智能体任务设计的OPAD框架通过步级采样稳定训练,以及根据任务类型(智能体稳定性 vs. 代码生成)自适应选择JSD或反向KL散度的原则,也具有借鉴意义。

Abstract: On-policy distillation (OPD) trains a student on its own trajectories under token-level teacher supervision, but existing methods are capped by a single-teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per-step errors compound across long trajectories and destabilize training. We propose MAD-OPD (Multi-Agent Debate-driven On-Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student’s on-policy state; the debate produces an emergent collective intelligence that supplies token-level supervision, with each teacher’s contribution weighted by its post-debate confidence. To extend OPD to agentic tasks, we also introduce On-Policy Agentic Distillation (OPAD), which adds step-level sampling to stabilize training under multi-step error compounding. We additionally derive a task-adaptive divergence principle, selecting JSD (Jensen-Shannon divergence) for agentic stability and reverse KL (Kullback-Leibler) divergence for code generation, and verify it both theoretically and empirically. Across six teacher-student configurations (Qwen3 and Qwen3.5; 1.7B-14B students, 8B-32B teachers) and five agentic and code benchmarks, MAD-OPD ranks first across all six configurations; on the 14B+8B$\to$4B setting it lifts the agentic average by $+2.4%$ and the code average by $+3.7%$ over the stronger single-teacher OPD.


[10] LLM Output Detectability and Task Performance Can be Jointly Optimized cs.CLPDF

Koshiro Saito, Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki

TL;DR: 本文提出PUPPET框架,通过强化学习微调大语言模型(LLM),旨在联合优化机器生成文本的可检测性和下游任务性能。实验表明,该方法在长问答、摘要和文章写作任务上,既能达到与水印方法相当的高可检测性,又能在下游任务上超越水印方法,且优化高效、鲁棒性强。

Details

Motivation: 现有水印方法虽能可靠检测机器生成文本,但常导致LLM在下游任务上性能下降,因此需要一种能同时提升可检测性和任务性能的方法。

Result: 在长问答、摘要和文章写作任务上,PUPPET训练的LLM在可检测性上与水印方法竞争,且下游任务性能更优;优化仅需数千样本和1-2 GPU小时,并在跨域任务、不同LLM家族和模型规模上保持一致性,且能抵抗改写攻击。

Insight: 创新点在于通过强化学习联合优化可检测性和任务性能,使用检测器和评估器作为奖励函数,实现了高效且鲁棒的优化,为部署LLM时的透明性提供了新思路。

Abstract: Detecting machine-generated text is essential for transparency and accountability when deploying large language models (LLMs). Among detection approaches, watermarking is a statistically reliable method by design – it embeds detectable signals into LLM outputs by biasing their token distributions. However, it has been reported that watermarked LLMs often perform worse on downstream tasks. We propose PUPPET, a framework that fine-tunes an LLM via reinforcement learning to generate text that is both more detectable and better performing on downstream tasks. We use two reward functions: a detector that outputs a machine-class likelihood and an evaluator that measures a task-specific metric. Experiments on long-form QA, summarization, and essay writing show that LLMs trained with PUPPET achieve high detectability competitive with watermarking methods while outperforming them on downstream tasks. The analysis shows that this optimization can be performed efficiently with only a few thousand samples in 1–2 GPU hours. Moreover, these gains are consistent across out-of-domain tasks, different LLM families, and model sizes, and are even robust to paraphrasing attacks.


[11] Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast cs.CL | cs.AIPDF

Jinyuan Feng, Xin Yu, Yiqun Chen, Xiaochi Wei, Yan Gao

TL;DR: 本文提出了一种名为FoCore(Focus on the Core)的无训练解码策略,用于提升扩散大语言模型(DLMs)的生成质量和效率。该方法通过识别并利用上下文中的高信息密度(HD)令牌,以自对比的方式引导生成过程,并进一步提出了加速变体FoCore_A,通过并行解码来减少推理时间。

Details

Motivation: 当前扩散大语言模型的解码策略未能充分利用其全局上下文建模的优势,通常表现出局部偏好,忽视了上下文中的信息密度异质性,从而降低了生成质量。

Result: 在数学、代码和逻辑推理基准测试(如HumanEval)上的实验表明,FoCore在LLaDA和Dream骨干网络上一致提升了生成质量和效率。例如,在HumanEval上,FoCore将pass@1从39.02提升至42.68,而FoCore_A将解码步骤减少了2.07倍,每样本延迟从20.76秒降至8.64秒(降低58.4%)。

Insight: 创新点在于系统地研究了高信息密度令牌的特性(如早期解码倾向),并提出了基于自对比的无训练解码策略,通过临时将HD令牌作为负样本来引导生成,以及通过检测HD令牌收敛进行并行解码以实现加速,这为优化扩散模型的解码过程提供了新思路。

Abstract: The iterative denoising paradigm of Diffusion Large Language Models (DLMs) endows them with a distinct advantage in global context modeling. However, current decoding strategies fail to leverage this capability, typically exhibiting a local preference that overlooks the heterogeneous information density within the context, ultimately degrading generation quality. To address this limitation, we systematically investigate high-information-density (HD) tokens and present two key findings: (1) explicitly conditioning on HD tokens substantially improves output quality; and (2) HD tokens exhibit an early-decoding tendency, converging earlier than surrounding tokens. Motivated by these findings, we propose Focus on the Core \textbf{(FoCore)}, a training-free decoding strategy that utilizes HD tokens in a self-contrast manner, wherein HD tokens are temporarily remasked as negative samples, to guide generation. We further introduce FoCore_Accelerate \textbf{(FoCore_A)}, an efficient variant that, upon detecting HD token convergence, performs parallel decoding over stable candidates within a local context window, substantially accelerating generation. Extensive experiments on math, code and logical reasoning benchmarks demonstrate that FoCore consistently improves generation quality and efficiency across both LLaDA and Dream backbones. For instance, on HumanEval, FoCore improves pass@1 from 39.02 to 42.68 over standard Classifier-Free Guidance, while FoCore-A reduces the number of decoding steps by 2.07x and per-sample latency from 20.76s to 8.64s (-58.4%).


[12] Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning cs.CL | cs.AI | cs.IRPDF

Sangkwon Park, Donghun Kang, Jisoo Mok, Sungroh Yoon

TL;DR: 论文提出Verbal-R3框架,通过引入Verbal Annotations(语言化注释)作为检索结果与LLM推理能力之间的桥梁,以解决传统RAG中原始检索文本与LLM上下文整合不佳的问题。该框架包含生成器和语言化重排序器,结合相关性引导的测试时扩展,在复杂问答基准上实现了SOTA性能。

Details

Motivation: 传统RAG将原始检索文本直接注入LLM上下文,导致检索信息整合效果不理想,因此需要一种能明确表达查询与检索上下文逻辑连接的方法来提升LLM的推理与回答能力。

Result: 在复杂问答基准上实现了最先进的性能,验证了框架的有效性。

Insight: 创新点在于提出语言化注释作为检索与推理的桥梁,并构建了包含生成器和语言化重排序器的智能体RAG框架,通过测试时计算资源的高效分配进一步优化推理过程。

Abstract: The conventional Retrieval-Augmented Generation (RAG) paradigm of injecting raw retrieved texts into the Large Language Model (LLM)’s context often results in suboptimal integration of retrieved information. This paper proposes to bridge retrieval results and the LLM’s reasoning ability through Verbal Annotations, analytic narratives that explicitly articulate the logical connection between a search query and retrieved contexts. Our empirical investigation reveals the potential of Verbal Annotations to substantially enhance the LLM’s ability to generate accurate, contextually-grounded responses. Motivated by this finding, we introduce Verbal-R3, a novel agentic RAG framework that consists of a Generator and a Verbal Reranker. The Generator performs iterative retrieval and reasoning, while the Verbal Reranker returns relevance scores and Verbal Annotations to guide the reasoning and answering process of the Generator. The inference process of Verbal-R3 is further refined through relevance-guided test-time scaling, which efficiently allocates test-time compute for effective trajectory expansion. Verbal-R3 achieves state-of-the-art performance on complex Question Answering benchmarks, validating the effectiveness of the proposed framework.


[13] Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression cs.CL | cs.CV | cs.LGPDF

Yao Du, Shanshan Li, Xiaomeng Li

TL;DR: 本文提出了一种基于分布感知的强化学习框架,通过Group Relative Policy Optimization和Concordance Correlation Coefficient奖励机制,解决多模态大语言模型在长尾分布下数值回归任务中存在的回归均值偏差和尾部性能差的问题。

Details

Motivation: 现有MLLM训练范式缺乏跨样本关系监督,导致在长尾目标分布下进行数值回归时,模型偏向高密度区域,出现回归均值行为和尾部性能不佳。

Result: 在统一的长尾回归基准测试套件上,该方法相比SFT和现有MLLM回归方法取得了一致的性能提升,尤其在中等样本和少样本场景下增益显著。

Insight: 创新点在于引入了批级别基于比较的监督(通过一致性相关系数奖励),从相关性、尺度和均值三个维度对齐预测与真实分布,且无需修改模型架构即可即插即用。

Abstract: Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level comparison-based supervision via the Concordance Correlation Coefficient-based reward to align predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.


[14] Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks cs.CL | cs.AIPDF

Benjamin Warner, Ratna Sagari Grandhi, Max Kieffer, Aymane Ouraq, Saurav Panigrahi

TL;DR: 本文介绍了Medmarks,一个完全开源的医学大语言模型(LLM)评估套件,包含30个基准测试,涵盖问答、信息抽取、医学计算和开放式临床推理等任务。研究对61个模型的71种配置进行了系统评估,发现前沿推理模型(如Gemini 3 Pro Preview、GPT-5.1/5.2)表现最佳,专有模型在token效率上显著优于开源模型,医学微调模型优于通用模型,且模型(尤其是小模型和Grok 4)存在答案顺序偏差。部分评估(Medmarks-T)可直接用作强化学习环境来微调LLM的医学推理能力。

Details

Motivation: 解决医学领域LLM评估的挑战,包括基准测试饱和、数据可访问性有限以及相关任务覆盖不足的问题,现有评估套件要么已饱和,要么严重依赖受限数据集,或缺乏全面的模型覆盖。

Result: 在30个基准测试上使用可验证指标和LLM-as-a-Judge进行系统评估,结果显示:前沿推理模型(Gemini 3 Pro Preview、GPT-5.1、GPT-5.2)在基准测试中表现最佳;前沿专有模型在token效率上显著优于开源替代模型;医学微调模型优于其通用对应模型;模型(尤其是小模型和Grok 4)易受答案顺序偏差影响。

Insight: 创新点在于提供了一个全面、开源、可复现的医学LLM评估套件,覆盖广泛任务并支持直接用作强化学习环境;客观分析认为,其系统性评估框架和关注token效率、偏差等实际应用问题具有借鉴价值。

Abstract: Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites have either saturated, heavily depend on restricted datasets, or lack comprehensive model coverage. We introduce Medmarks, a fully open-source evaluation suite with 30 benchmarks spanning question answering, information extraction, medical calculations, and open-ended clinical reasoning. We perform a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM-as-a-Judge. Our results show that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, & GPT-5.2) achieve the highest performance across both benchmarks, most frontier proprietary models are significantly more token efficient than open-weight alternatives, medically fine-tuned models outperform their generalist counterparts, and that models are susceptible to answer-order bias (particularly smaller models and Grok 4). A subset of our evals (Medmarks-T) can be directly used as reinforcement learning environments to post-train LLMs for medical reasoning. Code is available at https://github.com/MedARC-AI/Medmarks


[15] ReMedi: Reasoner for Medical Clinical Prediction cs.CLPDF

Yushi Cao, Yiming Chen, Hongchao Jiang, Hung-yi Lee, Robby T. Tan

TL;DR: 本文提出了ReMedi框架,旨在通过生成理由-答案对来改进基于电子健康记录(EHR)的临床结果预测。该框架利用具有挑战性的样本再生机制,结合真实答案作为提示来增强推理,并通过微调和偏好调优提升模型性能。

Details

Motivation: 解决从复杂且异质的电子健康记录中预测临床结果的挑战,现有LLM方法主要依赖模型内部能力解释上下文,缺乏对推理过程的显式增强。

Result: 在多个EHR预测任务上的实验表明,ReMedi在F1分数上比当前最先进基线方法提升高达19.9%,实现了显著的性能增益。

Insight: 创新点在于将真实结果指导集成到偏好数据构建循环中,通过再生理由-答案变体进行调优,从而显式提升模型的临床推理能力。

Abstract: Predicting future clinical outcomes from electronic health records (EHR) remains challenging due to the complexity and heterogeneity of patient data. LLMs have shown strong potential for such predictive tasks, yet existing approaches mainly focus on enhancing medical knowledge through distillation or RAG while relying on the model’s internal ability to interpret contextual information. In this work, we present ReMedi (Reasoner for Medical Clinical Prediction), a framework for improving clinical outcome prediction from EHR. ReMedi generates rationale-answer pairs using a challenging sample regeneration mechanism for complex clinical questions, which leverages ground-truth answers as hints to enhance reasoning for further fine-tuning and preference tuning. ReMedi integrates ground-truth outcome guidance into the preference data construction loop, regenerating rationale-answer variants. By tuning on these rationale-answer pairs, the model improves its predictive performance. Experiments on multiple EHR prediction tasks demonstrate substantial gains of up to 19.9 percent over state-of-the-art baselines in terms of F1 score, underscoring ReMedi’s effectiveness in real-world clinical prediction.


[16] FT-RAG: A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning cs.CL | cs.AIPDF

Zebin Guo, Weidong Geng, Ruichen Mao

TL;DR: 本文提出了FT-RAG,一个用于复杂表格推理的细粒度检索增强生成框架。该框架通过将表格分解为条目级语义单元来构建结构化知识图谱,并利用结构邻居扩展机制进行图谱检索,再通过多模态融合整合检索到的表格上下文。为了解决该领域专业数据集的稀缺问题,作者还构建了一个名为Multi-Table-RAG-Lib的基准测试集。实验表明,FT-RAG在多项指标上超越了现有基线模型,在表格级和单元格级命中率上分别提升了23.5%和59.2%,并在生成准确率上取得了显著提升,为混合模态文档的复杂推理任务树立了新的SOTA性能。

Details

Motivation: 传统的检索增强生成(RAG)系统在处理结构化表格数据时表现不佳,主要原因是检索粒度过于粗糙以及对表格语义理解不足。本文旨在解决这些问题,提升LLMs在复杂表格推理任务上的能力。

Result: FT-RAG在作者新构建的Multi-Table-RAG-Lib基准测试集(包含9870个高复杂度QA对)上进行了评估。它在所有指标上都超越了表现最佳的基线模型,具体表现为:表格级命中率提升23.5%,单元格级命中率提升59.2%,生成任务中的精确值准确率召回提升了62.2%。这些结果验证了其在纯表格和异构表格-文本上下文中的事实依据有效性,并确立了新的SOTA性能。

Insight: 论文宣称的创新点包括:1) 提出了一种细粒度的、基于知识图谱的RAG框架,通过将表格分解为条目级单元来提升检索精度;2) 设计了结构邻居扩展机制以查找语义关联实体;3) 引入了多模态融合来整合检索到的表格上下文;4) 构建了一个高复杂度的多表格推理基准数据集。从客观角度看,其核心创新在于将表格数据转化为结构化图谱进行细粒度检索,并系统性地解决了表格与文本混合模态下的信息融合与推理难题,为RAG在结构化数据上的应用提供了新思路。

Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding responses in external knowledge during inference. However, conventiona RAG systems under-perform on structured tabular data, largely due to coarse retrieval granularity and insufficient table semantic comprehension. To address these limitations, we introduce FT-RAG, a fine-grained framework that employs knowledge association by decomposing tables into entry-level semantic units to construct a structured graph. FT-RAG employs a structural neighbor expansion mechanism to find semantically connected entities during graph retrieval, followed by multi-modal fusion to consolidate the context of table retrieval results. Further, to address the scarcity of specialized datasets in this domain, we introduce Multi-Table-RAG-Lib, a benchmark comprising 9870 QA pairs with high complexity and difficulty, curated to demand multi-table integration and text-table information fusion for reasoning. FT-RAG surpasses top-performing baselines across all metrics, achieving a 23.5% and 59.2% improvement in table-level and cell-level Hit Rates, respectively. Generation performance also sees a remarkable 62.2% increase in exact value accuracy recall. These metrics verify the framework’s effectiveness in factual grounding across both pure tabular and heterogeneous table-text contexts. Therefore, our method establishes a new state-of-the-art performance for complex reasoning over mixed-modality documents.


[17] GRAVITY: Architecture-Agnostic Structured Anchoring for Long-Horizon Conversational Memory cs.CL | cs.AIPDF

Yushi Sun, Bowen Cao, Dong Fang, Lingfeng Su, Wai Lam

TL;DR: 本文提出GRAVITY,一种即插即用的结构化记忆模块,用于增强长程对话代理的推理能力。它从原始对话中提取实体关系图、时序事件链和跨会话主题摘要三种互补的知识表示,并在生成时将这些结构化上下文注入到宿主系统的提示中,从而将分散的证据合成为连贯的、与查询相关的上下文,无需修改宿主模型架构。

Details

Motivation: 现有长程对话代理的记忆检索机制虽然日益复杂,但检索到的片段通常以非结构化文本形式输入语言模型,缺乏关系、时序和主题结构,这限制了复杂推理能力。本文旨在弥合这一推理鸿沟。

Result: 在LongMemEval和LoCoMo基准测试上对五种不同记忆系统进行了广泛评估。平均而言,GRAVITY将LLM-judge准确率提高了7.5%到10.1%。提升效果与基线强度呈负相关:最弱的宿主系统提升了12.2%,最强的仍提升了3.8%到5.7%。

Insight: 创新点在于提出了一种架构无关的结构化上下文锚定范式,通过提取并注入三种互补的结构化知识表示(实体关系、时序因果、主题摘要),有效整合分散证据以支持连贯推理。该方法作为即插即用模块,无需修改宿主模型,具有广泛的适用性。

Abstract: Long-horizon conversational agents rely on memory systems with increasingly sophisticated retrieval mechanisms. However, retrieved fragments are typically fed to the language model as unstructured text, lacking the relational, temporal, and thematic structures essential for complex reasoning. To bridge this reasoning gap, we introduce GRAVITY (\textbf{G}eneration-time \textbf{R}elational \textbf{A}nchoring \textbf{V}ia \textbf{I}njected \textbf{T}opological Memor\textbf{Y}), a plug-and-play structured memory module. GRAVITY extracts three complementary knowledge representations from raw conversational utterances: entity profiles grounded in relational graphs, temporal event tuples linked into causal traces, and cross-session topic summaries. At generation time, it injects these representations into the host system’s prompt as structured anchoring contexts. This approach effectively synthesizes scattered evidence into a coherent, query-relevant context without requiring any architectural modifications to the host model. Extensive evaluations across five diverse memory systems on the LongMemEval and LoCoMo benchmarks demonstrate the efficacy of our approach. On average, GRAVITY improves LLM-judge accuracy by 7.5–10.1%. Gains are inversely correlated with baseline strength: the weakest host improves by 12.2% while the strongest still gains 3.8–5.7%. These findings establish structured context anchoring as a broadly effective, architecture-agnostic augmentation paradigm for long-horizon conversational memory.


[18] The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning cs.CLPDF

Kwan Soo Shin

TL;DR: 本文提出了’推理陷阱’概念,指出当相同语言模型进行多轮辩论时,会保留答案准确性但损害推理过程的忠实性。作者建立了包含SFS指标、EGSR方法和DPI定理的理论框架,通过实验证明传统辩论方法会显著降低证据忠实度,而提出的EGSR方法能恢复98%的忠实性。

Details

Motivation: 解决多智能体辩论中存在的’辩论陷阱’问题——相同模型的多轮交互会保持答案准确性但损害推理过程的证据忠实性,揭示封闭系统推理的固有局限性。

Result: 在SciFact和FEVER数据集上的16种实验条件下,传统辩论方法使SFS指标下降43%-98.3%,而EGSR方法能恢复98%的忠实性;人类评估者的一致性也很低(Fleiss kappa ≤ +0.018)。

Insight: 提出了基于数据处理不等式的理论边界(DPI Bound),证明封闭系统推理存在信息衰减的必然性;创新性地将证据忠实度与答案准确性分离评估;用证据驱动的苏格拉底式推理替代对抗性辩论。

Abstract: When copies of the same language model are prompted to debate, they produce diverse phrasings of one perspective rather than diverse perspectives. Multi-agent debate (MAD), and more broadly closed-system reasoning where agents iteratively transform each other’s outputs, tends to preserve answer accuracy while degrading the reasoning behind those answers. We name the multi-agent case the Debate Trap and the broader phenomenon the Reasoning Trap, offering a programmatic theory of evidence-grounded reasoning failure.The framework has three parts: (i) SFS (Supported Faithfulness Score), a claim-level metric verifying decomposed atomic claims against provided evidence (decomposer-invariant rankings: Spearman rho=1.0); (ii) EGSR (Evidence-Grounded Socratic Reasoning), replacing adversarial argumentation with evidence-grounded inquiry; (iii) Theorem 1 (DPI Bound): under standard MAD, the chain E -> O^0 -> O^1 -> … is Markov, and the Data Processing Inequality implies E[I(E;O^{t+1})] <= E[I(E;O^t)]. Three companion results – open-system recovery (Theorem 2), EGSR accumulation (Lemma 2), and vote-aggregation floor (Proposition 1) – partition multi-step LLM reasoning by its information-theoretic relationship to E. Across 16 conditions on SciFact (300 claims) and FEVER (1,000 claims), DebateCV (C13) preserves 88% of baseline accuracy while SFS drops 43%; majority-vote MAD (C15) reduces SFS to 1.7% of baseline (p < 10^{-6}, d = -0.96); EGSR recovers 98%. An R6 cohort study (Korean n=10x30 FEVER; English n=3x200 SciFact) finds inter-rater Fleiss kappa <= +0.018 with 0.8-1.4 Likert intra-rater shifts across language and domain – the human agreement that faithfulness metrics have been calibrated against is not itself stable. We offer one falsifiable conjecture: any closed-system reasoning protocol preserving Theorem 1’s Markov structure is, in expectation, subject to the same DPI bound.


[19] Only Say What You Know: Calibration-Aware Generation for Long-Form Factuality cs.CLPDF

Wen Luo, Guangyue Peng, Liang Wang, Nan Yang, Wei Li

TL;DR: 本文提出了一种名为‘探索-承诺解耦’的新范式,旨在解决大型推理模型在长文本生成中的幻觉问题。通过将知识探索与最终答案输出分离,并实例化为‘校准感知生成’框架,该框架在推理过程中加入可靠性估计,并优先选择可靠内容输出,从而显著提升了事实准确性并降低了解码时间。

Details

Motivation: 现有提升事实性的方法(如弃权和事实性驱动优化)遵循‘耦合的探索-承诺’范式,无条件地将中间推理传播到最终输出,限制了信息选择和整合的细粒度控制。本文旨在通过解耦探索与承诺,使模型能够谨慎探索并谨慎回答,以更精细地控制信息流,从而减少长文本生成中的错误累积。

Result: 在五个长文本事实性基准测试和多个模型系列上,CAG将事实性提升了高达13%,同时将解码时间减少了高达37%。

Insight: 核心创新点是提出了‘探索-承诺解耦’这一原则性范式,并将其具体实现为CAG框架。该框架通过为中间推理步骤提供校准后的可靠性估计,并基于此在最终输出中优先选择可靠内容,实现了端到端的校准感知生成能力。这为构建更可靠、更具自我意识的长文本生成系统提供了新方向。

Abstract: Large Reasoning Models achieve strong performance on complex tasks but remain prone to hallucinations, particularly in long-form generation where errors compound across reasoning steps. Existing approaches to improving factuality, including abstention and factuality-driven optimization, follow a \emph{coupled exploration-commitment} paradigm, in which intermediate reasoning is unconditionally propagated to the final output, limiting fine-grained control over information selection and integration. In this paper, we propose an \textbf{Exploration-Commitment Decoupling} paradigm that disentangles knowledge exploration from final commitment, enabling models to explore with awareness while answering cautiously. We instantiate the paradigm with \textbf{Calibration-Aware Generation (CAG)}, a framework that equips models with end-to-end, calibration-aware generation capabilities, by augmenting intermediate reasoning with calibrated reliability estimates and prioritizing reliable content in final outputs. Across five long-form factuality benchmarks and multiple model families, CAG improves factuality by up to 13%, while reducing decoding time by up to 37%. Overall, our work highlights decoupling as a principled approach for more reliable long-form generation, offering directions for trustworthy and self-aware generative systems.


[20] RMGAP: Benchmarking the Generalization of Reward Models across Diverse Preferences cs.CL | cs.AIPDF

Yangyang Zhou, Yi-Chen Li

TL;DR: 本文提出了RMGAP基准,用于评估奖励模型在多样化用户偏好下的泛化能力。该基准包含1,097个实例,覆盖聊天、写作、推理和安全四个领域,通过为每个提示生成四种不同语言风格的回答,并构建特定场景和释义变体来模拟多样偏好。对24个SOTA奖励模型的评估显示其泛化能力存在显著不足。

Details

Motivation: 现有奖励模型基准通常基于单一通用偏好设计,无法评估模型在多样化用户偏好下的泛化能力,因此需要一个新的基准来填补这一关键空白。

Result: 在RMGAP基准上评估了24个SOTA奖励模型,最佳模型仅达到49.27%的Best-of-N准确率,表明奖励模型泛化能力仍有很大提升空间。

Insight: 创新点在于构建了一个模拟真实世界多样化偏好的基准,通过生成不同风格的回答、设计特定场景提示和添加释义变体来系统评估奖励模型的泛化性能;客观来看,该方法为奖励模型的鲁棒性评估提供了更贴近实际应用的框架。

Abstract: Reinforcement Learning from Human Feedback has become the standard paradigm for language model alignment, where reward models directly determine alignment effectiveness. In this work, we focus on how to evaluate the generalizability of reward models. By “generalizability”, we mean the ability of RMs to correctly rank responses to align with diverse user preferences. However, existing reward model benchmarks are typically designed around a universal preference, failing to assess this generalization. To address this critical gap, we introduce RMGAP, a benchmark comprising 1,097 instances across Chat, Writing, Reasoning, and Safety domains. Since different users exhibit diverse preferences for the same task, we first generate four distinct responses with different linguistic profiles for each collected prompt. However, the original prompt set lacks the specificity to convey different preferences. We therefore construct tailored prompts by contrasting these candidates and designing scenarios in which one response becomes the uniquely appropriate choice. Moreover, we observe that users often express the same preference using different phrasings, and thus extend each prompt with two paraphrased variants. Our evaluation of 24 state-of-the-art RMs reveals their substantial limitations: even the best RM achieves only 49.27% Best-of-N accuracy, highlighting considerable room for improvement in reward model generalization. Related data and code are available at https://github.com/nanzhi84/RMGAP.


[21] Do Large Language Models Plan Answer Positions? Position Bias in Multiple-Choice Question Generation cs.CLPDF

Xuemei Tang, Xufeng Duan, Zhenguang G. Cai

TL;DR: 该论文研究了大型语言模型(LLMs)在生成多项选择题(MCQs)时存在的系统性位置偏差问题,即正确答案并非均匀分布在各个选项中。通过对10个LLMs和5个视觉语言模型(VLMs)在三个MCQ生成任务上进行广泛实验,发现这些偏差具有结构性,且在同一模型家族内呈现相似模式。通过探测实验,发现问题的隐藏表征编码了正确答案位置的预测信号,表明答案位置可能在生成过程中被隐式规划。基于此,研究应用激活引导技术来操纵内部表征并影响答案位置,结果表明引导可以部分控制位置偏好并显著改变答案位置分布。

Details

Motivation: 解决LLMs在生成多项选择题时存在的系统性位置偏差问题,以确保正确答案的均匀分布,这对于可靠的多选题构建和评估至关重要。

Result: 在三个MCQ生成任务上的实验表明,LLMs和VLMs存在结构性的位置偏差;激活引导技术可以部分控制位置偏好并显著改变答案位置分布。

Insight: 创新点在于揭示了LLMs在生成过程中可能隐式规划答案位置,并通过激活引导技术实现了对位置偏好的部分控制。这为研究LLMs的隐式规划行为提供了一个实用框架,并强调了可控生成在可靠MCQ构建中的重要性。

Abstract: Large language models (LLMs) are increasingly used to generate multiple-choice questions (MCQs), where correct answers should ideally be uniformly distributed across options. However, we observe that LLMs exhibit systematic position biases during generation. Through extensive experiments with 10 LLMs and 5 vision-language models (VLMs) on three MCQ generation tasks, we show that these biases are structured, with similar patterns emerging within model families. To investigate the underlying mechanisms, we conduct probing experiments and find that hidden representations in the question stem encode predictive signals of the correct answer position, suggesting that answer position may be implicitly planned during generation. Building on this insight, we apply activation steering to manipulate internal representations and influence answer position. Our results show that steering can partially control positional preferences and substantially shift answer position distributions. Our findings provide a practical framework for studying implicit positional planning in LLMs and highlight the importance of controllable generation for reliable MCQ construction and evaluation.


[22] Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models cs.CL | cs.AIPDF

Kotaro Furuya, Takahito Tanimura

TL;DR: 本文提出了一种名为StALT(时空潜在转移幅度)的无训练轨迹统计量,用于分析大型推理模型(LRMs)在解码过程中隐藏状态的时空动态。研究发现,成功的推理轨迹展现出独特的时空模式:广泛的时序动态与局部的层间集中性,而这一结构在非推理模型或知识密集型任务中较弱。StALT能够可靠地区分推理密集型任务中的正确与错误轨迹,为理解模型内部计算提供了超越输出评估的实证工具。

Details

Motivation: 尽管大型推理模型能生成冗长的解决方案,但尚不清楚这些输出痕迹是否反映了实质性的内部计算,还是仅仅是冗长和过度思考。现有的隐藏状态分析通常进行粗粒度聚合,可能掩盖了推理计算背后的词元和层结构。

Result: 在多种模型和基准测试中,StALT在推理密集型任务中能可靠地区分正确与错误的轨迹,其作为无标签正确性信号的性能与基于输出空间和长度的强基线方法相当。干预分析进一步表明,该时空幅度会系统性地响应增加或减少内部推理需求的操纵。

Insight: 创新点在于将隐藏状态动态分析从静态的层聚合扩展到解码步骤(时间)和层(空间)的联合时空维度,并形式化为一个可计算的统计量StALT。这为理解模型内部推理动态提供了新的、可测量的实证证据和实用探针,超越了仅基于输出的评估。

Abstract: Large reasoning models (LRMs) generate extended solutions, yet it remains unclear whether these traces reflect substantive internal computation or merely verbosity and overthinking. Although recent hidden-state analyses suggest that internal representations carry correctness-related signals, their coarse aggregations may obscure the token and layer structure underlying reasoning computation. We investigate hidden-state transitions across decoding steps and layers, and identify a distinct spatiotemporal pattern in LRMs: successful trajectories exhibit broad temporal dynamics with localized layer-wise concentration, while this structure is weaker in non-reasoning models and knowledge-heavy domains. We formalize this characteristic as Spatiotemporal Amplitude of Latent Transition (StALT), a training-free trajectory statistic that summarizes temporal changes between adjacent tokens weighted by within-token layer saliency. Across diverse models and benchmarks, StALT reliably separates correct from incorrect trajectories in reasoning-intensive regimes, providing a competitive label-free correctness signal alongside strong output-space and length-based baselines. Intervention analyses further show that this spatiotemporal amplitude responds systematically to manipulations that increase or reduce the demand for internal reasoning, supporting its association with latent reasoning dynamics in LRMs. These findings provide empirical evidence that LRMs exhibit measurable hidden-state dynamics and offer a practical probe for understanding internal computation beyond output-based evaluation.


[23] Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models cs.CLPDF

Nikolaos Giarelis, Charalampos Mastrokostas, Nikos Karacapilidis

TL;DR: 本文提出了Maistros 8B,一个通过知识蒸馏从大型推理模型(LRM)中学习并经过微调的、最先进的开源希腊语大语言模型。为了支持其开发,作者构建了CulturaQA数据集和一个高效的多语言评估框架,并在九个希腊语QA数据集上对九个LLM进行了全面评估。

Details

Motivation: 解决现有大语言模型在处理复杂查询时因内部知识有限或多步推理需求而产生的错误,以及大型推理模型参数量大、推理速度慢、难以部署的问题。同时,针对希腊语等资源匮乏语言缺乏高质量训练数据,导致现有多语言LLM性能不佳的研究空白。

Result: 开发了Maistros 8B模型,并在九个由人工整理的希腊语问答数据集上对九个LLM进行了评估。结果表明,Maistros 8B在希腊语任务上达到了最先进的性能水平。

Insight: 核心创新点在于通过知识蒸馏将大型推理模型(LRM)的推理能力迁移到更小、更高效的希腊语专用LLM中。这为解决资源匮乏语言的LLM性能问题提供了一种有效路径,即利用高质量、LRM生成并经人工校验的数据集进行蒸馏和微调,而非依赖海量原始文本。同时,提出的可适配多语言/任务的评估框架也具有通用价值。

Abstract: Large Language Models (LLMs) have substantially advanced the field of Natural Language Processing (NLP), achieving state-of-the-art performance across a wide range of tasks. These improvements have been attributed, in part, to their emerging reasoning capabilities, which are enabled by large-scale training and increased model capacity. However, existing LLMs can generate erroneous responses when addressing complex queries that fall outside their training distribution, due to limited internal knowledge or the need for multi-step reasoning. To address these limitations, recent work has introduced large reasoning models (LRMs), which incorporate explicit internal reasoning processes to improve response accuracy. Additionally, state-of-the-art LRMs often comprise hundreds of billions of parameters and require several seconds per inference, even on advanced multi-GPU systems. These characteristics limit their practicality for deployment in conventional computing environments. Meanwhile, NLP research on multilingual LLMs continues to prioritize high-resource languages. However, these models exhibit limited performance in under-resourced languages, primarily due to insufficient language- and culture-specific training data. In this paper, we focus on Modern Greek, for which only a limited number of question answering (QA) datasets have been proposed, most of which are intended for model evaluation. To address this research gap in Greek QA, we make the following contributions: (i) CulturaQA, a high-quality LRM-generated and human-curated dataset, for Greek LLM training and evaluation; (ii) a memory-efficient LLM evaluation framework adaptable to diverse languages and QA tasks; (iii) Maistros 8B, a state-of-the-art open-weights Greek LLM developed via knowledge distillation and fine-tuning on CulturaQA; and (iv) a comprehensive evaluation of nine LLMs across nine human-curated Greek QA datasets.


[24] StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models cs.CLPDF

Yongrui Chen, Yangyang Ma, Xiaoying Huang, Shenyu Zhang, Huajun Chen

TL;DR: 本文提出了StressEval,一个基于模型失败案例的动态数据合成框架,用于构建知识密集型推理任务的动态评测基准。该框架通过分析模型失败原因,生成具有挑战性且可控的测试实例,并以此构建了名为Dynamic OneEval的动态评测套件。

Details

Motivation: 解决静态基准测试因数据污染和过拟合而失效的问题,同时克服现有动态基准在提升难度时牺牲答案可回答性和可控性的缺陷。

Result: 在多个最先进的大语言模型上,Dynamic OneEval套件相比原始基准带来了更显著的性能下降,同时保留了明确的难度因素,便于进行更有针对性的模型迭代。

Insight: 创新点在于提出了一个由失败驱动的、三阶段(构建难度卡片、双视角实例合成、门控机制)的动态基准构建框架,其核心是将模型失败案例系统性地转化为可控的、有根据的挑战性测试,从而实现对模型能力更精准的评估和诊断。

Abstract: Static benchmarks for LLMs are increasingly compromised by contamination and overfitting especially on knowledge intensive reasoning tasks While recent dynamic benchmarks can alleviate staleness they often increase difficulty at the expense of answerability and controllability In this paper we propose StressEval a failure driven data synthesis framework that turns observed model failures into dynamic challenging and controllable test instances StressEval consists of three stages first it constructs a semi structured difficulty card that identifies the failed reasoning step and its root cause second it applies a dual perspective instance synthesis method that targets both knowledge gaps and reasoning breakdowns while preserving the underlying difficulty factors and third it applies a gating mechanism to retain only grounded unambiguous instances Seeding from multiple knowledge intensive reasoning datasets we employ StressEval to build Dynamic OneEval a focused suite of challenging dynamic benchmark Across several state of the art LLMs Dynamic OneEval yields substantially larger performance drops than the original benchmarks while retaining explicit difficulty factors enabling more actionable iteration


Weihang Su, Xuanyi Chen, Yueyue Wu, Qingyao Ai, Yiqun Liu

TL;DR: 本文提出Judge-R1框架,通过智能法律信息收集和准则引导优化,提升基于大语言模型的判决书生成质量。该框架结合动态规划代理进行多源法律信息检索,并采用基于综合法律奖励函数的强化学习进行优化,在JuDGE基准测试中显著优于现有方法。

Details

Motivation: 自动化判决书起草对司法效率至关重要,但现有方法(如检索增强生成和监督微调)存在证据召回不足、法条引用幻觉和逻辑推理缺陷等问题,需要同时改进法律信息收集和判决书生成质量。

Result: 在JuDGE基准测试上的大量实验表明,Judge-R1在法律准确性和生成质量方面显著优于最先进的基线模型,达到了SOTA水平。

Insight: 创新点在于将动态规划代理用于多源法律信息检索(Agentic Legal Information Collection)以及采用基于综合法律奖励函数的准则引导优化(Rubric-Guided Optimization),通过强化学习(GRPO)确保符合司法标准和推理逻辑,为法律文档生成提供了可借鉴的联合优化框架。

Abstract: Automating the drafting of judgment documents is pivotal to judicial efficiency, yet it remains challenging due to the dual requirements of comprehensive retrieval of legal information and rigorous logical reasoning. Existing approaches, typically relying on standard Retrieval-Augmented Generation and Supervised Fine-Tuning, often suffer from insufficient evidence recall, hallucinated statutory references, and logically flawed legal reasoning. To bridge this gap, we propose Judge-R1, a unified framework designed to enhance LLM-based judgment document generation by jointly improving legal information collection and judgment document generation. First, we introduce Agentic Legal Information Collection, which employs a dynamic planning agent to retrieve precise statutes and precedents from multiple sources. Second, we implement Rubric-Guided Optimization, a reinforcement learning phase utilizing Group Relative Policy Optimization (GRPO) with a comprehensive legal reward function to enforce adherence to judicial standards and reasoning logic. Extensive experiments on the JuDGE benchmark demonstrate that Judge-R1 significantly outperforms state-of-the-art baselines in both legal accuracy and generation quality.


[26] Counting as a minimal probe of language model reliability cs.CLPDF

Tianxiang Dai, Jonathan Fan

TL;DR: 本文通过引入稳定计数能力(Stable Counting Capacity)这一测试方法,评估大型语言模型在重复符号计数任务中的表现,发现其稳定计数能力远低于宣称的上下文限制,模型行为类似于使用有限内部状态(如数手指)进行计数,一旦资源耗尽,规则遵循能力就会崩溃,表明当前语言模型的流畅表现并不能保证其具有普遍可靠的规则遵循能力。

Details

Motivation: 研究动机在于探究大型语言模型在数学推理、编码和文档分析等基准测试上的成功表现,究竟是反映了其通用的逻辑能力、对学习过程的重复应用,还是仅仅模仿规则执行的模式匹配,旨在通过计数这一最小化任务来评估模型的程序可靠性。

Result: 在超过100个模型变体上的实验结果显示,稳定计数能力远低于广告的上下文限制,模型行为既不符合开放式逻辑,也不符合稳定应用学习规则,而是类似于使用有限内部状态进行计数,一旦资源耗尽,精确执行会崩溃为猜测。

Insight: 论文的创新点在于提出了稳定计数能力这一去除知识依赖、语义和歧义的评估方法,直接测量模型的程序可靠性,揭示了当前语言模型在规则遵循能力上的局限性,即其流畅表现可能依赖于有限的内部状态资源而非通用的逻辑能力。

Abstract: Large language models perform strongly on benchmarks in mathematical reasoning, coding and document analysis, suggesting a broad ability to follow instructions. However, it remains unclear whether such success reflects general logical competence, repeated application of learned procedures, or pattern matching that mimics rule execution. We investigate this question by introducing Stable Counting Capacity, an assay in which models count repeated symbols until failure. The assay removes knowledge dependencies, semantics and ambiguity from evaluation, avoids lexical and tokenization confounds, and provides a direct measure of procedural reliability beyond standard knowledge-based benchmarks. Here we show, across more than 100 model variants, that stable counting capacity remains far below advertised context limits. Model behavior is consistent neither with open-ended logic nor with stable application of a learned rule, but instead with use of a finite set of count-like internal states, analogous to counting on fingers. Once this resource is exhausted, the appearance of rule following disappears and exact execution collapses into guessing, even with additional test-time compute. These findings show that fluent performance in current language models does not guarantee general, reliable rule following.


[27] A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation cs.CL | cs.AIPDF

Jingheng Pan, Xintong Wang, Longyue Wang, Liang Ding, Weihua Luo

TL;DR: 本文提出了VIDA数据集,这是一个包含2500个精心策划实例的多模态数据集,用于解决机器翻译中的视觉依赖歧义问题。作者还设计了基于LLM作为评判者的消歧中心化评估指标,并在两种最先进的大规模视觉语言模型上进行了实验,结果表明链式思维监督微调(CoT-SFT)在消歧准确率上表现更优,尤其在分布外数据上展现出更强的泛化能力。

Details

Motivation: 针对现有多模态机器翻译(MMT)数据集中存在的质量问题和与翻译场景不匹配的局限性,以及现有评估方法难以覆盖开放式翻译中更广泛的歧义类型,本文旨在构建一个高质量、视觉依赖的歧义数据集并设计更有效的评估指标。

Result: 在两种SOTA大规模视觉语言模型(LVLMs)上的实验表明,监督微调(SFT)提升了整体翻译质量,而链式思维监督微调(CoT-SFT)在消歧准确率上取得了更一致的提升,尤其在分布外子集上表现更好,显示出对多样歧义类型更强的泛化能力。

Insight: 论文的创新点在于构建了高质量的视觉依赖歧义数据集VIDA,并提出了基于LLM作为评判者的细粒度消歧评估指标。从客观角度看,其提出的链式思维监督微调(CoT-SFT)方法为利用视觉信息解决翻译歧义提供了一种有效的训练范式,有助于模型更好地理解和利用多模态信息进行推理。

Abstract: Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks that provide supportive evidence for the role of vision, we observe substantial issues in data quality and a mismatch with translation scenarios. Moreover, existing ambiguity-oriented evaluations are not well suited to broader ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated ambiguous source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art Large Vision Language Models under vanilla inference, supervised fine-tuning (SFT), and our chain-of-thought SFT (CoT-SFT) show that while SFT improves overall translation quality, CoT-SFT yields more consistent gains in disambiguation accuracy, especially on out-of-distribution subsets, indicating a stronger generalization for resolving diverse ambiguity types.


[28] What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models cs.CL | cs.AIPDF

Ranit Karmakar, Jayita Chatterjee

TL;DR: 该论文对语言模型的可靠性进行了多变量审计,指出单一提示准确率作为主流评估方式存在局限性。研究通过分析15个开源模型在五个分类和推理基准上的表现,采用五种提示变体,测量了准确率、标记概率校准、口头置信度校准、口头解析率和提示扰动扩散等指标,揭示了评估设计、置信信号脆弱性和提示鲁棒性对可靠性结论的重要影响。

Details

Motivation: 解决单一提示准确率作为语言模型基准测试主导方式可能遗漏关键可靠性失败的问题,旨在通过多变量评估揭示模型可靠性的全面图景。

Result: 在ARC-Challenge基准上,将思维链提示与首字符评估器配对导致所有五个主要模型的表观准确率下降72-88%,但通过两个独立修复程序恢复了93.8%和102.7%的性能损失;在MMLU-Pro上,所有主要模型的口头置信度均显著高于其准确率和标记概率置信度;模型大小与提示扰动扩散的相关系数在-0.244到0.474之间波动。

Insight: 创新点在于提出多变量可靠性审计框架,强调评估管道(如校准定义、评估器逻辑、口头解析率和提示鲁棒性)对可靠性结论的关键影响,主张在做出可靠性声明时需明确报告这些因素。

Abstract: Single-prompt accuracy is the dominant way to benchmark language models, but it can miss reliability failures that matter. We evaluate a 15-model open-weight corpus, with the main reliability analyses focused on 10 instruct models across five classification and reasoning benchmarks under five prompt variants each, measuring accuracy, token-probability calibration, verbal-confidence calibration, verbal parse rate, and prompt-perturbation spread for every (model x dataset x variant) cell. We find three broad results. First, evaluation design can materially change the conclusion. Switching Expected Calibration Error (ECE) token from a raw to a label-set-normalised definition changes per-cell calibration by a mean absolute 0.149. More strikingly, pairing a chain-of-thought prompt with a first-character evaluator on ARC-Challenge reduces apparent accuracy by 72-88% across all five primary models; two independent repair procedures recover 93.8% and 102.7% of the lost performance, indicating an evaluator-side rather than model-side failure. Second, confidence signals are fragile. On MMLU-Pro, every primary model verbally reports confidence substantially above both its accuracy and its token-probability confidence on the same rows, and verbal parse rate can collapse for a single model on a single prompt variant. Third, prompt robustness does not track parameter count reliably. Across 10 instruct models, the correlation between model size and prompt-perturbation spread ranges from -0.244 to 0.474 across benchmarks. Taken together, these results show that reliability conclusions for small language models depend not only on the model being evaluated, but also on the evaluation pipeline used to measure it. We argue that calibration definitions, evaluator logic, verbal parseability, and prompt robustness should be reported explicitly when making reliability claims.


[29] Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning cs.CLPDF

Arash Ahmadi, Sarah Sharif, Yaser, Banad

TL;DR: 本文提出了一种搜索驱动的强化学习框架,通过优化奖励函数来增强大语言模型的数学推理能力。该方法将奖励函数本身作为优化对象,使用前沿语言模型生成候选奖励,通过自动验证和基于GSM8K测试集的GRPO训练进行筛选和排序,并利用多轮反馈循环迭代改进。最终,最佳奖励集成在GSM8K上取得了F1=0.795的显著提升。

Details

Motivation: 强化学习是提升大语言模型推理能力的标准后训练机制,但其性能对驱动策略优化的奖励函数设计非常敏感。在基础模型固定的情况下,奖励函数规范成为主要的设计杠杆,因此需要优化奖励函数本身以提高模型推理性能。

Result: 在GSM8K测试集上,经过五轮搜索,平均F1从第一轮的0.596提升至第五轮的0.632,最佳单个奖励达到F1=0.787。最佳奖励集成配置实现了F1=0.795(95%置信区间[0.756, 0.832]),准确率为0.660,相比仅使用基础奖励的GRPO基线(F1=0.609)获得了0.19的绝对F1增益,达到SOTA水平。

Insight: 论文的创新点在于将奖励函数规范本身作为优化对象,而非仅优化模型参数,并引入了搜索驱动的多轮反馈循环机制来自动生成和筛选奖励函数。从客观角度看,这种“元优化”方法为强化学习对齐提供了新思路,即通过自动化搜索来设计更有效的奖励信号,从而显著提升模型在特定任务(如数学推理)上的性能。

Abstract: Mathematical reasoning is a key benchmark for large language models. Reinforcement learning is a standard post-training mechanism for improving the reasoning capabilities of large language models, yet performance remains sensitive to the design of the reward function that drives policy optimization. This paper introduces a search-driven framework that treats the reward specification itself as an object of optimization. The setting of interest is one in which the base model is held fixed and the reward specification is the primary remaining design lever. Candidate reward functions are generated by a frontier language model, validated automatically, screened through 500-step Group Relative Policy Optimization (GRPO) training runs on a Llama-3.2-3B-Instruct base model with Low-Rank Adaptation (LoRA), and ranked by F1 on the GSM8K test set. Ranked summaries from prior rounds are then fed back into the next round of generation. Over five rounds, the search produces 50 candidate rewards. The mean F1 rises from 0.596 in Round 1 to 0.632 in Round 5, and the top individual reward reaches F1 = 0.787. Seven ensemble configurations of top-ranked rewards are evaluated. The best ensemble achieves F1 = 0.795 (95% bootstrap CI [0.756, 0.832]) and accuracy 0.660 [0.635, 0.686], a 0.19 absolute F1 gain over a base-rewards-only GRPO baseline (F1 = 0.609). Pairwise McNemar tests with Bonferroni correction show all five-or-more-reward configurations are statistically indistinguishable at α = 0.05/21. A three-seed re-training of the best ensemble yields F1 of 0.785. A randomly drawn 5-reward control collapses to F1 = 0.047, which shows that the ranked-feedback loop, not the additive signal of having more rewards, drives the gain.


[30] EditPropBench: Measuring Factual Edit Propagation in Scientific Manuscripts cs.CL | cs.AIPDF

Garvin Kruthof

TL;DR: 本文介绍了EditPropBench,一个用于评估大型语言模型(LLM)在科学手稿中传播事实性编辑的基准测试。该基准通过合成的手稿、目标编辑和受控事实图来测量编辑是否能在依赖的声明中正确传播,揭示了当前LLM编辑系统在隐式编辑传播上的局限性。

Details

Motivation: 解决科学手稿中局部事实编辑(如数据集规模变化)可能引发非局部修订义务的问题,现有LLM编辑系统缺乏对此类编辑传播能力的评估。

Result: 在困难的隐式/自由形式编辑层上,五个LLM编辑系统的编辑涟漪依从性(ERA)得分在0.148到0.705之间,即使最强的模型也遗漏了约30%的必要级联更新;在混合层压力测试中,LLM仍优于确定性替换基线。

Insight: 创新点在于提出了一个受控的手稿级基准,提供句子级依赖监督、多种编辑协议和对抗性度量探针;客观分析表明,该基准能系统评估LLM的编辑传播能力,强调了可靠科学修订需要级联感知检查。

Abstract: Local factual edits in scientific manuscripts often create non-local revision obligations. If a dataset changes from 215 to 80 documents, claims such as ‘medium-scale’ or ‘a few hundred items’ may also become stale, even though they do not repeat the edited number. We introduce EditPropBench, a benchmark for measuring whether LLM editors propagate factual edits through dependent manuscript claims. Each item contains an ML/NLP-style synthetic manuscript, a targeted edit, and a controlled fact graph with sentence-level labels for direct targets, required downstream updates, and protected unrelated text. EditPropBench provides a controlled manuscript-level benchmark with sentence-level dependency supervision, three editing protocols, adversarial metric probes, stress-test variants, and a metric suite centered on Edit-Ripple Adherence (ERA). On the hard implicit/free-form stratum, five LLM editing systems span ERA 0.148–0.705; even the strongest misses roughly 30% of required cascade updates. A mixed-stratum stress test shows that LLMs retain a positive advantage over deterministic substitution baselines when easy substitution-solvable cases are included. Finally, an audit of recent arXiv cs.CL benchmark and dataset papers finds fact-dependent qualitative claims in 37.2% of papers. EditPropBench shows that current LLM editors can repair many implicit consequences of factual edits, but reliable scientific revision still requires cascade-aware checking.


[31] ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring cs.CLPDF

Deyi Ji, Junyu Lu, Xuanyi Liu, Liqun Liu, Hailong Zhang

TL;DR: 本文提出了ARGUS系统,这是一个通过多智能体对抗性仲裁实现演化强化的策略自适应广告治理系统,旨在解决在线广告治理中因监管政策非平稳性导致的历史数据标签不一致和推理模糊性问题。

Details

Motivation: 在线广告治理面临监管政策非平稳性的重大挑战,新出现的政策指令(如对教育或审美焦虑的限制)会导致历史数据集中的标签严重不一致和推理模糊,需要一种能适应政策演变的治理方法。

Result: 在工业和公共数据集上的大量实验表明,ARGUS显著优于传统的微调基线方法,能够以极少的黄金数据实现卓越的策略自适应学习。

Insight: 创新点在于提出了一个三阶段框架(策略播种、对抗性标签校正、潜在知识发现)以及“检察官-辩护人-仲裁员”架构,并利用RAG增强的策略知识和思维链合成作为强化学习的动态奖励,使推理路径与不断演变的法规同步。

Abstract: Online advertising governance faces significant challenges due to the non-stationary nature of regulatory policies, where emerging mandates (e.g., restrictions on education or aesthetic anxiety) create severe label inconsistencies and reasoning ambiguities in historical datasets. In this paper, we propose ARGUS, a policy-adaptive governance system that enables evolving reinforcement through multi-agent adversarial umpiring. ARGUS addresses the sparsity of new policy data by employing a three-stage framework: (1) Policy Seeding for initial perception; (2) Adversarial Label Rectification, which utilizes a Prosecutor-Defender-Umpire'' architecture to resolve conflicts between stale labels and new mandates; and (3) Latent Knowledge Discovery, which employs a tripartite dialectical discussion to unearth sophisticated, gray-area’’ violations. By leveraging RAG-enhanced policy knowledge and Chain-of-Thought synthesis as dynamic rewards for reinforcement learning, ARGUS synchronizes its reasoning pathways with evolving regulations. Extensive experiments on both industrial and public datasets demonstrate that ARGUS significantly outperforms traditional fine-tuning baselines, achieving superior policy-adaptive learning with minimal gold data.


[32] Compositional Multi-hop Factual Error Correction via Decomposition-and-Injection cs.CLPDF

Lei Zhu, Xiaobao Wang, Jianbiao Yang, Chenyang Wang, Dongxiao He

TL;DR: 本文提出了CECoR框架,一种用于组合式多跳事实错误校正的推理感知方法。该方法通过分解-注入范式,将多跳声明分解为可解释的推理步骤,并注入受控扰动以合成高质量训练对,结合监督微调和强化学习的两阶段策略提升事实准确性和鲁棒性。

Details

Motivation: 现有事实错误校正方法在处理单跳校正时表现良好,但将声明视为原子单元,难以应对需要跨多个证据源进行组合推理的多跳情况,且受限于配对数据不足和复杂推理链中语义错误定位困难。

Result: 综合评估表明,CECoR在多跳基准测试中实现了强劲性能,优于远程监督方法和少样本LLM基线,并能有效泛化到单跳校正且在噪声证据下保持稳定。

Insight: 创新点在于引入了分解与注入的范式,将多跳声明结构化分解并合成训练数据,以及结合监督学习与强化学习的两阶段策略,提升了模型对组合推理和噪声的鲁棒性,为实际应用提供了通用性。

Abstract: Factual Error Correction (FEC) aims to revise inaccurate text into statements that are factually consistent with external evidence. Although recent methods perform well on single-hop correction, they often treat claims as atomic units and struggle with multi-hop cases that require compositional reasoning across multiple evidence sources. This challenge is further amplified by limited paired data and difficulties in locating semantic errors within complex reasoning chains. We present CECoR (Compositional Error Correction via Reasoning-aware Synthesis), a reasoning-aware framework that introduces a Decomposition and Injection paradigm for compositional error correction. CECoR decomposes multi-hop claims into interpretable reasoning steps and injects controlled perturbations to synthesize high-quality training pairs. A two-stage learning strategy combining supervised fine-tuning and reinforcement learning improves factual accuracy and robustness. Comprehensive evaluations show that CECoR achieves strong performance on multi-hop benchmarks, outperforming both distantly supervised methods and few-shot LLM baselines. It also generalizes effectively to single-hop correction and remains stable under noisy evidence, demonstrating its versatility for real-world factual correction.


[33] MolViBench: Evaluating LLMs on Molecular Vibe Coding cs.CLPDF

Jiatong Li, Yuxuan Ren, Weida Wang, Changmeng Zheng, Xiao-yong Wei

TL;DR: 本文介绍了MolViBench,这是首个专门为分子氛围编码设计的基准测试,用于评估大语言模型在化学领域生成可执行代码的能力。该基准包含358个任务,涵盖五个认知层次和12个真实药物发现工作流程,并提出了一个结合类型感知输出比较和基于AST的API语义回退分析的多层评估框架。

Details

Motivation: 现有基准测试在评估大语言模型时存在脱节:通用代码生成基准缺乏化学知识要求,而化学基准侧重于知识回忆或性质预测,而非可执行代码生成。因此,需要一个新的基准来填补这一空白,专门评估大语言模型在分子氛围编码中所需的编程、分子理解和领域特定推理能力。

Result: 研究系统评估了9个前沿编码大语言模型,并比较了三种真实世界的分子氛围编码范式,为诊断大语言模型在AI加速分子发现中的编码能力提供了一个实用且细粒度的测试平台。

Insight: 创新点在于首次提出了专门针对分子氛围编码的基准测试MolViBench,并设计了一个多层评估框架,结合了类型感知输出比较和AST-based API语义回退分析,以联合衡量代码的可执行性和化学正确性,为化学与AI交叉领域提供了更全面的评估工具。

Abstract: Molecular Vibe Coding, a paradigm where chemists interact with LLMs to generate executable programs for molecular tasks, has emerged as a flexible alternative to chemical agents with predefined tools, enabling chemists to express arbitrarily complex, customized workflows. Unlike general coding tasks, molecular coding imposes a distinctive challenge that LLMs should jointly equip programming, molecular understanding, and domain-specific reasoning capabilities. However, existing benchmarks remain disconnected. General code generation benchmarks such as HumanEval and SWE-bench require no chemistry knowledge, while chemistry-focused benchmarks such as S^2-Bench and ChemCoTBench evaluate knowledge recall or property prediction rather than executable code generation. To bridge this gap, we introduce MolViBench, the first benchmark tailored for Molecular Vibe Coding. MolViBench comprises 358 curated tasks across five cognitive levels, ranging from single-API recall to end-to-end virtual screening pipeline design, spanning 12 real-world drug discovery workflows. To rigorously assess generated code, we also propose a multi-layered evaluation framework that combines type-aware output comparison and AST-based API-semantic fallback analysis, which jointly measures executability and chemical correctness. We systematically evaluate 9 frontier coding LLMs and compare three real-world Molecular Vibe Coding paradigms, providing a practical and fine-grained testbed for diagnosing LLMs’ coding capabilities in AI-accelerated molecular discovery.


[34] Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval cs.CL | cs.AI | cs.IRPDF

Valentin Knappich, Anna Hätty, Simon Razniewski, Annemarie Friedrich

TL;DR: 本文提出了FiNE-Patents数据集,将专利新颖性评估从权利要求级别的二元分类任务,转变为特征级别的联合检索与抽象推理任务,并基于LLM的工作流程在该任务上超越了基于嵌入的基线方法。

Details

Motivation: 现有工作将专利新颖性预测视为权利要求级别的二元分类任务,这容易受到虚假相关性的影响,且缺乏实际应用所需的细粒度。

Result: 实验表明,基于LLM的工作流程在段落检索和新颖特征识别任务上优于基于嵌入的基线方法,并且在权利要求级别的新颖性分类任务中对虚假相关性具有鲁棒性。

Insight: 核心创新在于提出了一个细粒度的、特征级别的专利新颖性评估新范式(联合检索与抽象推理),并构建了相应的标注数据集,这为透明和细粒度的专利分析研究提供了新的方向和数据基础。

Abstract: Novelty assessment is a critical yet complex task in the examination process for patent acceptance, requiring examiners to determine whether an invention is disclosed in a prior art document. The process involves intricate matching between specific features of a patent claim and passages in the prior art. While prior work has approached novelty prediction primarily as a binary classification task at the claim level, we argue that this formulation is susceptible to spurious correlations and lacks the granularity required for practical application. In this work, we introduce FiNE-Patents (Fine-grained Novelty Examination of Patents), a novel dataset comprising 3,658 first patent claims annotated with fine-grained, feature-level prior art references extracted from European Search Opinion (ESOP) documents. We propose shifting the evaluation paradigm from simple binary classification to a joint retrieval and abstract reasoning task at the feature level, requiring models to identify specific passages from a prior art document that disclose individual claim features, and to identify which features of a claim make it novel. We implement and evaluate LLM-based workflows that decompose claims into features, analyze each feature against prior art, and finally derive a claim-level novelty prediction. Our experiments demonstrate that these workflows outperform embedding-based baselines on passage retrieval and novel feature identification. Furthermore, we show that unlike trained classifiers, LLMs are robust against spurious correlations present in the claim-level novelty classification task. We release the dataset and code to foster further research into transparent and granular patent analysis.


[35] PC-MNet: Dual-Level Congruity Modeling for Multimodal Sarcasm Detection via Polarity-Modulated Attention cs.CL | cs.AIPDF

Maoheng Li, Ling Zhou, Xiaohua Huang, Rubing Huang, Wenming Zheng

TL;DR: PC-MNet提出了一种通过极性调制注意力进行双层次一致性建模的多模态讽刺检测方法,旨在精确识别文本字面含义与非语言线索之间的语用不一致性。该方法通过引入标量一致性路由机制和先验引导的上下文图,采用两阶段非对称优化和基于不一致性的对比学习,选择性地融合最具区分性的多粒度证据,从而有效建模人类交流中的微妙语用不一致。

Details

Motivation: 当前多模态讽刺检测方法主要依赖基于朴素相似性的注意力机制和统一的后期融合策略,且传统后期融合存在功能纠缠问题,限制了模型对语用不一致性的精确建模能力。

Result: 在MUStARD基准测试及其缓解虚假相关性的平衡数据集上进行的大量实验表明,该方法实现了新的最先进性能,在Macro-F1指标上比最强的多模态基线显著提升了3.14%。

Insight: 创新点在于通过标量一致性路由机制和先验引导的上下文图,结合两阶段非对称优化和对比学习,实现了对原子、组合和上下文冲突的架构性隔离,为建模人类交流中的微妙语用不一致性提供了一个鲁棒且解耦的范式。

Abstract: Multimodal sarcasm detection, which aims to precisely identify pragmatic incongruities between literal text and nonverbal cues, has gained substantial attention in multimodal understanding. Recent advancements have predominantly relied on naïve similarity-based attention mechanisms and uniform late fusion strategies.Furthermore, given that functional entanglement restricts traditional late fusions, we incorporate a scalar congruity routing mechanism and a prior-guided contextual graph. This mechanism anchors a generalized incongruity manifold through a two-stage asymmetric optimization driven by inconsistency-aware contrastive learning, selectively fusing only the most discriminative multi-granularity evidence. Extensive experiments on the \texttt{MUStARD} benchmark and its spurious-correlation-mitigated balanced datasets demonstrate that our approach achieves new state-of-the-art performance, surpassing the strongest multimodal baseline by a substantial 3.14% improvement in Macro-F1. By architecturally isolating atomic, composition, and contextual conflicts. This work provides a robust, decoupled paradigm for modeling subtle pragmatic incongruities in human communication.


Stanisław Sójka, Witold Kowalczyk

TL;DR: 本文提出了一种名为’摊销智能’的神经符号方法,用于大规模、准确的法律推理。该方法利用大语言模型一次性将法律文本转换为一种确定性的、类型化的图中间表示——确定性自主合约语言。后续的裁决过程则依赖于确定性的图执行,并生成可视化可审计的追踪记录。

Details

Motivation: 解决法律文本中计算性法律条款的理解难题,克服前沿大推理模型在生产系统中存在的推理错误、高推理成本以及缺乏严格可审计性的限制。

Result: 与基于运行时大推理模型的基线相比,基于DACL的智能体实现了近乎完美的推理一致性,缓解了概率模型中的’推理悬崖’问题。在高吞吐量工作流中,系统计算成本降低了90%以上,同时满足了法律裁决的严格可审计性要求。

Insight: 核心创新在于将一次性神经翻译与确定性符号执行解耦的’神经符号卸载’架构,以及通过类型化图表示和可视化追踪实现的结构化可审计性,为构建鲁棒、高效、可信的生产级法律推理系统提供了新范式。

Abstract: Legal texts often contain computational legal clauses–provisions whose understanding requires complex logic. While frontier Large Reasoning Models (LRMs) can describe such clauses, building production-ready systems is limited by reasoning errors and the high cost of inference. We propose Amortized Intelligence, a neuro-symbolic approach where we use an LLM once to translate a legal text into Deterministic Autonomous Contract Language (DACL): a typed graph intermediate representation. Adjudication then relies on deterministic graph executions with a visually auditable trace. In comparison against runtime LRM baselines (including GPT-5.2 and Gemini 3 Pro), our DACL-based Agent achieves near-perfect consistency and mitigates the “reasoning cliff” observed in probabilistic models. The system reduces compute costs by over 90% in high-volume workflows while satisfying the strict auditability requirements of legal adjudication.


[37] Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces cs.CLPDF

Chenchen Zhang

TL;DR: 本文提出通过编排轨迹(orchestration traces)——即包含子智能体生成、委托、通信、工具使用、返回、聚合和停止决策等事件的时序交互图——来研究基于大语言模型(LLM)的多智能体系统的强化学习(RL)。论文系统分析了奖励设计、信用分配和学习分解三个技术轴,并揭示了当前研究中停止决策RL训练的缺失。

Details

Motivation: 随着LLM智能体从孤立工具使用者发展为协同团队,强化学习需要优化不仅是个体行动,还包括工作如何生成、委托、通信、聚合和停止等编排过程。

Result: 论文未报告具体定量结果,但通过梳理截至2026年5月4日的84篇文献池,发现当前公开学术评估中缺乏针对停止决策的显式RL训练方法,并与工业系统(如Kimi Agent Swarm、OpenAI Codex、Anthropic Claude Code)的部署证据进行了对比。

Insight: 创新点在于提出了编排轨迹作为多智能体RL的分析框架,并系统划分了奖励设计(包括并行加速、拆分正确性、聚合质量等八类)、信用分配单元(从令牌到团队八个层级)以及编排学习的五个子决策(生成、委托、通信、聚合、停止),其中停止决策的RL训练是当前研究空白。

Abstract: As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals attach to eight credit- or signal-bearing units from token to team; explicit counterfactual message-level credit remains especially sparse in our curated pool. Third, orchestration learning decomposes into five sub-decisions: when to spawn, whom to delegate to, how to communicate, how to aggregate, and when to stop. In our curated pool as of May 4, 2026, we found no explicit RL training method for the stopping decision. We connect academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code. The resulting scale gap is a gap between publicly reported deployment envelopes and open academic evaluation regimes, not independent verification of industrial training traces. We release the artifact at https://github.com/xxzcc/awesome-llm-mas-rl, including an 84-entry tagged paper pool, a 32-record exclusion log, scripted corpus statistics, and a minimal JSON schema for replayable orchestration traces.


[38] FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents cs.CLPDF

Quang Hieu Pham, Yang He, Ping Nie, Canwen Xu, Davood Rafiei

TL;DR: FlexSQL是一种灵活的文本到SQL代理,其核心设计原则是支持在推理过程中随时探索数据库模式、检查数据值和运行验证查询。它通过生成多样化的执行计划来覆盖多种查询解释,并根据任务选择SQL或Python实现,采用双层修复机制从代码级错误回溯到计划级修订。

Details

Motivation: 当前大多数文本到SQL系统采用固定管道,即预先一次性检索模式元素,仅在事后修复时重新访问数据库,这限制了从早期错误中恢复的能力。FlexSQL旨在通过灵活的数据库交互来解决这一问题,以应对大型分析数据库中的复杂模式、模糊查询和基于实际数据的决策需求。

Result: 在Spider2-Snow基准测试上,使用gpt-oss-120b模型,FlexSQL取得了65.4%的分数,优于使用更强、更大模型(如gpt-o3和DeepSeek-R1)的强开源基线。当集成到通用编码代理(如Claude Code中的技能)时,该方法在Spider2-Snow上实现了超过10%的相对改进。

Insight: 创新点在于将灵活性作为核心设计原则,允许代理在推理过程中动态探索和执行,包括生成多样化执行计划、根据任务选择实现语言(SQL或Python),以及双层修复机制。客观分析表明,灵活的探索和灵活的执行为方法的有效性提供了关键贡献,这为文本到SQL代理的设计提供了新的方向。

Abstract: Text-to-SQL over large analytical databases requires navigating complex schemas, resolving ambiguous queries, and grounding decisions in actual data. Most current systems follow a fixed pipeline where schema elements are retrieved once upfront and the database is only revisited for post-hoc repair, limiting recovery from early mistakes. We present FlexSQL, a text-to-SQL agent whose core design principle is flexible database interaction: the agent can explore schema structure, inspect data values, and run verification queries at any point during reasoning. FlexSQL generates diverse execution plans to cover multiple query interpretations, implements each plan in either SQL or Python depending on the task, and uses a two-tiered repair mechanism that can backtrack from code-level errors to plan-level revisions. On Spider2-Snow, using gpt-oss-120b, FlexSQL achieves a 65.4% score, outperforming strong open-source baselines that use stronger, larger models such as gpt-o3 and DeepSeek-R1. When integrated into a general-purpose coding agent (as skills in Claude Code), our approach yields over 10% relative improvement on Spider2-Snow. Further analysis shows that flexible exploration and flexible execution jointly contribute to the effectiveness of our approach, highlighting flexibility as a key design principle. Our code is available at: https://github.com/StringNLPLAB/FlexSQL


cs.CV [Back]

[39] Synthetic Designed Experiments for Diagnosing Vision Model Failure cs.CV | cs.LGPDF

Krisanu Sarkar

TL;DR: 本文提出了一种名为SDRS的新方法,用于诊断视觉模型的失败原因。该方法借鉴实验设计(DoE)的统计理论,将合成数据生成器视为实验装置,通过析因设计高效地审计模型对不同场景因素的敏感性,并将失败归类为两种可操作的差距类型,进而生成有针对性的合成数据进行纠正。

Details

Motivation: 当前合成数据生成方法只是随机采样,无法诊断下游模型的具体需求,未能有效利用合成数据可控、可独立变化场景因素的独特优势。本文旨在解决这一根本性误用,为模型失败提供可诊断、可操作的解决方案。

Result: 在三个实验中验证了SDRS的有效性:1)在植入偏见的dSprites数据集上,审计正确识别了两种差距类型,针对性数据将准确率从49.9%提升至79.0%;2)在程序化场景的密集分割任务中,检测到背景复杂性捷径后应用针对性数据,将mIoU从0.948提升至0.998;3)在纠缠检测实验中,ANOVA审计识别了不完美生成器中的跨因素污染。

Insight: 核心创新在于将实验设计(DoE)的统计思想系统性地引入合成数据生成与模型诊断流程,提出了一个可操作的审计框架(SDRS),能够将模型失败归因于具体的场景因素(如覆盖不足或虚假依赖),并据此生成针对性数据。这为理解模型行为、超越简单的数据增强提供了新的方法论。同时,研究也揭示了基于表示的不变性惩罚可能导致敏感性在因素间转移,这是一个有待解决的开放性问题。

Abstract: Current synthetic data pipelines for computer vision generate images without diagnosing what the downstream model actually needs. This open-loop paradigm treats synthetic data as cheap real data, randomly sampling the generator’s output space and hoping to cover the model’s failure modes. We argue this fundamentally misuses synthetic data’s unique property: the controllable, independent variation of scene factors.Drawing on the statistical theory of Design of Experiments (DoE), we propose Synthetic Designed Experiments for Representational Sufficiency (SDRS). SDRS treats the downstream model as a black-box system and the synthetic generator as an experimental apparatus. Using fractional factorial designs, SDRS efficiently audits a model’s factor-sensitivity profile via ANOVA decomposition. It classifies failures into two actionable types: Type I gaps (coverage failures on underrepresented factor levels) and Type II gaps (reliance on spurious nuisance dependencies). The audit then prescribes targeted synthetic data to address each gap type. We validate SDRS on three experiments: (1) a controlled diagnostic on dSprites with planted biases, where the audit correctly identifies both gap types and targeted data improves accuracy from 49.9% to 79.0%; (2) a dense segmentation task on procedural scenes, where detecting background-complexity shortcuts and applying targeted data improves mIoU from 0.948 to 0.998; and (3) an entanglement detection experiment showing that the ANOVA audit identifies cross-factor contamination in imperfect generators. Finally, we show that per-factor invariance penalties can transfer sensitivity between factors, identifying an open problem for representation-level correction.


[40] Latent Space Probing for Adult Content Detection in Video Generative Models cs.CV | cs.AI | cs.LG | cs.MMPDF

Alizishaan Khatri, Chiquita Prabhu

TL;DR: 本文提出了一种新颖的潜在空间探测框架,用于视频生成模型中的成人内容检测。该方法通过拦截CogVideoX视频扩散模型在推理过程中产生的去噪潜在表示,并附加轻量级分类器来实现实时检测。研究构建了一个包含11039个视频片段的大规模二元数据集,并引入了两种轻量级探测分类器架构。实验表明,该方法在保持低延迟(4-6毫秒)的同时,在测试集上达到了97.29%的F1分数。

Details

Motivation: 现有基于提示词或解码后像素空间的检测方法无法利用生成过程中形成的丰富内部表示,导致内容审核面临挑战。本文旨在通过探测视频扩散模型的潜在空间来更有效地检测成人内容。

Result: 在构建的大规模数据集上,所提方法在留出测试集上达到了97.29%的F1分数,推理开销在4-6毫秒范围内,表明其在检测性能和成本上均有提升。

Insight: 核心创新点在于首次提出并验证了在视频扩散模型的推理过程中,直接探测其去噪潜在空间表示对于有害内容检测的有效性。这为内容审核提供了一种新的、高效且低延迟的解决方案,利用了生成模型内部表征的判别性特征。

Abstract: The rapid proliferation of AI-powered video generation systems has introduced significant challenges in content moderation, particularly with respect to adult and sexually explicit material. Existing detection methods operate on either prompts or decoded pixel-space outputs. Therefore, both approaches are blind to the rich internal representations formed during generation. In this paper, we propose a novel latent space probing framework that intercepts the denoised latent representations produced by the CogVideoX video diffusion model during inference and attaches lightweight classifiers to perform real-time adult content detection. To support this work, we construct a large-scale binary dataset of 11039 ten-second video clips (5086 violating, 5953 non-violating) sourced from adult websites and YouTube respectively. We introduce two lightweight probing classifier architectures. We train and evaluate it on the dataset. Our work demonstrates that latent-space signals encode strong discriminative features for harmful content detection, achieving 97.29% F1 on our held-out test set with an overhead in the 4-6ms range. Our results suggest that probing the latent space results in improvements in both detection performance as well as cost.


[41] Visual Chart Representations for Cryptocurrency Regime Prediction: A Systematic Deep Learning Study cs.CV | cs.AIPDF

Dustin M. Haggett

TL;DR: 本文系统研究了不同视觉表示方法在加密货币市场状态预测中的应用,通过对比多种图像编码方法、图表组件配置、神经网络架构以及ImageNet迁移学习的影响,发现简单的原始蜡烛图配合4层CNN在比特币、以太坊和标普500数据上取得了最佳性能(AUC-ROC 0.892),且简化表示(仅价格图表、128x128分辨率)优于复杂替代方案。

Details

Motivation: 解决深度学习在金融图表图像分析中应用不足的问题,探索视觉表示对加密货币市场状态预测的有效性。

Result: 在2018-2024年比特币、以太坊和标普500数据上的八个控制实验表明,原始蜡烛图配合4层CNN达到0.892 AUC-ROC,优于大型预训练模型;迁移学习提升性能4-16%。

Insight: 创新点在于系统比较视觉表示方法,揭示简单表示和轻量模型在金融图表分类中的优越性;客观分析表明,领域差距下迁移学习仍有效,且GradCAM可解释性分析增强了方法可信度。

Abstract: Technical traders have long relied on visual analysis of candlestick charts to identify market patterns and predict price movements. While deep learning has achieved remarkable success in image classification, its application to financial chart images remains underexplored. This paper presents a systematic study comparing different visual representations for cryptocurrency regime prediction. We evaluate three image encoding methods (raw candlestick charts, Gramian Angular Fields, and multi-channel GAF), five chart component configurations, four neural network architectures (CNN, ResNet18, EfficientNet-B0, and Vision Transformer), and the impact of ImageNet transfer learning. Through eight controlled experiments on Bitcoin, Ethereum, and S&P 500 data spanning 2018-2024, we identify optimal configurations for visual regime classification. Our results show that a simple 4-layer CNN on raw candlestick charts achieves 0.892 AUC-ROC, outperforming larger pretrained models. Surprisingly, simpler representations (price-only charts, 128x128 resolution) consistently outperform more complex alternatives. We provide interpretability analysis using GradCAM and demonstrate that transfer learning improves performance by 4-16% despite the domain gap between natural images and financial charts.


[42] Adversarial Flow Matching for Imperceptible Attacks on End-to-End Autonomous Driving cs.CV | cs.AIPDF

Xinyu Zeng, Xiangkun He, Lei Tao, Chen Lv, Hong Cheng

TL;DR: 该论文提出了一种名为Adversarial Flow Matching (AFM) 的新型灰盒攻击框架,旨在针对依赖Transformer骨干网络的端到端自动驾驶模型生成视觉上难以察觉的对抗性样本。AFM通过扰动生成式潜在空间和神经平均速度场,实现了高效的一步式对抗样本生成,并在攻击有效性和视觉不可感知性之间取得了优越的平衡。

Details

Motivation: 端到端自动驾驶模型(无论是单体模型还是模块化架构)日益依赖Transformer骨干网络进行复杂推理,这可能带来共同的安全漏洞:视觉上难以察觉的扰动可以通过针对Transformer模块来操纵模型做出危险驾驶行为。现有对抗攻击方法通常需要完全模型透明度(白盒),或存在查询延迟高、攻击可迁移性有限(黑盒)的问题。

Result: 大量实验表明,AFM在攻击有效性和视觉不可感知性之间取得了优越的权衡。与基线方法相比,AFM显著降低了Vision-Language-Action (VLA) 和模块化自动驾驶智能体在各种场景下的性能,同时保持了最先进的视觉不可感知性。此外,AFM生成的对抗样本展现出强大的跨模型可迁移性。

Insight: 论文的创新点在于提出了一个灰盒攻击框架AFM,它仅需目标模型包含基于Transformer模块的先验知识,通过利用Transformer的结构性漏洞,结合生成式潜在空间和神经平均速度场的协同扰动,实现了高效、有效且视觉上难以察觉的攻击。这揭示了依赖Transformer的端到端自动驾驶模型可能存在共同的结构性安全弱点,为模型鲁棒性评估和防御提供了新的视角。

Abstract: Autonomous driving (AD) is evolving towards end-to-end (E2E) frameworks through two primary paradigms: monolithic models exemplified by Vision-Language-Action (VLA), and specialized modular architectures. Despite their divergent designs, both paradigms increasingly rely on Transformer backbones for complex reasoning, potentially causing a shared vulnerability: visually imperceptible perturbations can manipulate E2E AD models into hazardous maneuvers by targeting the Transformer module. Most existing adversarial attack approaches against AD systems operate under white-box or black-box settings; yet, they typically necessitate full model transparency, or suffer from either prohibitive query latency or limited attack transferability. In this paper, we propose Adversarial Flow Matching (AFM), a novel gray-box attack framework that exploits Transformer structural vulnerabilities in E2E AD models. AFM enables efficient one-step generation of adversarial examples via a neural average velocity field. Additionally, the proposed technique yields effective and visually imperceptible attacks by synergistically perturbing the generative latent space and the neural average velocity field. Extensive experiments demonstrate that AFM achieves a superior trade-off between attack effectiveness and imperceptibility: it substantially degrades the performance of both VLA and modular AD agents across various scenarios compared to baselines, while maintaining state-of-the-art visual imperceptibility. Furthermore, adversarial examples generated by AFM exhibit robust cross-model transferability, indicating that AFM closely approximates a black-box attack setting while requiring only the prior knowledge that the target AD model incorporates a Transformer-based module.


[43] Intervention-Based Self-Supervised Learning: A Causal Probe Paradigm for Remote Photoplethysmography cs.CVPDF

Zhiyi Niu, Xiaoguang Tu, Bo Zhao, Junzhe Cao, Dan Guo

TL;DR: 本文提出了一种名为生理因果探测(PCP)的新型自监督学习范式及其实现框架Interv-rPPG,用于解决远程光电容积描记(rPPG)任务中现有自监督学习方法易陷入相关性陷阱、学习到运动或光照噪声而非真实微弱生理信号的问题。该范式通过主动、精确的干预来验证rPPG假设的物理真实性,从而提升模型泛化能力。

Details

Motivation: 现有rPPG自监督学习方法容易陷入相关性陷阱,倾向于学习数据中最主要的周期性信号(如高能量的运动或光照噪声),而不是微弱的真实rPPG信号,导致模型泛化能力差。

Result: 在VIPL-HR和MMPD等具有挑战性的数据集上,所提方法提升了域内和跨域性能。在复杂的跨数据集设置中,其表现甚至超过了有监督基线;在干净的、干预机制可能引入轻微残留色度噪声的数据集上,其性能也保持竞争力。广泛的实验(包括对干扰因素敏感性的诊断分析)表明,PCP范式能有效抵抗运动和光照伪影。

Insight: 核心创新点是将学习范式从被动的相关性学习转向主动、精确的干预验证,提出了生理因果探测(PCP)范式。具体实现上,通过提出的rPPG假设对视频进行干预,并利用‘可证伪性归零’和‘公理等变性’来验证假设的物理真实性,其可控生理信号编辑器通过对视频低频色度分量进行干预来实现对rPPG信号的精确编辑。

Abstract: Remote Photoplethysmography (rPPG) enables convenient non-contact physiological measurement. Existing Self-Supervised Learning (SSL) methods commonly fall into a correlation trap: they tend to learn the most dominant periodic signals in the data, such as high-energy motion or illumination noise, rather than the faint, true rPPG signal, leading to poor model generalization. To address this, we propose a new SSL paradigm, Physiological Causal Probing (PCP), which treats the latent rPPG signal as the underlying physical source and the resulting pixel chrominance variations as its visual manifestation. Its core idea is to shift from passive correlation learning to active, precise intervention: it intervenes on the video based on a proposed rPPG hypothesis, and verifies whether the post-intervention changes match physical expectations. We propose the Interv-rPPG framework to implement PCP: an rPPG extractor named PhysMambaFormer hypothesizes the rPPG signal, while a Controllable Physiological Signal Editor conducts precise chrominance-domain interventions on videos based on this hypothesis. Interv-rPPG validates the physical realism of the hypothesis through Falsifiability via Nulling' and Axiomatic Equivariance’. Our editor achieves precise editing of the rPPG signal by intervening in the low-frequency chrominance components of the video. Our method improves both in-domain and cross-domain performance on challenging datasets such as VIPL-HR and MMPD. Furthermore, it surpasses the supervised baseline in complex cross-dataset settings, while remaining competitive on clean datasets where the intervention mechanism may introduce slight residual chrominance noise. Extensive experiments, including diagnostic analysis of nuisance sensitivity, demonstrate that the PCP paradigm effectively resists motion and illumination artifacts.


[44] LiteVLA-H: Dual-Rate Vision-Language-Action Inference for Onboard Aerial Guidance and Semantic Perception cs.CVPDF

Justn williams, Kishor Datta Gupta, Roy George, Mrinmoy Sarkar

TL;DR: 本文提出了LiteVLA-H,一个专为无人机机载部署设计的256M参数视觉-语言-动作(VLA)模型。其核心创新在于支持双速率推理:一个快速的外环制导模式用于低延迟动作输出,一个较慢的语义模式用于场景理解与描述。通过在NVIDIA Jetson AGX Orin嵌入式平台上实现,模型能以约19.74Hz的频率输出动作令牌,同时以约6-7Hz的频率提供语义输出。

Details

Motivation: 现有VLA模型在操作任务中表现出良好的语义理解和任务泛化能力,但难以部署在无人机上,因为无人机需要在严格的机载计算和通信限制下进行低延迟的闭环制导。

Result: 在NVIDIA Jetson AGX Orin平台上,LiteVLA-H的动作分支推理延迟为50.65ms(19.74Hz),语义分支延迟为149.90-164.57ms(6.08-6.67Hz)。与AnywhereVLA、FutureVLA、ReMem-VLA等最新SOTA架构相比,在相同部署条件下,其动作分支达到了更高的边缘推理速率,同时保持了周期性的语义感知能力。

Insight: 主要创新点在于:1)针对紧凑边缘计算场景,观察到端到端延迟主要由多模态预填充主导,而非解码额外令牌的边际成本,从而设计了双速率调度器。2)采用了一种知识保留的微调方法,混合了反应式飞行数据、航空语义数据以及通用描述/VQA监督,使模型在保持描述能力的同时适应特定任务。

Abstract: Vision-language-action (VLA) models have shown strong semantic grounding and task generalization in manipulation, but aerial deployment remains difficult because drones require low-latency closed-loop guidance under strict onboard compute and communication constraints. We present LiteVLA-H, a compact 256M-parameter VLA system designed for dual-rate operation on an NVIDIA Jetson AGX Orin: a fast outer-loop guidance mode for short action-token outputs and a slower semantic mode for scene understanding, hazard description, and operator-facing narration. The central empirical observation is that, in this compact edge regime, end-to-end latency is dominated by multimodal pre-fill rather than by the marginal cost of decoding a few extra tokens. This motivates a scheduler that issues reactive action tokens at 50.65,ms (19.74,Hz) while still supporting sentence-level semantic outputs at 149.90–164.57\ms (6.08–6.67,Hz) on the same embedded platform. To specialize the model without collapsing its descriptive competence, we use a knowledge-preserving fine-tuning recipe that mixes reactive flight data, aerial semantic data, and generic caption/VQA supervision. Beyond reporting current latency measurements, we position the system against recent state-of-the-art architectures, including AnywhereVLA, FutureVLA, and ReMem-VLA, showing that the measured action branch reaches a higher edge inference rate under our deployment conditions while retaining periodic semantic awareness.


[45] X2SAM: Any Segmentation in Images and Videos cs.CV | cs.AIPDF

Hao Wang, Limeng Qiao, Chi Zhang, Lin Ma, Guanglu Wan

TL;DR: 本文提出X2SAM,一个统一的分割多模态大语言模型,旨在将任意分割能力从图像扩展到视频。它通过结合LLM与掩码记忆模块,能够根据对话指令和视觉提示,在图像和视频中执行多种分割任务,并保持时间一致的视频掩码生成。

Details

Motivation: 现有MLLMs在像素级感知(尤其是视频)方面有限,而基础分割模型(如SAM系列)无法理解复杂对话指令。现有分割MLLMs通常专用于图像或视频,且很少同时支持文本和视觉提示。X2SAM旨在填补这一空白,实现图像与视频的统一分割。

Result: X2SAM在提出的视频视觉接地(V-VGD)分割基准上表现出强大的视频分割性能,在图像分割基准上保持竞争力,并保留了通用的图像和视频对话能力。

Insight: 创新点包括统一的图像-视频分割MLLM架构、掩码记忆模块以实现时间一致的视频掩码生成,以及支持多种分割任务(如开放词汇、指代、推理、交互式分割)的单一接口。客观来看,其联合训练策略和V-VGD基准的引入对推动视频分割研究具有价值。

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.


[46] Retrieval-Guided Generation for Safer Histopathology Image Captioning cs.CV | cs.AI | cs.IRPDF

Md. Enamul Hoq, Wataru Uegami, Saghir Alfasly, Ghazal Alabtah, Sahar Rahimi Malakshan

TL;DR: 本文提出了一种检索引导生成(RGG)方法,用于生成更安全、更可靠的病理学图像描述。该方法通过检索视觉上相似的病例的专家文本并加以总结来形成描述,而不是从头生成。在ARCH病理学数据集上的实验表明,该方法在语义对齐、术语准确性和减少错误诊断方面优于完全生成式模型(如MedGemma)。

Details

Motivation: 解决生成式视觉语言模型在生成医学图像描述时存在的幻觉、过度具体的诊断断言和事实不一致等严重问题,尤其是在病理学领域,这些错误可能导致严重后果。

Result: 在ARCH病理学数据集上,RGG方法将描述与真实标注的余弦相似度从MedGemma的约0.47提升至约0.60,置信区间不重叠,表明提升显著。病理学家定性评估也显示其能更好地保留形态学相关术语并减少无支持的诊断。

Insight: 核心创新点在于采用检索引导而非完全生成的范式来构建描述,这提高了透明度和可靠性,便于审计。从客观角度看,该方法通过利用已有专家知识库,有效缓解了生成模型在专业领域的事实性错误问题,为高风险领域的AI应用提供了一种更安全的替代方案。

Abstract: Generative vision-language models can produce fluent medical image captions but remain prone to hallucination, over-specific diagnostic claims, and factual inconsistency-serious issues in pathology. We investigate retrieval-guided generation (RGG) as a safer alternative, where captions are formed by summarizing expert text from visually similar cases rather than generated de novo. On the ARCH histopathology dataset, RGG improves semantic alignment with ground truth, achieving cosine similarity of $\approx$0.60 versus $\approx$0.47 from MedGemma, with non-overlapping confidence intervals indicating a robust gain. A pathologist-led qualitative review shows better preservation of morphology-relevant terminology and fewer unsupported diagnoses, while revealing failure modes such as concept mixing and inherited over-specific labeling. Overall, retrieval-guided captioning offers a more transparent and reliable approach with clearer opportunities for auditing than fully generative methods.


[47] Dino-NestedUNet: Unlocking Foundation Vision Encoders for Pathology Tumor Bulk Segmentation via Dense Decoding cs.CVPDF

Tianyang Wang, Ziyu Su, Abdul Rehman Akbar, Usama Sajjad, Usman Afzaal

TL;DR: 本文提出Dino-NestedUNet框架,通过将预训练的DINOv3视觉基础模型编码器与嵌套密集解码器耦合,解决了现有方法在病理学浸润性肿瘤块分割中因容量不匹配导致的边界保真度受限问题。该解码器采用密集中间路径网格实现连续特征复用和多尺度重校准,以在重建过程中对齐高层语义与低层形态纹理。

Details

Motivation: 当前许多方法将冻结的视觉基础模型与轻量级解码器配对,导致容量不匹配,限制了浸润性肿瘤块分割的边界保真度。本文旨在通过密集解码策略更好地解锁基础编码器在边界敏感病理分割任务中的潜力。

Result: 在三个组织病理学数据集(多中心CHTN、机构OSU和CAMELYON16)上评估,Dino-NestedUNet相比UNet++和标准Dino-UNet变体取得了一致的改进,尤其在跨域偏移下表现更优。在零样本评估中(在CHTN上训练,直接在未见过的TIGER WSIBULK和OSU CRC队列上测试且不进行微调),也展示了良好的外部泛化能力。

Insight: 创新点在于提出了嵌套密集解码器结构,通过密集网格中间路径实现连续特征复用和多尺度重校准,有效对齐了语义与纹理信息。从客观角度看,该方法的核心洞察是密集解码是解锁基础编码器在边界敏感病理分割任务中的关键要素,为利用预训练视觉基础模型提供了新的解码器设计思路。

Abstract: Vision foundation models (VFMs), such as DINOv3, provide rich semantic representations that are promising for computational pathology. However, many current adaptations pair frozen VFMs with lightweight decoders, creating a capacity mismatch that often limits boundary fidelity for infiltrative tumor bulk segmentation. This paper presents Dino-NestedUNet, a framework that couples a pre-trained DINOv3 encoder with a Nested Dense Decoder. Instead of sparse skip connections and linear upsampling, the proposed decoder forms a dense grid of intermediate pathways to enable continuous feature reuse and multi-scale recalibration, aligning high-level semantics with low-level morphological textures during reconstruction. We evaluate Dino-NestedUNet on three histopathology cohorts (multi-center CHTN, institutional OSU, and CAMELYON16) and observe consistent improvements over UNet++ and standard Dino-UNet variants, particularly under cross-domain shift. To further assess external generalization, we perform zero-shot evaluation by training on CHTN and directly testing on unseen TIGER WSIBULK and OSU CRC cohorts without fine-tuning. These results suggest that dense decoding is a key ingredient for unlocking foundation encoders in boundary-sensitive pathology segmentation.


[48] When Less Is More: Simplicity Beats Complexity for Physics-Constrained InSAR Phase Unwrapping cs.CV | cs.AI | cs.LGPDF

Prabhjot Singh, Manmeet Singh

TL;DR: 本文挑战了InSAR相位解缠领域盲目采用高复杂度计算机视觉架构(如注意力机制)的趋势,通过大规模架构消融实验证明,简单的U-Net在物理约束的地球物理回归任务中显著优于复杂的注意力模型,并揭示了注意力机制引入非物理高频伪影的问题。

Details

Motivation: 解决InSAR火山和地震监测中相位解缠的计算瓶颈,质疑当前行业盲目采用复杂计算机视觉架构(如注意力机制)而未验证其是否适用于物理约束的地球物理回归问题。

Result: 在LiCSAR基准测试(20帧,39,724个补丁,6.51亿像素)上,普通U-Net(776万参数)达到R²=0.834和RMSE=1.01厘米,在R²上比基于注意力的1137万参数模型高出34%,在RMSE上高出51%,且推理延迟为2.92毫秒(加速2.5倍),满足操作预警系统的亚100毫秒要求。

Insight: 创新点在于通过功率谱密度分析提供了物理依据,表明注意力机制在自然图像中擅长捕捉尖锐语义边缘,但会向地球物理场注入非物理高频伪影(>0.3周期/像素),违反弹性表面变形的基本平滑约束;主张在ML4RS中采用基于物理的简单性,卷积局部性在平滑场回归中优于现代复杂性。

Abstract: Operational phase unwrapping is the primary computational bottleneck in InSAR-based volcanic and seismic monitoring. We challenge the industry trend of adopting high-complexity computer vision architectures, such as attention mechanisms, without validating their suitability for physics-constrained geophysical regression. We present the first large-scale architectural ablation study on a global LiCSAR benchmark (20 frames, 39,724 patches, 651M pixels). Our results reveal a significant “complexity penalty”: a vanilla U-Net (7.76M parameters) achieves $R^2=0.834$ and RMSE $= 1.01$ cm, outperforming 11.37M-parameter attention-based models by 34% in $R^2$ and 51% in RMSE. Power Spectral Density (PSD) analysis provides the physical justification: while attention excels at capturing sharp semantic edges in natural images, it injects unphysical high-frequency artifacts ($>0.3$ cycles/pixel) into geophysical fields, violating the fundamental smoothness constraints of elastic surface deformation. With a 2.92ms inference latency (a $2.5\times$ speedup), the vanilla U-Net is the only candidate to comfortably meet the sub-100ms requirement for operational early-warning systems. This work bridges the “publication-to-practice” gap by proving that convolutional locality outperforms modern complexity for smooth-field regression, advocating for physics-informed simplicity in ML4RS. Code available at https://github.com/prabhjotschugh/When-Less-is-More-InSAR-Phase-Unwrapping


[49] LatentDiff: Scaling Semantic Dataset Comparison to Millions of Images cs.CV | cs.LGPDF

James Flora, Kowshik Thopalli, Akshay R. Kulkarni, Weng-Keen Wong, Shusen Liu

TL;DR: LatentDiff是一个可扩展的语义数据集比较框架,直接在预训练视觉编码器的潜在空间中操作,通过结合稀疏自编码器的散度测试和密度比估计,以较低计算成本识别数据集间的可解释语义差异。

Details

Motivation: 解决现有基于描述的方法计算成本高的问题,并应对现实世界中稀疏分布偏移带来的挑战,特别是在数据集间仅有极小比例图像存在语义差异时。

Result: 在Noisy-Diff基准测试中,LatentDiff在准确性上优于现有方法,并在仅5%到小于1%的图像存在语义差异的极端稀疏设置下保持鲁棒性,达到SOTA水平。

Insight: 创新点包括利用预训练视觉编码器的潜在空间进行高效比较,以及结合稀疏自编码器和密度比估计来提升可解释性和可扩展性;客观分析认为该方法为大规模数据集比较提供了轻量级解决方案。

Abstract: We present LatentDiff, a scalable framework for semantic dataset comparison that operates directly in the latent space of pretrained vision encoders. By combining sparse autoencoder-based divergence testing with density ratio estimation, LatentDiff identifies interpretable semantic differences between datasets at a fraction of the computational cost of caption-based alternatives. We also introduce Noisy-Diff, a benchmark capturing realistic sparse distribution shifts that cause existing methods to struggle. Experiments demonstrate that LatentDiff achieves superior accuracy while remaining robust to settings where an extremely small fraction of images (from 5% to <1% ) differ semantically.


[50] RA-CMF: Region-Adaptive Conditional MeanFlow for CT Image Reconstruction cs.CV | cs.AIPDF

Md Shifatul Ahsan Apurba, Md Selim, Jin Chen

TL;DR: 本文提出了一种名为RA-CMF的区域自适应条件均值流方法,用于CT图像重建。该方法结合了条件均值流网络和强化学习驱动的策略网络,前者通过预测图像条件流场来建模增强轨迹,后者则根据空间位置自适应地控制图像块的细化预算和停止标准,从而在提升图像质量的同时优化计算资源分配。

Details

Motivation: 由于成像协议和扫描仪型号的差异,不同方式获取的CT图像在噪声统计、对比度和纹理上存在较大差异,这影响了其在肺癌筛查、诊断和治疗规划中的应用。本研究旨在开发一种能够自适应地处理这些差异、提升CT图像重建质量的方法。

Result: 在肿瘤ROI区域,该方法取得了高精度,平均放射组学特征一致性相关系数(CCC)为0.96,平均PSNR为31.30 ± 4.16,平均SSIM为0.94 ± 0.07。整体图像质量也有提升,平均PSNR为34.23 ± 1.71,平均SSIM为0.95 ± 0.01。

Insight: 创新点在于将基于条件流的图像增强与基于强化学习的空间增强控制相结合,通过策略网络实现区域自适应的细化过程,能够针对困难区域集中增强,同时稳定已达标区域,从而在保证质量的同时提高计算效率。这是一种将生成模型与强化学习决策相结合以优化医学图像处理流程的有效思路。

Abstract: The use of CT imaging is important for screening, diagnosis, therapy planning, and prognosis of lung cancers. Unfortunately, due to differences in imaging protocols and scanner models, CT images acquired by different means may show large differences in noise statistics, contrast, and texture. In this study, we develop a novel conditional MeanFlow pipeline for CT image reconstruction. We introduce a conditional MeanFlow network that models the enhancement trajectory by predicting image-conditioned flow fields given intermediate image states. The image enhancement network is trained with a MeanFlow consistency loss along with the image reconstruction loss. In order to provide an adaptive refinement process in terms of spatial location of enhancements, we integrate a regional reinforcement learning-driven policy network into our approach. The policy network receives information about the MeanFlow rollouts and provides predictions in terms of tile-wise refinement budgets, stopping criteria, and total budget allocation of enhancement processes. Our policy network is trained through reinforcement learning in a policy gradient framework, where the goal of the training reward is to maximize improvement of enhancements while minimizing unnecessary computations and avoiding instabilities. In this way, our approach combines conditional flow-based enhancement with reinforcement learning-based spatial enhancement control. This allows our approach to focus more attention on enhancing difficult areas while stabilizing areas already showing sufficient quality. Our results show high accuracy in the tumor ROI, with the average radiomic feature CCC being 0.96, an average PSNR of 31.30 $\pm$ 4.16, and average SSIM of 0.94 $\pm$ 0.07. Moreover, there is an improvement in the overall quality of images, with an average PSNR of 34.23 $\pm$ 1.71 and average SSIM of 0.95 $\pm$ 0.01.


[51] Validation of Whole-Slide Foundation Models for Image Retrieval in TCGA Data cs.CV | cs.IRPDF

Tianhao Lei, Parsa Esmaeilkhani, Saghir Alfasly, Wataru Uegami, Judy C. Boughey

TL;DR: 本研究在TCGA数据集上评估了十种全玻片图像检索方法,包括四种预训练的全玻片基础模型、一种基于注意力的监督聚合方法以及多种基于图像块的检索方法。研究发现,不同器官和诊断间的性能差异远大于模型架构间的差异;尽管TITAN模型整体表现最佳,但其优势有限,基于图像块和监督聚合的方法也能达到相近的检索准确率。研究指出,基于形态学的检索存在内在性能上限,并强调了器官特异性评估、诊断感知策略以及多模态检索框架的必要性。

Details

Motivation: 动机在于评估全玻片基础模型在计算病理学图像检索中的实际价值,并与基于图像块的强基线方法进行比较,以明确其相对优势与局限性。

Result: 在TCGA数据集(涵盖17个器官和60种诊断的9,387张诊断玻片)上,使用患者级别的留一法评估。最佳模型TITAN仅达到约68% ± 21%的Top-1检索准确率,且某些亚型的准确率为0%。监督注意力聚合(ABMIL)和基于图像块的方法在Top-1和Top-3准确率上与之相当,没有模型在所有情况下都占优。

Insight: 创新点在于系统性地比较了全玻片基础模型与多种基线方法,揭示了性能主要由图像块级特征表示驱动,而玻片级聚合的收益有限。客观分析表明,研究挑战了通用最优架构的存在,支持器官解析的基准测试、诊断感知或集成策略、更强的特征表示以及多模态检索框架,并强调了基于形态学表示的固有局限性和临床部署前仍需重大进展。

Abstract: Foundation models are reshaping computational histopathology, yet their value for whole-slide image retrieval relative to strong patch-based and supervised aggregation baselines remains unclear. We benchmarked ten pipelines on 9,387 diagnostic slides spanning 17 organs and 60 diagnoses from The Cancer Genome Atlas (TCGA) using patient-level leave-one-patient-out evaluation. Methods included four pre-trained slide foundation models, a supervised attention-based multiple instance learning (ABMIL) aggregator on patch embeddings, and patch-level retrieval across five sampling densities. Performance varied more across organs and diagnoses than across architectures. Although the slide foundation model TITAN achieved the strongest overall results, its advantage was modest; ABMIL and patch-based methods reached comparable Top-1 and Top-3 accuracy, with no model consistently dominant. Morphologically distinctive entities approached ceiling performance, while rare, heterogeneous, and closely related subtypes remained challenging. Misclassifications aligned with organs exhibiting known inter-observer variability, suggesting an intrinsic ceiling for morphology-only retrieval. Performance was driven primarily by patch-level feature representations, with limited benefit from slide-level aggregation, indicating aggregation may be unnecessary in many settings. These findings argue against a universally optimal architecture and instead support organ-resolved benchmarking, diagnosis-aware or ensemble strategies, stronger feature representations, and multimodal retrieval frameworks. Notably, even the best model achieved only $\approx 68% \pm 21%$ retrieval accuracy on TCGA, and some subtypes showed $0%$ accuracy across all methods, highlighting fundamental limitations of morphology-based representations and the need for substantial progress before reliable clinical deployment.


[52] Generalized Category Discovery under Domain Shifts: From Vision to Vision-Language Models cs.CV | cs.AI | cs.LGPDF

Hongjun Wang, Po Hu, Kai Han

TL;DR: 本文研究了在领域偏移下的广义类别发现(GCD)问题,提出三种基于不同基础模型(自监督视觉模型到视觉语言模型)的框架:HiLo、HLPrompt和VLPrompt,以处理未标记数据中的领域和语义偏移,并在合成和真实多领域数据集上取得显著改进。

Details

Motivation: 现有GCD方法假设所有数据来自单一领域,但现实世界未标记数据常同时存在领域偏移和语义偏移,因此需要开发能适应领域变化的GCD方法。

Result: 在合成损坏和真实多领域偏移数据集上的广泛实验表明,所提方法相对于强基线模型取得了一致的性能提升。

Insight: 核心创新在于通过多级特征提取和互信息最小化解耦领域与语义特征,并结合语义感知的空间提示调优、因子化文本提示和跨模态一致性正则化,使方法能灵活适配不同基础模型(视觉或视觉语言),适用于多种部署场景。

Abstract: Generalized Category Discovery (GCD) aims to categorize unlabelled instances from both known and unknown classes by transferring knowledge from labelled data of known classes. Existing methods assume all data comes from a single domain, yet real-world unlabelled data often exhibits domain shifts alongside semantic shifts. We study GCD under domain shifts and propose three frameworks that adapt foundation models, ranging from self-supervised vision models to vision-language models. (i) HiLo disentangles domain and semantic features through multi-level feature extraction and mutual information minimization, combined with PatchMix augmentation and curriculum sampling. (ii) HLPrompt extends HiLo with semantic-aware spatial prompt tuning to suppress background and domain noise. (iii) VLPrompt leverages vision-language models via factorized textual prompts and cross-modal consistency regularization. The three methods share core design principles while operating on different foundation backbones, making them suitable for different deployment scenarios. Extensive experiments on synthetic corruptions and real-world multi-domain shifts demonstrate consistent improvements over strong baselines. Project page: https://visual-ai.github.io/hilo/


[53] TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation cs.CV | cs.AI | cs.LGPDF

Han Gong, Zhen Zhou, Yunyang Shi, Yan Tan, Jinbiao Huo

TL;DR: 本文提出了TRIP-Evaluate,一个用于评估大模型在交通领域性能的开放多模态基准测试。该基准包含837个测试项,涵盖文本、图像和点云数据,并采用角色-任务-知识分类法进行组织,旨在全面评估模型在法规问答、交通管理、工程计算和自动驾驶场景推理等任务中的能力。

Details

Motivation: 现有通用基准测试难以评估大模型在规则密集、计算密集、安全关键的交通工作流中的实际能力,而现有交通领域基准则范围狭窄且缺乏对文本、图像和点云数据的细粒度诊断支持。

Result: 在多种模型上的测试结果表明,基于文本的性能在提升,但在多步工程计算、规则约束推理、多模态场景理解和点云理解方面仍存在显著不足。

Insight: 论文的创新点在于构建了一个系统化、可诊断且与工程实践对齐的交通领域多模态评估基准,其标准化的项目构建、质量控制、提示和解码流程提高了模型间的可比性,为模型选择、回归测试和安全部署提供了基线。

Abstract: Large language models (LLMs) and multimodal large models (MLLMs) are increasingly used for transportation tasks such as regulation question answering, traffic management support, engineering review, and autonomous-driving scene reasoning. Yet transportation workflows are rule-intensive, computation-intensive, safety-critical, and inherently multimodal. Existing general benchmarks provide limited evidence of whether a model can apply regulations correctly, perform verifiable engineering calculations, or interpret traffic scenes reliably, while the small number of public transportation benchmarks remain narrow in scope and rarely support fine-grained diagnosis across text, images, and point-cloud data. To address this gap, we present TRIP-Evaluate, an open multimodal benchmark for large models in transportation. The benchmark organizes 837 items using a role-task-knowledge taxonomy that covers vehicle, traffic-management, traveler, and planning-and-design functions. Each item is annotated with capability, modality, and difficulty labels, enabling diagnosis from overall accuracy down to specific failure modes. The current release includes 596 text items, 198 image items, and 43 point-cloud items. TRIP-Evaluate also standardizes item construction, quality control, prompting, decoding, and scoring to improve cross-model comparability. Results on a diverse panel of models show that text-based performance is improving, but substantial weaknesses remain in multi-step engineering calculation, rule-constrained reasoning, multimodal scene understanding, and point-cloud understanding. Overall, TRIP-Evaluate provides a reproducible, diagnosable, and engineering-aligned evaluation baseline for model selection, regression testing, and safer deployment in transportation applications.


[54] Rethink MAE with Linear Time-Invariant Dynamics cs.CV | cs.AIPDF

Zice Wang

TL;DR: 本文挑战了视觉表示探测中忽略token顺序的传统范式,提出SSMProbe框架,利用状态空间模型作为对顺序敏感的探测器,揭示冻结视觉表示中token顺序是可利用的关键维度。研究发现不同预训练目标(如MAE、DINOv2、ViT)会形成不同的token结构异质性,而SSMProbe通过学习软排列能有效提取这些信息,为视觉表示分析提供新诊断工具。

Details

Motivation: 标准视觉表示探测方法(如全局平均池化或CLS token)将patch表示视为无序词袋,忽略了token顺序这一关键维度。本文旨在证明冻结视觉表示中的token顺序是可利用的重要信息维度,并开发能敏感捕捉顺序信息的探测框架。

Result: 在标准和细粒度分类基准测试中,固定扫描启发式方法在高度局部化的patch特征上表现很差,而学习的软排列方法能从高度局部化的patch序列中提取出极具竞争力的性能。SSMProbe通过学习路由有效发现并利用了token异质性。

Insight: 创新点在于将token排序重新定义为信息调度问题,并引入基于Sinkhorn的可微分软排列学习机制;客观分析发现,预训练目标(MAE、BEiT、DINOv2、ViT)会形成不同的token结构异质性,且这种异质性与顺序相关而非仅仅是空间网格的拓扑属性,这为理解视觉表示提供了新视角。

Abstract: Standard representation probing for visual models relies on mathematically permutation-invariant operations like Global Average Pooling (GAP) or CLS tokens, treating patch representations as an unstructured bag-of-words. We challenge this paradigm by demonstrating that token order is a critical, exploitable dimension in frozen visual representations (e.g., MAE, BEiT, DINOv2, and ViT as CLS-ablation extreme). We propose SSMProbe, a probing framework driven by a State Space Model (SSM). Operating as discrete Linear Time-Invariant (LTI) dynamical systems, SSMs act as permutation-sensitive probes where sequence order strictly dictates the final state due to inherent memory decay. Formulating token ordering as an information scheduling problem, we compare fixed scan heuristics against a differentiable soft permutation (Sinkhorn-based) learned from downstream supervision. Evaluations on standard and fine-grained classification benchmarks reveal a striking order gap: while fixed scans fail dramatically on highly localized patch features, our learned soft permutation successfully extracts highly competitive performance from otherwise heavily localized patch sequences. We find that pre-training objectives fundamentally shape token structure: DINOv2 concentrates global semantics in optimized CLS tokens leaving patches hyperspecialized, pure MAE preserves distributed representations with heterogeneous patch informativeness, and ViT represents a supervised CLS-dominated extreme. BEiT occupies middle ground. This heterogeneity is order-dependent – meaning the SSM probe’s performance depends critically on which tokens are placed at which temporal positions – and is not merely a topological property of the spatial grid. SSMProbe’s learned routing effectively discovers and exploits this heterogeneity, offering a powerful new diagnostic lens for visual representation analysis.


[55] Energy-Based Constraint Networks: Learning Structural Coherence Across Modalities cs.CV | cs.CLPDF

Chirag Shinde

TL;DR: 本文提出了一种基于能量的约束网络,这是一种模态无关的架构,通过对比学习结构一致性。该系统利用双头注意力状态空间模型处理冻结编码器的嵌入,生成衡量结构一致性的标量能量以及定位违规的逐位置能量分数。多个独立训练的分支可检测不同类型的违规,并在推理时组合而不相互干扰。

Details

Motivation: 动机是学习跨模态的结构一致性,通过能量模型显式建模结构违规,并实现模态无关和编码器无关的灵活架构。

Result: 在文本领域,使用冻结BERT和740万可训练参数,在训练过的损坏类型上达到93.4%准确率,在9种未见类型上达到87.2%;在视觉领域,使用冻结DINOv2和每分支360万参数,在FaceForensics++ Deepfakes上达到0.959 AUC,在未使用训练数据的Celeb-DF上达到0.870 AUC,具有竞争力的深度伪造检测性能。

Insight: 创新点包括:首次将模态内结构一致性学习为显式的能量景观并进行逐位置分解;通过仅重新指定损坏策略即可实现跨模态迁移;验证了组合分支需要表示兼容性;架构支持灵活训练(设计者指定损坏、真实世界配对数据或两者结合)。

Abstract: We introduce energy-based constraint networks – a modality-agnostic architecture that learns structural coherence from contrastive pairs. The system processes frozen encoder embeddings through a state-space model with dual-head attention, producing a scalar energy measuring structural consistency alongside per-position energy scores that localize violations. Multiple independently trained branches detect different violation types and compose at inference without interference. We demonstrate the framework in two domains. In text, the system achieves 93.4% accuracy on trained corruption types and 87.2% on 9 unseen types, using frozen BERT and 7.4M trainable parameters. In vision, the same architecture achieves competitive deepfake detection: 0.959 AUC on FaceForensics++ Deepfakes and 0.870 on Celeb-DF without any Celeb-DF training data, using frozen DINOv2 and 3.6M parameters per branch. The framework supports flexible training: branches learn from designer-specified corruptions, real-world paired data, or both. Composable branches require representation compatibility – a finding validated through extensive experimentation where five incompatible approaches failed before the compatible one succeeded. The architecture is encoder-agnostic and domain-agnostic: changing the domain requires only new corruption strategies; changing the encoder requires only a new input projection layer. To our knowledge, this is the first architecture to learn within-modality structural coherence as an explicit energy landscape with per-position decomposition, and to demonstrate that the same architecture transfers across modalities via corruption respecification alone.


[56] WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild cs.CVPDF

Junzhe Huang, Xiaoxiao Sun, Yan Yang, Yuxuan Hou, Ruotian Zhang

TL;DR: 本文介绍了WildTableBench,这是首个针对真实场景中自然出现的表格图像进行问答评估的基准测试。该基准包含从在线论坛和网站收集的402张高信息密度表格图像,以及928个手动标注和验证的问题,涵盖五个类别下的17个子类型。作者评估了21个前沿的专有和开源多模态基础模型,发现仅有一个模型准确率超过50%,其余模型准确率在4.1%至49.9%之间。诊断分析揭示了模型在结构感知和推理方面的持续弱点。

Details

Motivation: 当前多模态基础模型在表格图像分析方面的评估主要依赖于结构化文本表格或干净的渲染图像,未能充分探索真实世界中表格图像的视觉复杂性(如多样布局和领域),这限制了模型在实际应用中的表现。

Result: 在WildTableBench基准上,21个评估模型中仅有一个模型准确率超过50%(具体模型未指定),其余模型准确率介于4.1%到49.9%之间,表明当前模型在处理真实世界表格图像时整体表现不佳。

Insight: 创新点在于构建了首个专注于真实世界表格图像理解的问答基准,强调了视觉复杂性和多领域需求;从客观角度看,该基准提供了诊断工具,揭示了多模态模型在结构感知和数值推理方面的关键弱点,为未来模型改进指明了方向。

Abstract: Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored. Such images feature varied layouts and diverse domains that demand sophisticated structural perception and numerical reasoning. To bridge this gap, we introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source multimodal foundation models on this benchmark. Only one model exceeds 50% accuracy, while all remaining models range from 4.1% to 49.9%. We further conduct diagnostic analyses to characterize model failures and reveal persistent weaknesses in structural perception and reasoning. These results and analyses provide useful insights into current model capabilities and establish WildTableBench as a valuable diagnostic benchmark for table image understanding.


[57] EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness cs.CV | cs.AIPDF

Yueru Sun, Yimeng Zhang, Haoyu Gu, Nuo Chen, Dong She

TL;DR: 该论文提出了EmoMM基准,用于系统评估多模态大语言模型在模态冲突和缺失情况下的情感识别能力,并揭示了视频贡献崩溃现象。针对此问题,作者提出了CHASE方法,一种轻量级的注意力引导机制,能在推理时检测模态冲突并调整注意力,有效提升模型在复杂情感场景中的可靠性。

Details

Motivation: 多模态情感识别对于理解真实世界交互至关重要,但现有MLLM在模态冲突和缺失情况下的内部决策机制尚未得到充分探索,因此需要系统性的基准和方法来评估和改进其鲁棒性。

Result: 实验结果表明,提出的CHASE方法在各种设置下持续提升了性能,显著增强了MLLM在复杂情感场景中的可靠性。

Insight: 论文的创新点在于构建了包含模态对齐、冲突和缺失子集的综合基准EmoMM,并揭示了视频贡献崩溃现象;提出的CHASE机制通过推理时注意力引导来缓解决策偏差,无需重新训练主干模型,是一种轻量且有效的解决方案。

Abstract: Multimodal Emotion Recognition (MER) is critical for interpreting real-world interactions. While Multimodal Large Language Models (MLLM) have shown promise in MER, their internal decision-making mechanisms under modality conflict and missingness remain largely underexplored. In this paper, to systematically investigate these behaviors, we introduce EmoMM, a comprehensive benchmark featuring modality-aligned, conflict, and missing subsets. Through extensive evaluation, we uncover a Video Contribution Collapse (VCC) phenomenon, where MLLM marginalize video evidence due to high token redundancy and modality preferences. To address this, we propose Conflict-aware Head-level Attention Steering (CHASE), a lightweight mechanism that detects modality conflicts and performs inference-time attention steering, effectively mitigating decision bias without retraining the backbone. Experimental results demonstrate that CHASE consistently improves performance across various settings, significantly enhancing the reliability of MLLM in complex affective scenarios.


[58] ScribbleEdit: Synthetic Data for Image Editing with Scribbles and Text cs.CVPDF

Anya Ji, George Ma, Téa Wright, Yiming Zhang, David Chan

TL;DR: 本文提出ScribbleEdit,一个结合自然语言指令和手绘涂鸦的大规模合成数据集,用于提升图像编辑的精确可控性。通过合成流程自动生成源-目标图像对,并配以人工涂鸦和VLM生成的文本指令,用于评估和微调扩散模型与自回归统一多模态图像编辑模型。

Details

Motivation: 现有生成模型在图像编辑中难以同时实现精确的空间布局控制和语义细节传达,因为文本指令缺乏空间特异性,而涂鸦无法表达详细视觉属性,且缺乏专门的训练数据来联合解释抽象涂鸦和文本。

Result: 实验表明,现成模型在处理抽象涂鸦输入时表现不佳,但在ScribbleEdit数据集上微调后,其生成空间对齐和语义一致的编辑能力显著提升。

Insight: 创新点在于构建了结合涂鸦和文本指令的合成数据集,以解决多模态控制的数据稀缺问题;客观分析认为,该方法通过数据驱动方式增强了模型对抽象空间输入的理解,为可控图像编辑提供了实用解决方案。

Abstract: Recent progress in generative models has significantly advanced image editing capabilities, yet precise and intuitive user control remains difficult. Specifically, users often struggle to communicate both exact spatial layouts and specific semantic details simultaneously. While natural language instructions effectively convey high-level semantics like texture and color, they lack spatial specificity. Conversely, freehand scribbles provide rough spatial boundaries but cannot express detailed visual attributes. Consequently, achieving precise control requires combining both modalities. However, existing models struggle to jointly interpret abstract scribbles alongside text due to a lack of specialized training data. In this work, we introduce ScribbleEdit, a large-scale synthetic dataset designed to bridge this gap by combining natural language instructions with freehand scribble inputs for more accurate, controllable edits. We construct this dataset through a synthetic pipeline that automatically generates source-target image pairs via inpainting, which are then paired with human-drawn scribbles and VLM-generated text instructions. Using ScribbleEdit, we evaluate and finetune both diffusion-based and autoregressive unified multimodal image editing models. Our experiments reveal that while off-the-shelf models struggle with abstract scribble inputs, finetuning on our synthetic dataset significantly improves their ability to generate spatially aligned and semantically consistent edits.


[59] Semantic Context-aware mOdality fUsion Transformer (SCOUT): A Context-Aware Multimodal Transformer for Concept-Grounded Pathology Report Generation cs.CV | cs.AIPDF

Suryakant Singh, Saarthak Kapse, Joel Saltz, Prateek Prasanna

TL;DR: 本文提出了SCOUT(语义上下文感知模态融合Transformer),一个用于病理报告生成的上下文感知、概念基础的多模态框架。该框架通过全局切片信息和显式诊断概念对图像表示进行渐进式调节,整合了局部组织学模式、全切片上下文和专家标注的语义描述符,以生成临床连贯的报告。

Details

Motivation: 解决现有病理基础模型在生成报告时缺乏临床基础,难以准确表示病理学家观察到的关键诊断概念和关系的问题,以及整合跨细粒度细胞模式、切片级组织架构和高级诊断概念等异构视觉证据的困难。

Result: 在TCGA-BRCA、MICCAI REG和HistAI数据集上,使用CONCH1.5特征,SCOUT在BLEU-1到BLEU-4以及METEOR指标上均优于WSI-Caption、HistGen和BiGen模型,并在TCGA-BRCA和MICCAI REG上取得了最佳的ROUGE-L分数。具体在TCGA-BRCA上达到BLEU-1/2/3/4为0.436/0.303/0.202/0.156,METEOR为0.204;在REG 2025上达到0.865/0.834/0.805/0.780和0.568。

Insight: 创新点在于提出了一个渐进式上下文调节的、概念基础的多模态融合框架,通过深度感知的上下文调制和自适应多模态融合,动态细化视觉特征,以生成临床连贯且保持多尺度表征互补性的病理报告。

Abstract: Whole-slide images (WSIs) present a fundamental challenge for computational pathology due to their extreme resolution, multi-scale heterogeneity, and the requirement for clinically reliable interpretation. Although recent pathology foundation models have enabled fluent report generation, they often lack clinical grounding, failing to accurately represent key diagnostic concepts and relationships observed by pathologists. This limitation arises from the difficulty of integrating heterogeneous visual evidence spanning fine-grained cellular patterns, slide-level tissue architecture, and high-level diagnostic concepts, while maintaining interpretability and clinical coherence. Here we present SCOUT: Semantic Context-aware mOdality fUsion Transformer, a context-aware concept-grounded multimodal framework for pathology report generation that enables progressive conditioning of image representations by global slide information and explicit diagnostic concepts. The method integrates local histological patterns, whole-slide context, and expert-curated semantic descriptors within a unified learning paradigm, allowing visual features to be dynamically refined throughout the encoding process. By combining depth-aware contextual modulation with adaptive multimodal fusion during text generation, the framework produces clinically coherent reports while preserving complementarity across representational scales. Using CONCH1.5 features, we evaluate SCOUT against WSI-Caption, HistGen, and BiGen on TCGA-BRCA, MICCAI REG, and HistAI. SCOUT achieves the best BLEU-1 to BLEU-4 and METEOR scores on all datasets, plus the best ROUGE-L on TCGA-BRCA and MICCAI REG. On TCGA-BRCA, it reaches 0.436/0.303/0.202/0.156 BLEU-1/2/3/4 and 0.204 METEOR; on REG 2025, it achieves 0.865/0.834/0.805/0.780 and 0.568. These results support progressive contextual conditioning for grounded pathology report generation.


[60] CEZSAR: A Contrastive Embedding Method for Zero-Shot Action Recognition cs.CVPDF

Valter Estevam, Rayson Laroca, Helio Pedrini, David Menotti

TL;DR: 本文提出了一种基于对比学习的零样本动作识别方法CEZSAR,旨在解决训练时未见过的动作类别的分类问题。该方法通过构建视频和自然语言描述的联合嵌入空间,并设计自动负采样策略来增强训练数据,从而缓解语义鸿沟和领域偏移问题。

Details

Motivation: 解决零样本动作识别中的语义鸿沟(文本与视觉表示之间的差异)和领域偏移(训练集与测试集之间的差异)两大挑战。

Result: 在UCF-101和Kinetics-400数据集上的多个划分配置下达到了最先进的性能水平(SOTA)。

Insight: 通过联合嵌入空间对齐视频与自然语言描述,并引入自动负采样生成不匹配的视觉-文本对以增强对比学习效果,这为多模态表示学习提供了可借鉴的思路。

Abstract: This paper proposes a novel Zero-Shot Action Recognition~(ZSAR) method based on contrastive learning. In ZSAR, we aim to classify examples from classes that were missing during training. Two well-known problems remain in ZSAR: the semantic gap and the domain shift. A semantic gap occurs because label representations come from the textual domain (i.e., language models) and must be associated with visual representations (i.e., CNNs, RNNs, transformer-based). This multimodal nature implies that the semantic properties of the two spaces are not identical. On the other hand, the domain shift arises from differences between the training and test sets and is inherent to ZSAR once the test set is unknown. One of the most promising methods to address both issues is learning joint embedding spaces. Therefore, we propose a new model that encodes videos and sentences in a joint embedding space, trained by aligning videos with their natural-language descriptions. We design an automatic negative sampling procedure to augment the training dataset and generate unpaired data, i.e., visual appearance and unrelated descriptions. Our results are state-of-the-art on the UCF-101 and Kinetics-400 datasets under several split configurations. Our code is available at https://github.com/valterlej/cezsar.


[61] CADFit: Precise Mesh-to-CAD Program Generation with Hybrid Optimization cs.CV | cs.LGPDF

Ghadi Nehme, Eamon Whalen, Faez Ahmed

TL;DR: CADFit是一种基于混合优化的CAD重建框架,能够从网格中恢复复杂、可编辑的CAD构造序列,通过几何反馈逐步拟合和验证参数化操作,并在多个基准测试中优于现有方法。

Details

Motivation: 解决从网格或点云等几何输入中恢复参数化CAD构造序列的挑战,因为现有方法局限于难以编辑的格式或简单操作,无法处理复杂设计。

Result: 在多个CAD基准测试中,CADFit在体积交并比和Chamfer距离上优于最先进的网格到CAD方法,并显著降低了重建CAD程序的无效率,特别是在复杂设计上。

Insight: 创新点在于将重建问题建模为结构化CAD程序上的IoU驱动优化,支持丰富的操作集(如拉伸、旋转、圆角和倒角),并结合图像重建实现端到端序列恢复,为基于学习的CAD逆向工程提供了实用基础。

Abstract: Despite recent progress, recovering parametric CAD construction sequences from geometric input, such as meshes or point clouds, is a key challenge for design and manufacturing, as existing CAD reconstruction and generation methods are largely restricted to difficult-to-edit formats like meshes or Breps or editable simple sketch-and-extrude pipelines and low-complexity datasets. We introduce CADFit, a hybrid optimization-based CAD reconstruction framework that recovers complex, editable CAD construction sequences from meshes by incrementally fitting and validating parametric operations using geometric feedback. Our approach is distinguished by formulating reconstruction as an IoU-driven optimization over structured CAD programs and supporting a rich set of operations, including extrusions, revolutions, fillets, and chamfers. Experiments on multiple CAD benchmarks show that CADFit outperforms state-of-the-art mesh-to-CAD methods in volumetric Intersection-over-Union and Chamfer Distance, while substantially reducing the Invalid Ratio of reconstructed CAD programs, particularly for complex designs. We further present a multimodal pipeline that enables end-to-end reconstruction of CAD construction sequences from images by combining image-based geometry reconstruction with CADFit. By enabling accurate reconstruction of higher-complexity CAD models, CADFit provides a practical foundation for generating richer datasets and advancing future learning-based approaches to CAD reverse engineering.


[62] TT4D: A Pipeline and Dataset for Table Tennis 4D Reconstruction From Monocular Videos cs.CVPDF

Nima Rahmanian, Daniel Kienzle, Thomas Gossard, Dvij Kalaria, Rainer Lienhart

TL;DR: 本文提出了TT4D,一个用于从单目广播视频中进行乒乓球4D重建的大规模、高保真数据集和配套重建流程。该数据集包含超过140小时的单打和双打比赛重建数据,并提供高质量相机标定、精确3D球位、球旋转、时间分割以及随时间变化的3D人体网格等多模态标注。其核心创新在于一个‘先提升后分割’的重建流程,即首先通过一个学习网络将未分割的2D球轨迹提升至3D,再基于3D轨迹进行可靠的时间分割,从而克服了传统基于2D分割方法在遮挡和多视角下的失效问题。

Details

Motivation: 现有方法基于2D球轨迹进行时间分割后再尝试重建,但在遮挡和不同摄像机视角下会失效,导致无法从通用视角的单目广播视频中可靠地重建乒乓球比赛。

Result: 该流程是唯一能够从通用视角的单目广播视频中重建乒乓球比赛的方法。通过两个下游任务(估计球拍在击球时的位姿和速度,以及训练一个竞技回合的生成模型)验证了数据集的高保真度。

Insight: 创新点在于颠覆了传统的‘先分割后重建’范式,提出了‘先提升(2D到3D)后分割’的流程。其核心是一个学习的提升网络,该网络不仅能将整个未分割的2D球轨迹提升至3D,还能推断球的旋转、处理不可靠的球检测,并在高遮挡情况下成功重建球轨迹,从而实现了大规模、高精度数据集的构建。

Abstract: We present TT4D, a large-scale, high-fidelity table tennis dataset. It provides $140+$ hours of reconstructed singles and doubles gameplay from monocular broadcast videos, featuring multimodal annotations like high-quality camera calibrations, precise 3D ball positions, ball spin, time segmentation, and 3D human meshes over time. This rich data provides a new foundation for virtual replay, in-depth player analysis, and robot learning. The dataset’s combination of scale and precision is achieved through a novel reconstruction pipeline. Prior methods first partition a game sequence into individual shot segments based on the 2D ball track, and only then attempt reconstruction. However, 2D-based time segmentation collapses under occlusion and varied camera viewpoints, preventing reliable reconstruction. We invert this paradigm by first lifting the entire unsegmented 2D ball track to 3D through a learned lifting network. This 3D trajectory then allows us to reliably perform time segmentation. The learned lifting network also infers the ball’s spin, handles unreliable ball detections, and successfully reconstructs the ball trajectory in cases of high occlusion. This lift-first design is necessary, as our pipeline is the only method capable of reconstructing table tennis gameplay from general-view broadcast monocular videos. We demonstrate the dataset’s fidelity through two downstream tasks: estimating the racket’s pose & velocity at impact, and training a generative model of competitive rallies.


[63] Exploring Prompt Alignment with Clinical Factors in Zero-Shot Segmentation VLMs for NSCLC Tumor Segmentation cs.CVPDF

Suraj Pai, Thibault Heintz, Cosmin Ciausu, Marion Tonneau, Hugo Aerts

TL;DR: 本文研究了零样本视觉语言模型(VLM)在非小细胞肺癌(NSCLC)肿瘤分割任务中,其空间行为如何受不同提示维度(如诊断、人口统计、分期、解剖位置等)的影响。通过系统分析VoxTell模型在内部数据集上的表现,发现解剖位置是模型空间注意力的主导驱动因素,而组织学和分期影响甚微。在零样本设置下,VoxTell的分割性能与微调模型(如nnUNet)相当,并显著优于其他零样本基线。

Details

Motivation: 零样本视觉语言模型为NSCLC肿瘤分割提供了一种无需任务特定训练的可提示方法,但控制其空间行为的提示维度尚不明确。本研究旨在探究不同临床因素提示如何影响模型的分割对齐行为。

Result: 在内部NSCLC肿瘤数据集上,VoxTell在零样本设置下达到平均Dice相似系数(DSC)0.613,与微调模型nnUNet(DSC 0.690)和Ahmed等人方法(DSC 0.675)无统计学显著差异(调整后p值分别为0.156和0.679),但显著优于其他所有零样本模型。对齐分析显示,63.4%的位置扰动导致分割性能灾难性下降,而组织学和分期替换影响最小。

Insight: 创新点在于系统解构了提示维度(诊断、人口统计、分期、解剖、通用等)对分割VLM空间对齐的影响,揭示了模型更关注’在哪里看’(解剖位置)而非’看什么’(如组织学类型)。这提示评估分割VLM时,不仅应关注Dice分数,还应考虑其对不同提示维度的对齐敏感性。

Abstract: Zero-shot vision-language models (VLMs) offer a promptable alternative to task-specific training for gross tumor volume (GTV) delineation in non-small-cell lung cancer (NSCLC), but the prompt dimensions that govern their spatial behavior remain poorly understood. We study this question by probing alignment directions in VoxTell on a held-out internal NSCLC tumor dataset through sub-prompt decomposition into diagnosis, demographic, staging, anatomical, generic, and irrelevant controls; attribute-wise perturbation robustness; specificity ladders; and cross-case prompt swaps, while benchmarking against fine-tuned and zero-shot baselines using the Dice Similarity Coefficient (DSC) with Wilcoxon signed-rank tests and Benjamini-Hochberg correction. Alignment analyses revealed that anatomical location is the dominant driver of VoxTell’s spatial attention: 63.4 percent of location perturbations caused catastrophic drops, prompt specificity improved from generic to full descriptions except for diagnosis-only prompts, irrelevant prompts correctly yielded zero segmentation, and cross-case prompt swaps confirmed patient-specific conditioning (matched DSC 0.906 vs. mismatched 0.406). Histology and stage substitutions had minimal effect, indicating that the model prioritizes “where to look” over “what to look for.” In this context, VoxTell, operating fully zero-shot, achieved a mean DSC of 0.613, statistically indistinguishable from nnUNet (0.690, adjusted p = 0.156) and Ahmed et al. (0.675, adjusted p = 0.679), while significantly outperforming all other zero-shot models. Together, these findings argue that segmentation VLMs should be evaluated not only by Dice, but also by the prompt dimensions to which they align.


[64] GameScope: A Multi-Attribute, Multi-Codec Benchmark Dataset for Gaming Video Quality Assessment cs.CV | eess.IVPDF

Rajesh Sureddi, Shreshth Saini, Avinab Saha, Alan C. Bovik

TL;DR: 本文介绍了GameScope,一个针对游戏视频质量评估的多属性、多编解码器基准数据集,包含4048个视频样本,覆盖H.264、H.265和AV1编解码器,并收集了平均意见分数和细粒度质量属性,以支持跨编解码器的一致性质量评估模型。

Details

Motivation: 当前游戏视频流媒体快速发展,但缺乏大规模、多样化的主观游戏视频质量数据集,现有数据集存在局限性,无法支持跨不同编解码器的质量评估模型。

Result: 在GameScope数据集上评估了领先的视频质量评估方法,包括一个表现优于所有基准的视觉语言模型,该数据集是目前最大的游戏视频质量数据集。

Insight: 创新点在于首次构建了覆盖多编解码器和内容类型(UGC和PGC)的游戏视频质量数据集,并引入细粒度质量属性以深入理解感知因素,为跨编解码器质量评估提供了全面基准。

Abstract: The development of video game streaming has grown rapidly, with major platforms such as YouTube and Twitch using different codecs. To support quality assessment models that work consistently across any codec, it is necessary to have access to large, diverse subjective gaming quality datasets. Currently, there are only a few available, each having limitations. To address this gap, we present the largest gaming video quality dataset to date, incorporating both user-generated content (UGC) and professional-generated content (PGC) with extensive visual diversity. Our dataset covers the most widely used codecs - H.264, H.265, and AV1 - and consists of 4,048 video samples, each annotated by an average of 37 mean opinion score (MOS) ratings. In addition to overall quality scores, we collect coarse-grained quality attributes, enabling a better understanding of perceptual factors. We study the performance of leading video quality assessment methods on this dataset, including a vision language model that outperforms all the benchmarks. To the best of our knowledge, this is the first dataset that comprehensively addresses gaming video quality assessment across multiple codecs and content types with quality attributes. Our dataset is publicly available at https://rajeshsureddi.github.io/GameScope/.


[65] CNN-based Multi-In-Multi-Out Model for Efficient Spatiotemporal Prediction cs.CV | cs.AIPDF

Hyeonseok Jin

TL;DR: 本文提出了一种名为MIMO-ESP的基于CNN的多输入多输出模型,用于高效的时空预测。该模型通过将Transformer架构配置在CNN之上来考虑全局信息并显著降低复杂度,同时将时间轴作为独立轴处理并应用膨胀卷积来有效联合考虑时空信息。

Details

Motivation: 现有基于CNN的时空预测模型难以考虑全局信息且性能受限,同时将时间轴与图像通道轴混合处理;而基于Transformer的模型则因自注意力计算导致复杂度高、训练时间长。本文旨在克服这些局限性。

Result: 在视频、交通和降水预测三个有前景的基准数据集上的广泛实验表明,MIMO-ESP在实现有竞争力的效率的同时,性能超越了现有模型。消融研究结果也证明了其各组件的有效性。

Insight: 主要创新点在于提出了一种基于CNN的Transformer架构变体,以较低复杂度实现全局信息建模;同时,将时间轴作为独立维度处理并应用膨胀卷积,更有效地融合时空信息,从而在效率和性能上取得平衡。

Abstract: Recently, Convolutional Neural Network (CNN) or Transformer architecture based models have been proposed to overcome the limitations of Recurrent Neural Network (RNN) based models in spatiotemporal prediction. These models prevent the inefficiency of parallelization limitation due to the sequential properties and stacked error due to the recursive method, and show high performance. Novertheless, there are still some challengies. First, CNN based models have difficulty considering global information due to the local properties of the kernel, and their performance is limited. In addition, information is mixed because the time axis is combined with the channel axis of the image for processing. Models based on Transformer architecture have high complexity due to the self-attention calcuation and take a long training time. In this paper, we propose a new structure model called CNN-based Multi-In-Multi-Out model for Efficient Spatiotemporal Prediction (MIMO-ESP) to overcome these limitations. MIMO-ESP considers global information and significantly improves complexity by configuring a Transformer architecture based on CNN. In addition, it treats the time axis as an independent axis without combining it, and effectively considers spatiotemporal information together by applying dilation. This structure makes MIMO-ESP efficient and high performance. Extensive experiment results on three promising benchmark datasets which including video, traffic, and precipitation prediction tasks demonstrate that the usefulness of MIMO-ESP due to the achieved competitive efficiency while outperforming existing models. Furthermore, the ablation study results demonstrate the usefulness of the components of MIMO-ESP, emphasizing the potential of the proposed approaches.


[66] Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation cs.CV | cs.AI | cs.CL | cs.IRPDF

Peiyang Liu, Ziqiang Cui, Xi Wang, Di Liang, Wei Ye

TL;DR: 本文提出了一种名为Chain of Evidence (CoE)的视觉归因框架,用于解决迭代检索增强生成在视觉丰富文档上的瓶颈。该框架利用视觉语言模型直接对检索到的文档截图进行推理,输出精确的边界框以可视化证据链,并在Wiki-CoE和SlideVQA两个基准测试中验证了其有效性。

Details

Motivation: 当前基于文本解析的迭代检索增强生成存在两个关键瓶颈:一是粗粒度的归因,用户需手动在冗长文档中定位证据;二是视觉语义损失,将视觉丰富的文档转换为文本时会丢失对推理至关重要的空间逻辑和布局线索。

Result: 在Wiki-CoE和SlideVQA两个基准上的实验表明,经过微调的Qwen3-VL-8B-Instruct模型取得了稳健的性能,在需要视觉布局理解的场景中显著优于基于文本的基线方法,并为像素级可解释的iRAG建立了一个与检索器无关的解决方案。

Insight: 主要创新点在于提出了一个与检索器无关的像素级视觉归因框架,直接对文档截图进行端到端推理,避免了特定格式的解析,并可视化完整的证据链。这为处理包含图表、自由布局等视觉丰富文档的复杂问答提供了新的可解释性解决方案。

Abstract: Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) \textit{Coarse-grained attribution}, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) \textit{Visual semantic loss}, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present \textbf{Chain of Evidence (CoE)}, a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: \textbf{Wiki-CoE}, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and \textbf{SlideVQA}, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.


[67] SIFT-VTON: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On cs.CVPDF

Kosuke Takemoto, Takafumi Koshinaka

TL;DR: SIFT-VTON提出了一种基于SIFT关键点匹配的显式几何监督方法,用于增强基于扩散模型的虚拟试穿任务。该方法通过将服装与人体图像之间的SIFT关键点匹配关系转换为空间概率分布,在训练过程中监督交叉注意力层,从而学习精确的空间对齐,更好地保留服装的文本和图案细节。

Details

Motivation: 现有的基于扩散模型的虚拟试穿方法依赖交叉注意力机制隐式学习空间对应关系,难以保留服装上的精细细节(如文字和图案)。本文旨在通过引入显式的几何指导来解决这一问题。

Result: 在VITON-HD数据集上的实验表明,该方法在非配对指标上取得了显著提升,同时保持了有竞争力的配对重建指标。定性比较显示其在文本清晰度和图案对齐方面表现更优。

Insight: 核心创新点在于将经典的SIFT关键点匹配方法作为显式几何监督信号,引导扩散模型的交叉注意力层学习更精确的空间对齐。这证明了经典几何方法可以有效增强现代扩散模型在条件合成任务中的性能。

Abstract: Diffusion-based virtual try-on methods achieve photorealistic synthesis through cross-attention mechanisms that transfer garment features to target body regions. However, these approaches rely on implicit learning of spatial correspondences, struggling to preserve fine details such as text and illustrations. We propose a novel approach, which we call SIFT-VTON, that utilizes SIFT keypoint matching to provide explicit geometric guidance for diffusion-based virtual try-on. Our method applies domain-specific filtering to SIFT keypoint matches between garment and person images, then converts these correspondences into spatial probability distributions that supervise cross-attention layers during training. This explicit supervision guides the model to learn precise spatial alignment, concentrating attention on geometrically consistent garment regions. Experiments on the VITON-HD dataset demonstrate significant improvements on unpaired metrics while maintaining competitive paired reconstruction metrics. Qualitative comparisons show superior preservation of text clarity and pattern alignment. Attention visualizations confirm that our method produces sharply focused attention on relevant garment details. This work demonstrates that classical geometric correspondence methods can effectively enhance modern diffusion models for conditional synthesis tasks. The source code will be available at https://github.com/takesukeDS/SIFT-VTON.


[68] CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning cs.CVPDF

Ruichi Zhang, Chikai Shang, Jiacheng Yang, Mengke Li, Yang Zhou

TL;DR: 本文提出CUE方法,通过引入多标签概念信号来解决长尾学习中的概念混淆问题。CUE利用零样本CLIP提取实例级视觉线索,并结合LLM生成的类别级语义线索,通过加权二元逻辑调整辅助损失与基线逻辑调整损失联合优化,以恢复被破坏的类间关系。

Details

Motivation: 现有长尾学习方法主要关注缓解长尾分布偏差,但忽视了长尾分布导致的概念混淆问题,该问题源于单标签监督的互斥性抑制了相关类别间的特征共享,并放大了头部类别的优势,破坏了类间可区分性。

Result: 在多个长尾基准测试中,CUE实现了均衡且强大的性能,超越了近期的SOTA方法。

Insight: 创新点在于通过多标签概念扩展来缓解概念混淆,结合了零样本视觉模型和大语言模型提供的跨模态线索,并设计了加权二元逻辑调整损失来整合这些线索,从而更全面地建模类间关系。

Abstract: Long-tailed distributions are common in real-world recognition tasks, where a few head classes have many samples while most tail classes have very few. Recently, fine-tuning foundation models for long-tailed learning has gained attention due to their excellent performance. However, most existing methods focus solely on mitigating long-tailed distribution bias while overlooking concept confusion caused by the long-tailed distribution. In this paper, we study this problem and attribute it to the mutual exclusivity of single-label supervision under long-tailed distributions, which suppresses feature sharing among related classes and amplifies the dominance of head classes, leading to disrupted inter-class discriminability. To address this, we propose CUE, Concept-aware mUlti-label Expansion, which introduces multi-label concept signals to preserve disrupted inter-class relationships. Specifically, CUE constructs concept sets by (i) extracting instance-level visual cues from zero-shot CLIP and (ii) generating class-level semantic cues with LLM; the two cues are incorporated via separately weighted Binary Logit-Adjustment (BLA) auxiliary losses and jointly optimized with the baseline Logit-Adjustment (LA) loss. Experiments on several long-tailed benchmarks, CUE achieves balanced and strong performance, surpassing recent state-of-the-art methods. Code is available at: https://github.com/zhangruichi/CUE.


[69] Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs cs.CVPDF

Jingze Wu, Quan Zhang, Hongfei Suo, Zeqiang Cai, Hongbo Chen

TL;DR: 本文提出VideoThinker,一个受因果启发的框架,旨在解决轻量级多模态大语言模型在强化学习微调中存在的感知偏见问题。该框架通过两阶段去偏过程,首先训练一个专门的‘偏见模型’来捕捉数据偏见导致的捷径行为,然后通过因果去偏策略优化算法,使主模型主动远离偏见模型的错误逻辑,同时拉向正确且可泛化的解决方案。

Details

Motivation: 尽管强化学习提升了大型多模态语言模型的推理能力,但其在轻量级模型上的效果有限,这些模型对边缘部署至关重要。研究发现,基于强化学习的微调会迫使轻量级模型优先采用数据偏见诱导的感知捷径,而非发展真正的推理能力,因此需要解决这种感知偏见问题。

Result: 提出的VideoThinker-R1模型在视频推理效率上达到了新的最先进水平。在同规模比较中,无需监督微调且仅使用1%的强化学习训练数据,它在广泛使用的基准测试上平均超越VideoRFT-3B模型3.2%,在VideoMME上领先7%。在跨规模比较中,它超越了更大的Video-UTR-7B模型,在MVBench上提升2.1%,在TempCompass上提升3.8%。

Insight: 论文的创新点在于通过因果分析揭示感知偏见现象,并设计了一个两阶段去偏框架,包括偏见感知训练和因果去偏策略优化算法。这提供了一种有效的方法来提升轻量级模型的泛化推理能力,避免依赖数据偏见,具有实际应用价值。

Abstract: Although reinforcement learning (RL) has significantly advanced reasoning capabilities in large multimodal language models (MLLMs), its efficacy remains limited for lightweight models essential for edge deployments.To address this issue, we leverage causal analysis and experiment to reveal the underlying phenomenon of perceptual bias, demonstrating that RL-based fine-tuning compels lightweight models to preferentially adopt perceptual shortcuts induced by data biases, rather than developing genuine reasoning abilities.Motivated by this insight, we propose VideoThinker, a causal-inspired framework that cultivates robust reasoning in lightweight models through a two-stage debiasing process. First, the Bias Aware Training stage forges a dedicated “bias model” to embody these shortcut behaviors. Then, the Causal Debiasing Policy Optimization (CDPO) algorithm fine-tunes the primary model, employing an innovative repulsive objective to actively push it away from the bias model’s flawed logic while simultaneously pulling it toward correct, generalizable solutions.Our model, VideoThinker-R1, establishes a new state-of-the-art in video reasoning efficiency. For same-scale comparison, requiring no Supervised Fine-Tuning (SFT) and using only 1 of the training data for RL, it surpasses VideoRFT-3B with a 3.2% average gain on widely-used benchmarks and a 7% lead on VideoMME. For cross-scale comparison, it outperforms the larger Video-UTR-7B model on multiple benchmarks, including a 2.1% gain on MVBench and a 3.8% gain on TempCompass. Code is available at https://github.com/falonss703/VideoThinker.


[70] Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance cs.CV | cs.LGPDF

Muyang Li, Yucheng Liu, Jianbo Ma, Elliot Osborne, Bo Han

TL;DR: 本文系统研究了视觉语言模型(VLM)中视觉编码器的选择问题,发现传统基于模型大小或零样本精度的选择方法效果不佳。作者提出使用Gromov-Wasserstein距离衡量跨模态结构相似性,作为预测VLM性能的新指标,并通过大量实验验证其有效性。

Details

Motivation: 解决当前缺乏原则性方法来理解何种视觉编码器适合VLM对齐的问题,传统选择策略(如选择最大模型或最高零样本精度)与最终VLM性能关联性弱。

Result: 在超过60次完整VLM训练实验上验证,提出的基于Gromov-Wasserstein距离的推断指标显著优于其他模型选择策略,与最终VLM性能表现出强相关性,能在完整训练前有效预测性能。

Insight: 创新点在于首次将跨模态结构相似性(通过Gromov-Wasserstein距离量化)确立为视觉编码器选择的关键因素,并从理论上证明了跨模态映射的可学习性与该距离的关联,为VLM模型选择提供了高效、可解释的新准则。

Abstract: Vision-Language Models (VLMs) have enhanced traditional LLMs with visual capabilities through the integration of vision encoders. While recent works have explored various combinations of vision encoders and LLMs, there still lacks a principled understanding of what makes a vision encoder suitable for VLM alignment. In this paper, we systematically investigate this question via comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing encoders with the largest size or highest zero-shot accuracy, consistently fail to identify optimal models. In fact, these metrics show only weak to moderate correlation with VLM performance. This intriguing finding begs a fundamental question: What factors of vision-encoders matter in VLM? Through comprehensive analysis, we identify that the structural similarity across modalities plays a crucial but previously overlooked role in vision-encoder selection, which we measure using the Gromov-Wasserstein distance as a proxy. From a theoretical perspective, we show that the learnability of cross-modality mapping can be provably associated with the Gromov-Wasserstein distance. Empirical verification on 60+ full VLM training runs shows that our proposed inference-only metric performs significantly better than alternative model selection strategies and exhibits a much stronger correlation with final VLM performance, thereby enabling efficient and effective prediction of VLM performance before full training.


[71] Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay cs.CVPDF

Jin Tong, Guang Liang, Peilin Sun, Jianxin Wu

TL;DR: 本文提出了一种名为共线性衰减(Colinearity-Decay, CD)的结构正则化方法,用于在训练视觉Transformer(ViT)时控制激活异常值,从而提升模型在低比特量化部署时的性能。该方法通过惩罚Transformer块内有序矩阵对之间的有害对齐来缓解极端激活,无需改变模型架构或任务损失,且不增加推理开销。

Details

Motivation: 低比特量化是高效部署视觉Transformer的实用途径,但激活异常值阻碍了完全量化部署。现有方法要么在训练后处理量化,要么在训练中抑制大激活值,但过度限制异常值可能导致全精度与量化精度之间的权衡变差。本文认为,训练目标应控制使异常值有害的结构性放大,而非简单抑制。

Result: 在ImageNet-1K预训练、COCO检测及下游微调任务中,CD方法在多个流程中一致提升了量化精度,同时保持甚至改进了全精度性能,实现了零推理开销的低比特部署准备。

Insight: 创新点在于引入结构正则化(CD)来直接控制Transformer块内矩阵对的结构对齐,从而缓解量化有害的激活异常值,这是一种非侵入式、解耦的更新方法,在训练中引入最小开销,有效提升了模型对量化的友好性。

Abstract: Low-bit quantization is a practical route for efficiently deploying vision Transformers, yet activation outliers complicate fully quantized deployment. Existing methods either handle quantization post-training or suppress large activations during training; however, aggressively restricting outliers in vision models can lead to a poorer trade-off between full-precision and quantized accuracy. We argue that rather than simply suppressing outliers, the training objective should control the structural amplification that makes them harmful. To this end, we introduce Colinearity-Decay (CD), a structural regularizer for ordered matrix pairs within Transformer blocks. CD penalizes detrimental cross-matrix alignment and mitigates extreme activations without altering the architecture or task loss. Applied as a decoupled update, CD is non-invasive and introduces minimal training overhead. Across ImageNet-1K pre-training, COCO detection, and downstream fine-tuning, CD consistently boosts quantized accuracy across multiple pipelines while preserving, or even improving, full-precision performance. Ultimately, our results demonstrate that structural regularization effectively prepares vision Transformers for low-bit deployment with zero inference-time overhead.


[72] Active Reasoning Vision-Language Models via Sequential Experimental Design cs.CV | cs.AI | cs.LGPDF

Anjie Liu, Ziqin Gong, Yan Song, Yuxiang Chen, Xiaolong Liu

TL;DR: 本文针对现代视觉语言模型(VLMs)中存在的感知带宽瓶颈问题,提出了一种基于顺序贝叶斯最优实验设计(S-BOED)的主动推理框架。该框架将视觉感知建模为一个顺序决策过程,通过平衡空间覆盖范围和分辨率来获取精细细节,以支持复杂推理。作者提出了一种无需训练、可灵活集成多种优化算法的推理策略,并在千兆像素级基准测试中验证了其有效性,显著提升了现有SOTA模型的性能。

Details

Motivation: 现代视觉语言模型的视觉感知受限于感知带宽瓶颈:宽视野会牺牲复杂推理所需的细粒度细节。为解决这一问题,论文受主动视觉和信息觅食经典范式启发,将其建模为一个顺序决策过程。

Result: 在千兆像素级基准测试上的实证评估表明,该方法进一步提升了最先进(SOTA)模型的性能,显著优于标准基线,并有效缩小了与人工标注的“神谕”(oracle)结果之间的差距。

Insight: 创新点在于将视觉感知的带宽瓶颈问题形式化为顺序贝叶斯最优实验设计(S-BOED)问题,并推导出在连续千兆像素空间中可处理的近似解。提出的无需训练推理策略作为一个灵活模板,可容纳从贪婪采样到前瞻规划等多种优化算法,以近似最优设计,从而在保持模型通用性的同时提升细粒度推理能力。

Abstract: Visual perception in modern Vision-Language Models (VLMs) is constrained by a fundamental perceptual bandwidth bottleneck: a broad field of view inevitably sacrifices the fine-grained details necessary for complex reasoning. Inspired by the classical paradigms of active vision and information foraging, we frame overcoming this limitation as a sequential decision-making process. We formalise this process through the lens of the sequential Bayesian optimal experimental design (S-BOED) problem. While exact Bayesian inference is intractable in continuous gigapixel spaces, we derive principled yet tractable approximations that balance spatial coverage against resolution. To validate this framework, we present a training-free inference strategy as a practical instantiation of the S-BOED objective for agents equipped with multiple vision tools. Designed as a flexible template, this strategy accommodates arbitrary optimisation algorithms, ranging from efficient greedy sampling to look-ahead planning, to approximate the optimal design. Empirical evaluations on gigapixel-level benchmarks demonstrate that our approach further boosts the performance of state-of-the-art models, significantly outperforming standard baselines and effectively narrowing the gap towards human-annotated oracles.


[73] CHASE: Competing Hypotheses for Ambiguity-Aware Selective Prediction cs.CVPDF

Kartik Jhawar, Yuhao Geng, Atul N. Parikh, Lipo Wang

TL;DR: 本文提出了CHASE框架,一种基于竞争性假设的模糊感知选择性预测方法,通过显式比较结构化时间解释来决定是否做出决策或弃权,以应对部分可观测场景下的不确定性挑战。

Details

Motivation: 标准选择性预测方法通常基于单一预测分支估计不确定性,在局部时间证据可能相互矛盾的部分可观测场景中,传统置信度分数容易产生误导,因此需要一种能处理结构化模糊性的方法。

Result: 在隐藏连接推断任务中,使用受巨型单层囊泡(GUV)动力学启发的物理模拟器进行评估,CHASE在80%覆盖率下实现了整体对齐相对平均提升11.0%,在高模糊区域三向准确率相对提升8.8%,并在90%覆盖率下将整体风险降低9.9%,优于基准方法。

Insight: 创新点在于通过优化基于竞争性假设边界的排序感知选择器,显式推理竞争性假设以区分安全决策与根本不确定性,从而在结构化模糊场景中提供更可靠的决策平衡;该方法支持零样本定性迁移至真实GUV视频,无需重新训练或微调。

Abstract: Standard selective prediction methods typically estimate uncertainty from the output of a single predictive branch. While effective for general uncertainty estimation, these approaches often struggle under partial observability, where local temporal evidence can be contradictory and standard confidence scores become misleading. We introduce CHASE (Competing Hypotheses for Ambiguity-Aware Selective Prediction), a selective prediction framework that explicitly compares structured temporal explanations to determine whether to commit to a decision or abstain. Because genuine ambiguity causes the score gap between competing hypotheses to collapse, CHASE optimizes a ranking-aware selector over these hypothesis margins to globally separate safe commitments from fundamentally uncertain ones. We evaluate this framework on the problem of hidden connectivity inference, utilizing a controlled, physically grounded simulator inspired by the dynamics of giant unilamellar vesicles (GUVs), alongside zero-shot qualitative transfer (without retraining or fine tuning) to representative real GUV videos. Our experiments demonstrate that explicitly reasoning over competing hypotheses provides a superior balance of metrics. Compared to canonical uncertainty baselines, CHASE achieves statistically significant gains in overall no-abstain accuracy, three-way accuracy, and overall ambiguity-aligned abstention (at 80% coverage). Specifically, it yields up to an 11.0% relative mean improvement in overall alignment, alongside up to an 8.8% relative boost in three-way accuracy in the very-high ambiguity regime. By maintaining a selective risk boundary strictly at par with the best baselines at 80% coverage, and reducing overall risk by 9.9% at 90% coverage, this framework offers a more reliable approach to decision-making under structured ambiguity.


[74] AgriKD: Cross-Architecture Knowledge Distillation for Efficient Leaf Disease Classification cs.CV | cs.AIPDF

Minh-Dung Le, Minh-Duc Hoang, Hoang-Vu Truong, Thi-Thu-Hong Phan

TL;DR: 本文提出AgriKD,一种跨架构知识蒸馏框架,用于高效的叶片病害分类。该方法将Vision Transformer (ViT)教师模型的知识蒸馏到轻量级卷积学生模型中,通过整合输出、特征和关系层面的多个蒸馏目标来弥合Transformer与CNN之间的表示差距,从而在保持高性能的同时大幅提升效率。

Details

Motivation: 动机在于解决叶片病害自动分类在资源受限的边缘设备上部署的难题。Vision Transformers (ViTs)虽然具有强大的表示能力,但计算成本高,难以直接部署;而现有方法难以有效地将这些丰富表示迁移到轻量模型中。

Result: 在多个叶片病害数据集上的实验表明,蒸馏后的学生模型性能与教师模型相当,同时模型参数量减少约172倍,计算成本降低47.57倍,推理延迟降低18-22倍。在ONNX、TFLite Float16和TensorRT FP16等多种运行时格式上部署时,预测性能一致且精度下降可忽略,在NVIDIA Jetson边缘设备和移动应用上实现了可靠的实时推理。

Insight: 创新点在于提出了一种跨架构知识蒸馏框架,通过整合多层次的蒸馏目标(输出、特征、关系)来有效迁移ViT的全局表示能力到CNN学生模型,从而在保持高精度的同时实现显著的效率提升,为资源受限环境下的农业AI应用提供了实用解决方案。

Abstract: Automated leaf disease classification is critical for early disease detection in resource-constrained field environments. Vision Transformers (ViTs) provide strong representation capability by modeling long-range dependencies and inter-class relationships; however, their high computational cost makes them impractical for deployment on edge devices. As a result, existing approaches struggle to effectively transfer these rich representations to lightweight models. This paper introduces AgriKD, a cross-architecture knowledge distillation framework for efficient edge deployment, which transfers knowledge from a Vision Transformer (ViT) teacher to a compact convolutional student model. To bridge the representational gap between Transformer and CNN architectures, the proposed approach integrates multiple distillation objectives at the output, feature, and relational levels, where each objective captures a different aspect of the teacher knowledge. This enables the student model to better preserve and utilize transformer-derived global representations. Experiments on multiple leaf disease datasets show that the distilled student achieves performance comparable to the teacher while significantly improving efficiency, reducing model parameters by approximately 172 times, computational cost by 47.57 times, and inference latency by 18-22 times. Furthermore, the optimized model is deployed across multiple runtime formats, including ONNX, TFLite Float16, and TensorRT FP16, achieving consistent predictive performance with negligible accuracy degradation. Real-world deployment on NVIDIA Jetson edge devices and a mobile application demonstrates reliable real-time inference, highlighting the practicality of AgriKD for AI-powered agricultural applications in resource-constrained environments.


[75] VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection cs.CV | cs.ROPDF

Haowen Sun, Shaolong Zhang, Mingyang Li, Chengzhong Ma, Xinzhe Chen

TL;DR: 本文提出VoxAfford方法,用于开放词汇3D可供性检测任务,该方法通过将预训练3D VQVAE编码器的多尺度几何特征注入到多模态大语言模型生成的输出token中,以增强其空间感知能力,从而更准确地根据文本描述在点云上定位交互区域。

Details

Motivation: 现有方法利用多模态大语言模型生成特殊输出token进行分割,但这些token通过自回归生成,建模的是序列依赖而非空间邻域关系,导致其语义丰富但空间信息贫乏,难以精确进行3D定位。

Result: 在开放词汇可供性检测任务上的实验表明,VoxAfford达到了最先进的性能,mIoU指标提升了约8%,真实机器人实验也证实了其对新物体的零样本迁移能力。

Insight: 核心创新点在于绕过自回归生成的瓶颈,通过跨注意力机制,利用输出token的语义作为查询,从其配对的体素尺度中检索相关几何模式,并使用学习的兼容性门控控制注入强度,从而将丰富的几何特征与语义token融合,生成空间感知的提示以指导最终分割。

Abstract: Open-vocabulary 3D affordance detection requires localizing interaction regions on point clouds given novel affordance descriptions. Recent methods extend multimodal large language models (MLLMs) with special output tokens that are decoded into segmentation masks. However, these tokens are produced through autoregressive generation, which models sequential dependencies rather than spatial neighborhood relations, leaving them semantically rich but spatially impoverished for 3D localization. We propose Voxel-enhanced Affordance detection (VoxAfford), which bypasses this bottleneck by injecting multi-scale geometric features from a frozen pre-trained 3D VQVAE encoder into the output tokens after generation. Each output token uses its affordance semantics as a query to retrieve relevant geometric patterns from its paired voxel scale via cross-attention, with a learned compatibility gate controlling the injection strength. The enhanced tokens are then aggregated into a spatially-aware affordance prompt through semantic-conditioned attention and propagated alongside per-point features to generate the final mask. Experiments on open-vocabulary affordance detection tasks show that VoxAfford achieves state-of-the-art performance with approximately an 8% improvement in mIoU, and real robot experiments confirm zero-shot transfer to novel objects.


[76] VISTA: Video Interaction Spatio-Temporal Analysis Benchmark cs.CVPDF

Alejandro Aparcedo, Akash Kumar, Aaryan Garg, Dalton Pham, Wen-Kai Chen

TL;DR: 本文介绍了VISTA,一个用于评估视觉语言模型在视频中时空理解能力的诊断性基准。该基准专注于开放集、多实体、多动作的交互理解,通过将视频分解为实体、动作和关系动态,提供了一个统一的评估框架。

Details

Motivation: 现有基准主要评估简单单动作视频、封闭属性集和受限实体类型,无法捕捉真实世界视频中自由形式、多实体间的多动作交互,且缺乏跨互补时空轴分析模型失败的系统性框架。

Result: 在VISTA基准上系统评估了11个最先进的视觉语言模型,并基于其分类法分解了总体性能,揭示了传统指标所掩盖的模型缺陷和显著的时空偏差。

Insight: 创新点在于提出了首个大规模、交互感知的时空理解诊断基准,通过可解释的实体-动作-关系分解和多轴诊断,为模型设计、预训练策略和评估协议提供了细致的指导框架。

Abstract: Existing benchmarks for Vision-Language Models (VLMs) primarily evaluate spatio-temporal understanding on simple single-action videos, closed attribute sets and restricted entity types, failing to capture the freeform, multi-action interactions between diverse entities which characterize real-world video understanding. Furthermore, the lack of a systematic framework for analyzing model failures across complementary spatio-temporal axes hinders comprehensive evaluation. To address these gaps, we introduce VISTA, a Video Interaction Spatio-Temporal Analysis benchmark designed for open-set, multi-entity and multi-action spatio-temporal understanding in VLMs. VISTA decomposes videos into interpretable entities, their associated actions, and relational dynamics, enabling multi-axis diagnostics and unified assessment of relational, spatial, and temporal understanding. Our benchmark integrates multiple datasets into a single interaction-aware taxonomy and comprises ~12K curated video-query pairs spanning diverse scenes and complexities. We systematically evaluate 11 state-of-the-art VLMs on VISTA, and break down aggregate performance across our taxonomy to reveal shortcomings and pronounced spatio-temporal biases obscured by traditional metrics. By providing detailed, taxonomy-driven diagnostics on a challenging dataset, VISTA offers a nuanced framework to guide advances in model design, pretraining strategies, and evaluation protocols. Overall, VISTA is the first, large-scale, interaction-aware diagnostic benchmark for spatio-temporal understanding in VLMs.


[77] Recall to Predict: Grounding Motion Forecasting in Interpretable Motion Bank cs.CVPDF

Abhishek Vivekanandan, Ahmed Abouelazm, J. Marius Zöllner

TL;DR: 本文提出了一种名为’Recall to Predict’的端到端可微分运动预测框架,通过构建一个结构化的’运动库’来提升预测的可解释性。该方法利用新颖的锚点检索层动态检索显式的运动先验,并使用DETR风格的解码器进行几何细化,从而避免了传统基于锚点方法中不透明的潜在查询问题,在保持多模态预测准确性的同时增强了模型的可解释性。

Details

Motivation: 解决运动预测中通常存在的可解释性与预测准确性之间的权衡问题。传统的基于锚点的架构依赖于不透明的潜在查询,容易导致潜在崩溃,或者采用简单的轨迹采样限制了多模态的多样性。

Result: 在Argoverse 2和Waymo Open Motion数据集上取得了具有竞争力的多模态预测准确率。

Insight: 核心创新在于提出了一个结构化的’运动库’作为可解释的运动先验空间,并通过锚点检索层进行动态检索,将预测过程严格建立在多样且可解释的运动基元之上。方法上结合了对比学习构建运动库、双层级门控交叉注意力机制、直通式Gumbel-Softmax估计器进行离散选择,以及WTA运动学高斯混合模型和潜在多样性惩罚等联合优化策略,实现了端到端的可微分训练,有效消除了传统’黑盒’潜在查询的问题。

Abstract: Motion forecasting often requires trading interpretability for predictive accuracy. Standard anchor-based architectures rely on opaque latent queries that are highly prone to latent collapse, or naive trajectory sampling that limits multi-modal diversity. We propose an end-to-end differentiable framework that grounds predictions in a comprehensive “motion bank”, a structured embedding space of physically realizable trajectories constructed via contrastive learning. Rather than regressing paths from a blank slate, our architecture dynamically retrieves explicit motion priors using a novel Anchor Retrieval Layer. This module adapts orthogonally initialized queries via a Dual-Level Gated Cross-Attention mechanism and executes discrete trajectory selection using a Straight-Through Gumbel-Softmax estimator to preserve continuous gradient flow. The retrieved semantically grounded anchors are then geometrically refined by a DETR-style decoder, optimized jointly with a Winner-Takes-All (WTA) kinematic Gaussian Mixture Model (GMM), a latent diversity penalty, and a soft-min weighted endpoint loss. By strictly conditioning the decoding phase on diverse, interpretable motion primitives, our approach eliminates the “black box” of standard latent queries while achieving competitive multi-modal accuracy on the Argoverse 2 and Waymo Open Motion datasets. Code is available at: https://github.com/abviv/recall2predict


[78] Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence cs.CVPDF

Panagiotis P. Filntisis, George Retsinas, Radek Daněček, Vanessa Sklyarova, Petros Maragos

TL;DR: MOCHI是一种无需配准训练数据的多视角3D人脸预测框架,通过伪线性逆运动学求解器确保拓扑一致性,并利用合成数据训练的2D关键点引导语义对齐,提出基于点图和法向的损失函数以提升训练稳定性与重建质量,结合测试时优化实现高精度重建。

Details

Motivation: 解决现有学习型多视角人脸重建方法(如ToFu和TEMPEH)依赖耗时手动配准数据作为训练监督的问题,旨在实现完全无需配准训练数据的自动化3D人脸捕获。

Result: 在重建精度和视觉质量上超越了传统劳动密集型配准流程,通过测试时优化在少量迭代后达到更高精度,具体基准未明确提及但暗示优于现有方法。

Insight: 创新点包括:无需配准训练数据的框架设计、伪线性逆运动学求解器确保拓扑一致性、合成数据训练的2D关键点用于语义对齐、提出点图和法向损失以改善训练稳定性,以及测试时优化结合前馈效率与迭代优化精度。

Abstract: Recent frameworks like ToFu and TEMPEH provide an automated alternative to classical registration pipelines by predicting 3D meshes in dense semantic correspondence directly from calibrated multi-view images. However, these learning-based methods rely on the slow, manual registration pipelines they aim to replace for their training supervision. We overcome this limitation with MOCHI (Multi-view Optimizable Correspondence of Heads from Images), a multi-view 3D face prediction framework trained without requiring registered training data. MOCHI eliminates the registration data dependency by enforcing topological consistency through a pseudo-linear inverse kinematic solver. Semantic alignment is guided by dense keypoints from a 2D landmark predictor trained exclusively on synthetic data. Our analysis further reveals that standard point-to-surface distances induce training instabilities and visual artifacts in registration-free settings. We propose pointmap- and normal-based losses instead, which provide smoother gradients and superior reconstruction fidelity. Finally, we introduce a test-time optimization scheme that refines network weights over a few dozen iterations. This approach bridges the gap between feed-forward efficiency and iterative optimization precision, allowing MOCHI to outperform traditional labor-intensive pipelines in both reconstruction accuracy and visual quality. Code and model are public at: https://filby89.github.io/mochi.


[79] SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion cs.CV | cs.LGPDF

Zhaoyang Li, Zhichao You, Tianrui Li

TL;DR: 本文提出SplAttN方法,通过可微分高斯软光栅化替代硬投影,构建密集连续的图像平面表示,以解决点云补全中跨模态熵崩溃问题,并在PCN、ShapeNet-55/34和KITTI基准上实现了SOTA性能。

Details

Motivation: 现有多模态点云补全方法中,硬投影导致稀疏支持,阻碍视觉先验传播,引发跨模态熵崩溃,限制了模态间有效连接。

Result: 在PCN和ShapeNet-55/34数据集上达到SOTA;在真实世界KITTI基准的对抗性评估中,SplAttN保持对视觉线索的鲁棒依赖,而基线方法退化为对视觉移除不敏感的单模态模板检索器。

Insight: 将投影重构为连续密度估计,避免稀疏支持崩溃,促进梯度流动,增强跨模态连接的可学习性;通过高斯软光栅化与注意力机制,有效桥接2D与3D表示。

Abstract: Although multi-modal learning has advanced point cloud completion, the theoretical mechanisms remain unclear. Recent works attribute success to the connection between modalities, yet we identify that standard hard projection severs this connection: projecting a sparse point cloud onto the image plane yields an extremely sparse support, which hinders visual prior propagation, a failure mode we term Cross-Modal Entropy Collapse. To address this practical limitation, we propose SplAttN, which replaces hard projection with Differentiable Gaussian Splatting to produce a dense, continuous image-plane representation. By reformulating projection as continuous density estimation, SplAttN avoids collapsed sparse support, facilitates gradient flow, and improves cross-modal connection learnability. Extensive experiments show that SplAttN achieves state-of-the-art performance on PCN and ShapeNet-55/34. Crucially, we utilize the real-world KITTI benchmark as a stress test for multi-modal reliance. Counter-factual evaluation reveals that while baselines degenerate into unimodal template retrievers insensitive to visual removal, SplAttN maintains a robust dependency on visual cues, validating that our method establishes an effective cross-modal connection. Code is available at https://github.com/zay002/SplAttN.


[80] LIE: LiDAR-only HD Map Construction with Intensity Enhancement via Online Knowledge Distillation cs.CV | cs.AI | cs.LGPDF

Kanak Mazumder, Fabian B. Flohr

TL;DR: 本文提出了一种名为LIE的纯激光雷达高精地图构建方法,通过在线知识蒸馏技术,利用强度图增强来解决激光雷达缺乏密集语义和纹理信息的问题。该方法在nuScenes数据集上超越了所有单模态方法,并显著优于基于相机的最先进模型。

Details

Motivation: 在线高精地图构建是自动驾驶的关键组成部分。现有基于多视角相机图像的方法成本较低但缺乏深度信息,而激光雷达能提供精确的3D测量但缺少密集语义线索。本文旨在开发一种仅使用激光雷达就能构建高质量语义地图的方法。

Result: 在nuScenes数据集上,该方法比最先进的基于相机模型的mIoU高出8.2%,超越了所有单模态方法。该方法在长距离、恶劣天气和光照条件下表现鲁棒,并且仅需10%的微调就能高效适应Argoverse2数据集,性能超过在完整数据集上训练的基于相机模型。

Insight: 核心创新点在于提出了一种在线知识蒸馏框架,其中教师分支融合学生激光雷达特征和对应的2D强度图块,为地图元素分割提供密集监督。这有效弥补了激光雷达模态的语义信息不足,实现了仅依赖激光雷达的高性能语义地图构建,并展示了出色的跨数据集适应能力。

Abstract: Online High-Definition (HD) map construction is a key component of autonomous driving. Recent methods rely on multi-view camera images for cost-effective HD map segmentation, but cameras lack depth information for accurate scene geometry. In contrast, LiDAR provides precise 3D measurements but lacks dense semantic cues. In this work, we propose LIE, LiDAR-only semantic map construction method that employ Knowledge Distillation (KD) to handle the lack of dense semantic and texture cues. Specifically, the teacher branch fuses student LiDAR features and the corresponding 2D intensity map tile to provide dense supervision for segmenting map elements using online distillation scheme. Experimental results show that our method outperforms all single-modality approaches, achieving 8.2% higher mIoU than the state-of-the-art camera-based model on nuScenes. LIE is robust over long ranges and under challenging weather and lighting, and efficiently adapts to Argoverse2 with only 10% fine-tuning, surpassing camera-based models trained on the full dataset. Source code will be available \href{https://iv.ee.hm.edu/lie/}{here}.


[81] AttnRouter: Per-Category Attention Routing for Training-Free Image Editing on MMDiT cs.CVPDF

Guandong Li, Mengxia Ye

TL;DR: 本文提出AttnRouter方法,用于在Qwen-Image-Edit-2511多模态扩散变换器上实现免训练的图片编辑。核心贡献包括:KVInject注意力操作、基于类别的注意力路由表,以及对编辑有效注意力子电路的定位分析。

Details

Motivation: 解决在多模态扩散变换器中,传统两阶段注意力操作方法(如MasaCtrl)因提示不匹配而失效的问题,实现更简单有效的免训练图像编辑。

Result: 在ImgEdit-Bench基准上,使用真实类别时路由表将CLIP-T+DINO-I综合评分提升6.4%;自动CLIP零样本分类器可达到98%的性能差距弥补,尽管类别准确率仅为55%。KVInject相比基线避免了31%的综合评分下降。

Insight: 创新点包括:1)KVInject单前向注意力操作简化编辑流程;2)按类别路由不同注意力操作以保持源图像结构;3)发现编辑有效的注意力子电路位于早期去噪步骤(S0-7),且alpha值在[0.3,0.5]区间最稳定。研究还揭示了UNet传统方法中的缩放操作在MMDiT中无效。

Abstract: We study training-free image editing on Qwen-Image-Edit-2511, a 60-block multi-modal diffusion transformer (MMDiT) that concatenates noise and source-image tokens within a single attention stream. We make three contributions. (i) We introduce KVInject, a single-forward attention manipulation that alpha-blends source-half key/value projections into the noise-half within a localized layer/step band. KVInject is simpler than the classical two-pass MasaCtrl recipe and avoids the prompt-mismatch failure mode that disables MasaCtrl on MMDiT (composite score drops 31% versus baseline). (ii) We show that no single attention operation dominates across edit types, motivating AttnRouter, a per-category routing table that dispatches edits to the operation that best preserves source structure for that type. With ground-truth categories the router improves the CLIP-T+DINO-I composite by 6.4% over the editing baseline; an automatic CLIP zero-shot classifier closes 98% of this gap despite only 55% category accuracy. (iii) Through layer-, step-, and alpha-band ablations we localize the editing-effective attention sub-circuit: K/V injection in early denoising steps (S0-7) recovers nearly all of the gain of full-step injection, while injection in early (L0-15) or late (L45-60) layer bands fails to drive editing entirely; alpha in [0.3, 0.5] is a stable sweet spot. We also report negative results that highlight what does not transfer from the UNet folklore: simple K/V rescaling never beats baseline and aggressive variants collapse generation entirely (composite 0.084). We release code, pre-computed routing tables, and a 100-sample stratified subset of ImgEdit-Bench used in all ablations.


[82] Research on Vision-Language Question Answering Models for Industrial Robots cs.CV | cs.AIPDF

Ping Li, Bartlomiej Brzozka

TL;DR: 本文提出了一种用于工业机器人视觉语言问答的分层跨模态融合模型,旨在解决现代制造业中常见的语义模糊、复杂环境布局和领域特定术语等挑战。该框架整合了先进的目标检测、多尺度视觉编码、句法解析和任务感知语义注意力,将视觉和语言信号统一到联合推理空间中。

Details

Motivation: 针对工业机器人视觉语言问答中语义模糊、环境复杂和领域术语带来的挑战,提升机器人在操作查询、指令步骤和异常检测等任务中的可靠性和解释性。

Result: 在IVQA和RIF基准测试上的验证实验表明,该方法在语义对齐、Top-1准确率以及对模糊或程序性任务查询的鲁棒性方面均有提升,优于现有VLQA基准方法。

Insight: 创新点在于分层跨模态融合架构,结合了多尺度视觉特征、句法解析和自适应融合机制,通过细粒度语义对齐和上下文驱动门控,增强了工业场景下的可靠推理能力。

Abstract: A hierarchical cross-modal fusion model is proposed for vision-language question answering (VLQA) in industrial robotics, targeting the challenges of semantic ambiguity, complex environmental layouts, and domain-specific terminology common in modern manufacturing. The framework integrates advanced object detection, multi-scale visual encoding, syntactic parsing, and task-aware semantic attention to unite vision and language signals into a joint reasoning space. Region-based deep networks extract visual features, weighted embeddings aggregate, and recurrent neural parsing encodes sentence structures. Through fine-grained semantic alignment driven by adaptive fusion and cross-attention mechanisms, the system can handle operational queries, instruction steps, and anomaly detection with higher reliability. Compared to the existing VLQA benchmarks, validation experiments conducted on the IVQA and RIF benchmarks indicate improvements in semantic alignment, Top-1 accuracy, and robustness to ambiguous or procedural task queries. Ablation studies further quantify the impact of each architectural module, confirming the necessity of multi-level feature integration and context-driven gating for dependable industrial deployment. The technical advancements reported here provide core methodologies to improve the interpretability and operational effectiveness of industrial robots faced with diverse human-robot interaction tasks.


[83] SF20K Competition 2025: Summary and findings cs.CVPDF

Ridouane Ghermi, Xi Wang, Vicky Kalogeiton, Ivan Laptev

TL;DR: 本文总结了首届SF20K竞赛(ICCV 2025 SLoMO Workshop附属竞赛)的结果与发现。该竞赛旨在推动超越短片段动作识别的故事级视频理解,基于业余短片语料库构建了一个开放式视频问答任务。竞赛吸引了22支团队提交286份方案,分为模型大小无限制的主赛道和参数限制在80亿以下的特殊赛道。获胜团队在主赛道和特殊赛道上的准确率分别为65.7%和48.7%,而人类表现上限为91.7%。分析表明,当前方法的瓶颈在于信息选择与推理结构,而非原始模型容量。

Details

Motivation: 解决现有视频理解研究多集中于短片段动作识别,缺乏对长视频(如短片)叙事级理解的问题,通过构建开放式视频问答任务,迫使模型依赖多模态理解而非对流行电影的机械记忆。

Result: 在SF20K-Test基准(95部电影,979个问答对)上,使用基于GPT-4.1-nano的LLM-QA-Eval自动评估。主赛道最佳准确率为65.7%,特殊赛道为48.7%,人类表现上限为91.7%。

Insight: 创新点包括:1)引入基于业余短片的故事级视频问答任务,减少数据偏差;2)发现叙事感知、镜头级处理优于均匀帧采样;3)设计良好的多阶段小模型流水线可匹配或超越大30倍以上的端到端模型;4)字幕质量是性能主导因素。客观分析认为,该研究强调了长视频QA的核心瓶颈是信息选择与推理结构优化,而非单纯扩大模型规模。

Abstract: This report presents the results and findings of the first edition of the Short-Films 20K (SF20K) Competition, held in conjunction with the SLoMO Workshop at ICCV 2025. The competition is designed to advance story-level video understanding beyond short-clip action recognition, introducing an open-ended video question-answering task built on a corpus of amateur short films. This setup ensures that models must rely on multimodal understanding rather than memorization of popular movies. Evaluation is conducted using the SF20K-Test benchmark (95 movies, 979 question-answer pairs) and scored via LLM-QA-Eval, an automated judge based on GPT-4.1-nano. The competition attracted 22 teams and 286 submissions across two tracks: a Main Track with unrestricted model size and a Special Track limited to models under 8 billion parameters. The winning team achieved 65.7% accuracy on the Main Track and 48.7% on the Special Track, against a human performance ceiling of 91.7%. Our analysis reveals several key findings: narrative-aware, shot-level processing consistently outperforms uniform frame sampling; well-designed multi-stage pipelines using smaller models can match or exceed end-to-end inference with models over 30x larger; and subtitle quality is a dominant factor in performance. These results highlight that the primary bottleneck in long-form video QA lies in information selection and reasoning structure rather than raw model capacity, and that a substantial gap remains between current methods and human-level narrative comprehension.


[84] Towards Visual Query Localization in the 3D World cs.CVPDF

Liang Peng, Bohan Tan, Zhipeng Zhang, Haobo Li, Yifan Jiao

TL;DR: 本文首次提出并构建了3D视觉查询定位(3DVQL)基准数据集,包含多模态数据(点云、RGB图像、深度图像)和高质量标注,用于解决3D空间中的视觉查询定位问题。作者还实现了一系列基线模型,并提出了名为LaF的提升-注意力融合算法,显著超越了现有基线。

Details

Motivation: 当前视觉查询定位研究主要集中在2D视频领域,而3D空间中的对应问题尚未得到充分探索,本文旨在填补这一空白。

Result: 在提出的3DVQL基准上,现有方法在不同融合模块间表现出显著的性能差异;提出的LaF算法显著优于现有基线模型。

Insight: 创新点包括构建首个3D多模态视觉查询定位基准数据集,并提出了一种有效的多模态融合算法LaF,为3D VQL研究提供了基础平台和性能提升方案。

Abstract: Visual query localization (VQL) aims to predict the spatio-temporal response of the most recent occurrence in a sequence given a query. Currently, most research focuses on visual query localization in 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt to address visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. Each sequence in 3DVQL is provided with multiple modalities, including point clouds, RGB images, and depth images, to support flexible research. To ensure high-quality annotations, each sequence is manually annotated with multiple rounds of verification and refinement. To the best of our knowledge, 3DVQL is the first benchmark for 3D multimodal visual query localization. To facilitate comparison in subsequent research, we implement a series of representative 3D multimodal VQL baselines using point clouds and RGB images. The experimental results show that existing methods exhibit significant performance variations across different fusion modules. To encourage future research, we propose a lift-and-attention fusion algorithm named LaF, which significantly outperforms existing baseline models. Our benchmark and model will be publicly released at https://github.com/wuhengliangliang/3DVQL.


[85] OmniEncoder: See, Hear, and Feel Continuous Motion Like Humans With One Encoder cs.CVPDF

Detao Bai, Shimin Yao, Weixuan Chen, Chengen Lai, Yuanming Li

TL;DR: 本文提出了Omni-Encoder,一种统一的Transformer骨干网络,旨在以对称的25 fps速率在共享的潜在空间中共同嵌入视觉和音频信号,以解决现有多模态大语言模型中视频-音频编码不匹配的问题。

Details

Motivation: 当前的多模态大语言模型通常采用‘视频粗粒度、音频细粒度’的模态特定编码器设计,导致跨模态交互贫乏且无法捕捉细粒度的视觉运动,与人类整体感知方式存在差距。

Result: 在相同的LLM解码器输入token预算下,相比模态特定的基线模型Qwen2.5-Omni,Omni-Encoder在视觉连续理解任务(如手语识别和细粒度体育动作分析)上取得显著提升,同时在AVQA和说话人识别与定位等成熟视听基准上保持竞争力。

Insight: 创新点包括Omni-Encoder Token Template、Omni-RoPE和Temporal Window Shifting三项核心技术,有效解决了模态解耦和计算效率的双重挑战,为构建更接近人类集成感知的统一编码器提供了方向。

Abstract: Recent advances in omni-modal large language models have enabled remarkable progress in joint vision-audio understanding. However, prevailing architectures rely on modality-specific encoders with a \emph{video-coarse, audio-dense} design – sampling visual frames at 1–2 fps while processing audio waveforms at 25 fps – resulting in systems that perceive video \emph{frame by frame, modality by modality} rather than holistically as humans do. Such a discrepancy leaves models with impoverished cross-modal interaction during encoding and an inability to capture fine-grained visual motion. To bridge this gap, we present \textbf{Omni-Encoder, a unified Transformer backbone designed to co-embed visual and audio signals at a symmetrical 25 fps} within a shared latent space. This architecture leverages three core innovations – the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting – to effectively reconcile the dual challenges of modality disentanglement and computational efficiency. Experiments demonstrate that, compared to the modality-specific baseline Qwen2.5-Omni under the same input token budget to the LLM decoder, Omni-Encoder delivers substantial gains on visual continuous understanding tasks – such as sign language recognition and fine-grained sports action analysis – while maintaining competitive performance on established audio-visual benchmarks such as AVQA and Speaker Identification and Localization. These results suggest that unified omnivorous encoding offers a promising direction for building omni-modal models that more closely reflect the integrated nature of human perception.


[86] Two-Pass Zero-Shot Temporal-Spatial Grounding of Rare Traffic Events in Surveillance Video cs.CVPDF

Jiantang Huang

TL;DR: 本文提出了一种无需微调的两阶段零样本时空定位方法,用于在监控视频中定位罕见的交通事故。该方法通过粗粒度到细粒度的两阶段处理流程,结合Qwen3-VL-Plus和Gemini 3.1 Flash-Lite两个冻结的视觉语言模型,分别负责时空定位和碰撞类型分类,实现了对事故时间、空间位置和类型的联合精确输出。

Details

Motivation: 解决在真实监控视频中定位交通事故这一罕见事件问题,由于标注数据稀缺且训练成本高,需要一种无需微调、能联合定位时间、空间和碰撞类型的零样本方法。

Result: 在ACCIDENT@CVPR 2026基准测试(包含2,027个真实监控视频)上,该方法达到了ACC^S = 0.539(95%置信区间[0.525, 0.553]),相比基准论文的最佳基线oracle(0.412)提升了0.127,优于最强的单视觉语言模型基线(Molmo-7B,0.396)和朴素基线(0.289)。

Insight: 创新点包括:1)粗粒度到细粒度的两阶段分解策略,通过两次视频处理(1 fps全视频粗定位和5 fps局部窗口精炼)结合确定性置信门控机制;2)专家角色分配,利用不同视觉语言模型的专长进行分工协作(定位与分类)。该方法在零样本设定下有效结合了现有模型的优势,以较低成本(约20美元)实现了高性能的罕见事件定位。

Abstract: Grounding traffic accidents in real CCTV footage is a rare-event problem where training on labeled accident video is often prohibited, yet accurate joint localization in time, space, and collision type is required. We present a no-fine-tuning pipeline that elicits this joint output from frozen vision-language models through two ideas. First, a coarse-to-fine two-pass decomposition: a full-video pass at 1 fps produces a coarse (t, x, y, c) tuple, then a second pass at 5 fps within a +/- 3 s window refines time and location, with two deterministic confidence gates that revert to the coarse estimate on boundary hedges or edge-clamped coordinates. Second, a specialist role assignment: Qwen3-VL-Plus handles grounding, Gemini 3.1 Flash-Lite handles typing on a centered video clip. On the ACCIDENT@CVPR 2026 benchmark (2,027 real CCTV videos) we reach ACC^S = 0.539 (95% CI [0.525, 0.553]): +0.127 over the benchmark paper’s best-of-baselines oracle (0.412), +0.143 over the strongest single-VLM baseline (Molmo-7B, 0.396), and +0.250 over the naive baseline (0.289). The VLM path uses up to three API calls per video (17% fall back to physics on API failures); the full run costs ~$20.


[87] VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation cs.CVPDF

Guotao Liang, Zhangcheng Wang, Chuang Wang, Juncheng Hu, Haitao Zhou

TL;DR: 本文提出了VAnim,首个基于LLM的开放域文本到SVG动画生成框架。它将动画重新定义为对持久SVG DOM树的稀疏状态更新(SSU),而非序列生成,从而在保持SVG结构的同时大幅压缩序列长度。通过识别优先的运动规划机制和基于组相对策略优化(GRPO)的渲染感知强化学习,实现了对文本指令的精确控制和高保真视觉反馈。

Details

Motivation: 解决现有方法在生成可缩放矢量图形(SVG)动画时面临的挑战:基于优化的方法常破坏拓扑一致性,而通用LLM依赖刚性CSS/SMIL变换,无法建模几何级别的非刚性形变。

Result: 在提出的首个矢量动画基准SVGAnim-134k上进行广泛实验,VAnim在语义对齐和结构有效性方面显著优于现有最先进基线,并通过附加指标进一步验证了运动质量和身份保持。

Insight: 创新点包括:将动画重新概念化为SVG DOM树的稀疏状态更新(SSU),实现结构保持和序列压缩;提出识别优先的运动规划机制,将文本指令锚定于显式视觉实体;采用基于GRPO的渲染感知强化学习,通过混合奖励对齐离散代码更新与视觉反馈。

Abstract: Scalable Vector Graphics (SVG) animation generation is pivotal for professional design due to their structural editability and resolution independence. However, this task remains challenging as it requires bridging discrete code representations with continuous visual dynamics. Existing optimization-based methods often destroy topological consistency, while general-purpose LLMs rely on rigid CSS/SMIL transformations, failing to model geometry-level non-rigid deformations. To address these limitations, we present VAnim, the first LLM-based framework for open-domain text-to-SVG animation. We reconceptualize animation not as sequence generation, but as Sparse State Updates (SSU) on a persistent SVG DOM tree. This paradigm compresses sequence length by over 9.8x while preserving the SVG DOM structure and non-participating elements by construction. To enable precise control, we propose an Identification-First Motion Planning mechanism that grounds textual instructions in explicit visual entities. Furthermore, to overcome the non-differentiable nature of SVG rendering, we employ Rendering-Aware Reinforcement Learning via Group Relative Policy Optimization (GRPO). By leveraging a hybrid reward from a state-of-the-art video perception encoder, we align discrete code updates with high-fidelity visual feedback. We also introduce SVGAnim-134k, the first benchmark for vector animation. Extensive experiments demonstrate that VAnim significantly outperforms state-of-the-art baselines in semantic alignment and structural validity, with additional appendix metrics further validating motion quality and identity preservation.


[88] MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models cs.CV | cs.CLPDF

Yin Zhang, Jiaxuan Zhao, Zonghan Wu, Zengxiang Li, Junfeng Fang

TL;DR: MIRL提出了一种基于互信息的强化学习框架,用于解决视觉语言模型在复杂推理任务中的视觉感知错误和幻觉问题。该方法通过互信息预筛选轨迹,智能分配采样预算,并通过解耦训练提供独立的视觉感知优化奖励,从而提升答案准确性。

Details

Motivation: 现有基于可验证奖励的强化学习方法存在两个关键局限:一是采样预算浪费在因早期视觉描述错误而注定失败的轨迹上;二是稀疏奖励无法区分失败源于视觉感知还是推理阶段。MIRL旨在通过互信息引导解决这些问题。

Result: 在六个视觉语言推理基准测试中,MIRL实现了70.22%的平均准确率,仅使用10个预采样样本和top-6选择(减少25%完整轨迹)就超越了采样16个完整轨迹的性能。

Insight: 创新点包括利用视觉输入与生成描述之间的互信息作为廉价预筛选信号,实现智能预算分配;通过解耦训练提供独立的基于互信息的奖励,解决奖励稀疏性问题,从而分别优化视觉感知和推理阶段。

Abstract: Vision-Language Models (VLMs) frequently suffer from visual perception errors and hallucinations that compromise answer accuracy in complex reasoning tasks. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising solution by optimizing policies using answer correctness signals. Despite their effectiveness, prevailing RLVR methods face two critical limitations. First, much of the sampling budget is wasted on trajectories doomed to fail due to early visual description errors. Second, sparse rewards cannot distinguish whether failures stem from visual perception or reasoning stages. We introduce MIRL, a decoupled framework that addresses both limitations by leveraging mutual information (MI) between generated descriptions and visual inputs as a cheap pre-screening signal. This enables intelligent budget allocation toward high-potential trajectories via forking, while decoupled training provides independent MI-based rewards for visual perception optimization, resolving reward blindness. Experiments on six vision-language reasoning benchmarks demonstrate that MIRL achieves 70.22% average accuracy and successfully surpasses the performance of sampling 16 complete trajectories using only 10 pre-samples with top-6 selection (25% fewer complete trajectories). Our code is available at: https://anonymous.4open.science/r/mirl-main/.


[89] Unifying Deep Stochastic Processes for Image Enhancement cs.CVPDF

Wojciech Kozłowski, Radosław Kuczbański, Kamil Adamczewski, Karol Szczypkowski, Maciej Zięba

TL;DR: 本文提出了一种统一视角,将深度随机过程在图像增强中的应用归纳为三类连续时间过程:无条件扩散模型、Ornstein-Uhlenbeck过程和扩散桥。研究表明,这些方法都源于一个共同的随机微分方程框架,其差异主要体现在漂移项、扩散项、终端分布和边界条件上。基于此统一框架,作者进行了受控实证研究,并发布了模块化PyTorch库ItoVision以促进快速原型设计和公平比较。

Details

Motivation: 解决当前各种基于深度随机过程的图像增强方法之间关系不明确的问题,旨在提供一个统一的理论框架来理解和比较这些方法。

Result: 在多个图像增强任务上使用相同架构和训练协议进行的受控实证研究表明,没有一种方法始终占优;研究识别并解耦了对性能影响最大的具体设计选择。

Insight: 创新点在于提供了一个统一的随机微分方程框架,将看似不同的方法系统分类,并揭示了影响性能的关键设计因素(如漂移/扩散项、边界条件),而非调度器或采样器;同时发布的ItoVision库为领域提供了实用的工具。

Abstract: Deep stochastic processes have recently become a central paradigm for image enhancement, with many methods explicitly conditioning the stochastic trajectory on the degraded input. However, the relationship between these conditional processes and standard diffusion models remains unclear. In this work, we introduce a unified perspective on stochastic image enhancement by classifying recent methods into three families of continuous-time processes: unconditional diffusion models, Ornstein-Uhlenbeck (OU) processes, and diffusion bridges. We show that all of these approaches arise from a common stochastic differential equation (SDE) formulation. This framework makes explicit that seemingly disparate methods differ primarily in their drift and diffusion terms, terminal distributions, and boundary conditions, while schedulers and samplers constitute orthogonal design choices. Leveraging this unification, we conduct a controlled empirical study across multiple image enhancement tasks using identical architectures and training protocols. Our results reveal no consistently dominant method; instead, we identify and disentangle the specific design choices that most strongly influence performance. Finally, we release ItoVision, a modular PyTorch library that implements the unified framework and enables rapid prototyping and fair comparison of stochastic image enhancement methods.


[90] Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection cs.CVPDF

Tianxiao Li, Zhenglin Huang, Haiquan Wen, Yiwei He, Xinze Li

TL;DR: 本文提出了Omni-Fake,一个用于社交媒体场景下多模态深度伪造检测的统一基准,包含大规模高质量数据集Omni-Fake-Set和用于评估泛化能力的分布外基准Omni-Fake-OOD。同时,论文还提出了Omni-Fake-R1,一种基于强化学习的多模态检测器,能够自适应融合视听线索并输出结构化决策、定位和自然语言解释。

Details

Motivation: 现有深度伪造检测基准受限于单模态范围、简化的操纵或不现实的分布,难以评估模型在真实世界中的鲁棒性。为应对多模态深度伪造在社交媒体上的泛滥及其对真实性、信息完整性和数字取证的威胁,需要构建更全面的评估体系。

Result: 在Omni-Fake基准上的大量实验表明,所提出的Omni-Fake-R1检测器在检测精度、跨模态泛化能力和可解释性方面均显著优于现有最先进的基线方法。

Insight: 创新点在于构建了一个统一的多模态(图像、音频、视频、音视频说话头)深度伪造检测基准,并支持联合检测-定位-解释协议;同时提出了一种基于强化学习的自适应多模态融合检测器,能够输出结构化决策和自然语言解释,提升了模型的可解释性和泛化能力。

Abstract: Multimodal deepfakes are proliferating on social media and threaten authenticity, information integrity, and digital forensics. Existing benchmarks are constrained by their single-modality scope, simplified manipulations, or unrealistic distributions, which limit their ability to assess real-world robustness. To address these limitations, we present Omni-Fake, a unified omni-dataset for comprehensive multimodal deepfake detection in social-media settings. It comprises Omni-Fake-Set, a large-scale, high-quality dataset with 1M+ samples, and Omni-Fake-OOD, an out-of-distribution benchmark with 200k+ samples intentionally excluded from training to evaluate generalization. Omni-Fake spans four modalities (image, audio, video, and audio-video talking head) and supports a joint detection-localization-explanation protocol. On top of Omni-Fake, we further propose Omni-Fake-R1, a reinforcement-learning-driven multimodal detector that adaptively integrates visual and auditory cues and outputs structured decisions, localization, and natural-language explanations. Extensive experiments show significant gains in detection accuracy, cross-modal generalization, and explainability over state-of-the-art baselines. Project page: https://tianxiao1201.github.io/omni-fake-project-page/


[91] Act2See: Emergent Active Visual Perception for Video Reasoning cs.CVPDF

Martin Q. Ma, Yuxiao Qu, Aditya Agrawal, Willis Guo, Paul Pu Liang

TL;DR: Act2See是一个新颖的框架,旨在赋予视觉语言模型(VLMs)在视频推理任务中主动的视觉感知能力。它通过监督微调(SFT)在一个由前沿VLM生成的高质量推理轨迹数据集上训练,使模型能够在文本思维链(CoT)中主动地交错调用视频帧(检索现有帧或生成新帧),从而在推理时动态地决定何时搜索或合成必要的视觉证据。

Details

Motivation: 解决现有VLMs在视频推理中通常依赖静态初始帧,无法随着推理过程演进而整合关键动态信息的问题,以及现有增强CoT的方法存在CoT质量欠佳、缺乏为假设或反事实场景合成视觉信息的关键能力。

Result: 在VideoEspresso和ViTIB等具有挑战性的基准测试中取得了新的最先进(SOTA)结果,并在Video-MME、EgoNormia和VCR-Bench上优于可比或更大的模型。

Insight: 核心创新在于通过训练使模型在推理过程中主动决定何时需要获取(检索或生成)新的视觉证据,从而涌现出主动视觉感知能力。这突破了传统被动处理固定帧序列的模式,实现了更动态、更贴合推理需求的视频理解,其高质量、经过人工验证的SFT数据集构建方法也值得借鉴。

Abstract: Vision-Language Models (VLMs) typically rely on static initial frames for video reasoning, restricting their ability to incorporate essential dynamic information as the reasoning process evolves. Existing methods that augment Chain-of-Thought (CoT) with additional frame information often exhibit suboptimal CoT quality and lack the crucial ability to synthesize visual information for hypothetical or counterfactual scenarios. We introduce Act-to-See (Act2See), a novel framework that enables active visual perception by empowering VLMs to actively interleave video frames within text CoTs. Act2See is developed via Supervised Fine-Tuning (SFT) on a high-quality dataset of reasoning traces generated by a frontier VLM. These traces integrate active calls to either retrieve existing frames or generate new ones, and are rigorously verified against human-annotated CoTs to ensure quality. This approach cultivates an emergent capability: at inference time, the model actively determines when to search for or synthesize the necessary visual evidence. Act2See establishes new state-of-the-art results on challenging benchmarks, including VideoEspresso and ViTIB, and outperforms comparable or larger models on Video-MME, EgoNormia, and VCR-Bench, demonstrating an advancement in enabling VLMs with active visual perception for video reasoning.


[92] TRIMMER: A New Paradigm for Video Summarization through Self-Supervised Reinforcement Learning cs.CV | cs.AIPDF

Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

TL;DR: 本文提出了一种名为TRIMMER的新型自监督强化学习框架,用于视频摘要任务。该框架通过两个阶段运作:首先通过自监督学习学习鲁棒的视频表示,然后通过信息论奖励函数引导的强化学习进行时空决策。该方法引入了基于熵的度量来捕获高阶时间动态和语义多样性,并通过直接计算选定帧索引的奖励来提高计算效率。

Details

Motivation: 解决现有视频摘要方法依赖昂贵人工标注、跨领域泛化能力差、计算成本高,以及无监督和弱监督方法在捕获长程时间依赖和语义结构方面性能不足的问题。

Result: 在标准基准测试上的广泛实验表明,TRIMMER在无监督和自监督方法中达到了最先进的性能,同时与领先的监督方法保持竞争力。

Insight: 创新点在于将自监督表示学习与信息论引导的强化学习相结合,并使用基于熵的奖励函数来捕获复杂的时间动态,而非传统的基于相似性的目标。这为可扩展和可泛化的视频摘要提供了一种新范式。

Abstract: The rapid growth of video content across domains such as surveillance, education, and social media has made efficient content understanding increasingly critical. Video summarization addresses this challenge by generating concise yet semantically meaningful representations, but existing approaches often rely on expensive manual annotations, struggle to generalize across domains, and incur significant computational costs due to complex architectures. Moreover, unsupervised and weakly supervised methods typically underperform compared to supervised counterparts in capturing long-range temporal dependencies and semantic structure. In this work, we propose TRIMMER (Temporal Relative Information Maximization for Multi-objective Efficient Reinforcement), a novel self-supervised reinforcement learning framework for video summarization. TRIMMER operates in two stages: it first learns robust representations via self-supervised learning and then performs spatio-temporal decision making through reinforcement learning guided by information-theoretic reward functions. Unlike prior approaches that rely on similarity-based objectives, our method introduces entropy-based metrics to capture higher-order temporal dynamics and semantic diversity, while computing rewards directly over selected frame indices to improve computational efficiency. Extensive experiments on standard benchmarks demonstrate that TRIMMER achieves state-of-the-art performance among unsupervised and self-supervised methods, while remaining competitive with leading supervised approaches, highlighting its effectiveness for scalable and generalizable video summarization.


[93] Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models cs.CVPDF

Martin Q. Ma, Willis Guo, Aditya Agrawal, Ankit Gupta, Paul Pu Liang

TL;DR: 本文提出了一种名为Video Active Perception (VAP) 的训练无关方法,用于提升大型视觉语言模型在长视频问答任务中的性能。该方法受主动感知理论启发,将关键帧选择视为主动数据获取过程,并利用轻量级文本条件视频生成模型来表征先验世界知识。

Details

Motivation: 解决大型视觉语言模型在处理长视频问答时,因标准均匀采样方法计算成本高且性能可能达到瓶颈,而面临的有效且高效选择关键帧的挑战。

Result: 在EgoSchema、NExT-QA、ActivityNet-QA、IntentQA和CLEVRER等长视频或推理视频问答数据集上,VAP实现了零样本状态的最先进结果,相比GPT-4o、Gemini 1.5 Pro和LLaVA-OV等标准方法,每个问题所需帧数效率提升了高达5.6倍。

Insight: 核心创新点是将主动感知理论引入视频理解,将关键帧选择建模为基于模型预期差异的数据获取过程,并利用视频生成模型作为先验知识源来指导选择。这提供了一种无需训练即可显著提升长视频理解效率和推理能力的新范式。

Abstract: Large vision-language models (VLMs) have advanced multimodal tasks such as video question answering (QA). However, VLMs face the challenge of selecting frames effectively and efficiently, as standard uniform sampling is expensive and performance may plateau. Inspired by active perception theory, which posits that models gain information by acquiring data that differs from their expectations, we introduce Video Active Perception (VAP), a training-free method to enhance long-form video QA using VLMs. Our approach treats keyframe selection as data acquisition in active perception and leverages a lightweight text-conditioned video generation model to represent prior world knowledge. Empirically, VAP achieves state-of-the-art zero-shot results on long-form or reasoning video QA datasets such as EgoSchema, NExT-QA, ActivityNet-QA, IntentQA, and CLEVRER, achieving an increase of up to 5.6 x frame efficiency by frames per question over standard GPT-4o, Gemini 1.5 Pro, and LLaVA-OV. Moreover, VAP shows stronger reasoning abilities than previous methods and effectively selects keyframes relevant to questions. These findings highlight the potential of leveraging active perception to improve the frame effectiveness and efficiency of long-form video QA.


[94] IMPACT-HOI: Supervisory Control for Onset-Anchored Partial HOI Event Construction cs.CV | cs.AI | cs.ROPDF

Haoshen Zhang, Di Wen, Kunyu Peng, David Schneider, Zeyun Zhong

TL;DR: IMPACT-HOI是一个混合主动框架,用于通过构建人类-物体交互(HOI)的结构化事件图来标注自我中心视角的程序性视频,旨在为从人类演示中学习机器人操作提供高质量的结构化监督。

Details

Motivation: 动机是为从人类演示中学习机器人操作提供高质量的结构化监督,解决HOI事件标注任务中的效率与准确性挑战。

Result: 用户研究显示,在9名参与者中,手动标注动作减少了13.5%,事件匹配率达到46.67%,且在所研究协议下零确认字段违规。

Insight: 创新点包括将任务框架化为部分指定、起始锚定事件状态的增量解析,基于信任校准的控制器选择标注策略,以及利用原子回滚的风险有界执行协议以确保人类确认决策的完整性。

Abstract: We present IMPACT-HOI, a mixed-initiative framework for annotating egocentric procedural video by constructing structured event graphs for Human-Object Interactions (HOI), motivated by the need for high-quality structured supervision for learning robot manipulation from human demonstration. IMPACT-HOI frames this task as the incremental resolution of a partially specified, onset-anchored event state. A trust-calibrated controller selects among direct queries, human-confirmed suggestions, and conservative completions based on empirical annotator behavior and evidence quality. A risk-bounded execution protocol, utilizing atomic rollback, ensures that human-confirmed decisions are preserved against conflicting automated updates. A user study with 9 participants shows a 13.5% reduction in manual annotation actions, a 46.67% event match rate, and zero confirmed-field violations under the studied protocol. The code will be made publicly available at https://github.com/541741106/IMPACT_HOI.


[95] Deep neural networks with Fisher vector encoding for medical image classification cs.CVPDF

Lucas O. Lyra, Antonio E. Fabris, Joao B. Florindo

TL;DR: 本文提出了一种将Fisher Vector(FV)无序编码方法集成到CNN+ViT混合模型中的新架构,旨在提升医学图像分类性能,特别是在数据有限的情况下。该方法通过高斯混合模型(GMM)估计图像特征,并引入一种技术以控制GMM估计成本随数据集规模增长的问题。在MedMNIST(v2)、Clean-CC-CCII和ISIC2018等多个医学图像数据集上进行了验证,取得了优于基准或具有竞争力的结果。

Details

Motivation: 解决在数据有限时CNN的分类性能受限问题,以及CNN的局部偏置问题;同时,现有混合CNN+ViT模型与更精细特征表示(如无序编码)的结合尚未充分探索,本文旨在填补这一空白,开发适用于大小数据集的鲁棒模型。

Result: 在MedMNIST(v2)所有数据集上超越了基准结果,在Clean-CC-CCII和ISIC2018上获得了与文献相当的竞争性结果,表明该方法在多种数据规模和模态下均有效。

Insight: 创新点在于将Fisher Vector无序编码与CNN+ViT混合架构结合,以增强特征表示;同时提出了一种控制GMM估计计算成本增长的方法,使FV能更高效地应用于大规模数据集。从客观角度看,这种集成策略为小样本学习和模型鲁棒性提供了新思路。

Abstract: Orderless encoding methods have shown to improve Convolutional Neural Networks (CNNs) for image classification in the context of limited availability of data. Additionally, hybrid CNN + Vision Transformers (ViT) models have been recently proposed to address CNN locality bias issues. These models outperformed CNN-only approaches. Despite that, the integration of such hybrid models with more elaborated feature representation can be highly beneficial and remains large unexplored in the literature. In this context, we propose the introduction of an orderless encoding method, Fisher Vectors, to hybrid CNN + ViT architectures, aiming at achieving a model suitable for both small and large datasets. Such enconding method relies on estimating a Gaussian Mixture Model (GMM) on image features. In large datasets, computational costs of the GMM estimation is a limiting factor for the application of Fisher Vectors. Thus, we propose a method to limit the growth of GMM estimation costs as we increase the size of the dataset. We explore the feasibility of our method in the context of medical image classification by appling it to MedMNIST (v2), Clean-CC-CCII and ISIC2018. This collection of datasets contains a wide variety of data scales and modalities. We outperform benchmark results in all MedMNIST (v2) datasets and obtain literature-competitive results in Clean-CC-CCII and ISIC2018.


[96] IMPACT-Scribe: Interactive Temporal Action Segmentation with Boundary Scribbles and Query Planning cs.CV | cs.AIPDF

Qian Yin, Di Wen, Kunyu Peng, David Schneider, Zeyun Zhong

TL;DR: IMPACT-Scribe是一个用于视频时序动作分割的交互式标注框架,旨在通过结合不确定性感知的边界涂鸦监督、局部提议建模、成本感知查询规划、结构化传播和校正驱动适应,形成一个闭环系统,以提高密集标注的效率和质量。

Details

Motivation: 解决程序性活动视频的密集时序标注任务中,现有工具反应式、孤立处理校正、无法有效利用标注者不确定性和模型可靠性信息,导致标注劳动密集型的问题。

Result: 实验和用户研究表明,该闭环设计在单位努力下提升了标注质量,增强了边界准确性,并促进了更好的人机交互。

Insight: 创新点在于将每次校正视为改进未来人机协作的机会,通过整合多种技术(如不确定性感知的边界涂鸦和成本感知查询规划)实现校正驱动的自适应,从而系统性地提升标注过程的效率和交互性。

Abstract: Dense temporal annotation of procedural activity videos is vital for action understanding and embodied intelligence but remains labor-intensive due to reactive tools. Each correction is treated as an isolated edit, limiting reuse of information on annotator uncertainty and model reliability. We introduce IMPACT-Scribe, a correction-driven framework for dense labeling that uses each correction to improve future human-machine collaboration. IMPACT-Scribe combines uncertainty-aware boundary scribble supervision, local proposal modeling, cost-aware query planning, structured propagation, and correction-driven adaptation. Experiments and a human study show that this closed-loop design improves labeling quality per effort, enhances boundary accuracy, and fosters better human-machine interaction over time. The code will be made publicly available at https://github.com/BanzQians/IMPACT_AS.


[97] TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation cs.CVPDF

Yiyao Wang, Sixian Zhang, Keming Zhang, Xinhang Song, Songjie Du

TL;DR: 本文提出TrajRAG,一个检索增强生成框架,用于零样本目标导航。它通过增量积累过去的导航轨迹经验,并利用一种新颖的拓扑-极坐标轨迹表示来结构化这些经验,从而在导航时检索相关的几何-语义经验来增强大模型的推理能力,实现终身经验积累。

Details

Motivation: 现有零样本目标导航方法依赖从互联网文本中提取的常识知识,缺乏具身的3D经验,且导航过程中的观察通常被丢弃,无法积累终身经验。

Result: 在MP3D、HM3D-v1和HM3D-v2数据集上的实验表明,TrajRAG能有效检索相关几何-语义经验,并提升了零样本目标导航的性能。

Insight: 核心创新点在于提出了一个终身经验积累框架,以及用于高效编码和检索经验的拓扑-极坐标轨迹表示和分层分块结构,实现了从原始观察到结构化知识的转化和粗到细的检索。

Abstract: Existing zero-shot Object Goal Navigation (ObjectNav) methods often exploit commonsense knowledge from large language or vision-language models to guide navigation. However, such knowledge arises from internet-scale text rather than embodied 3D experience, and episodic observations collected during navigation are typically discarded, preventing the accumulation of lifelong experience. To this end, we propose Trajectory RAG (TrajRAG), a retrieval-augmented generation framework that enhances large-model reasoning by retrieving geometric-semantic experiences. TrajRAG incrementally accumulates episodic observations from past navigation episodes. To structure these observations, we propose a topological-polar (topo-polar) trajectory representation that compactly encodes spatial layouts and semantic contexts, effectively removing redundancies in raw episodic observations. A hierarchical chunking structure further organizes similar topo-polar trajectories into unified summaries, enabling coarse-to-fine retrieval. During navigation, candidate frontiers generate multiple trajectory hypotheses that query TrajRAG for similar past trajectories, guiding large-model reasoning for waypoint selection. New experiences are continually consolidated into TrajRAG, enabling the accumulation of lifelong navigation experience. Experiments on MP3D, HM3D-v1, and HM3D-v2 show that TrajRAG effectively retrieves relevant geometric-semantic experiences and improves zero-shot ObjectNav performance.


[98] Linear-Time Global Visual Modeling without Explicit Attention cs.CVPDF

Ruize He, Dongchen Han, Gao Huang

TL;DR: 本文提出了一种新颖的视角,将注意力机制重新解释为一种具有动态预测参数的多层感知机(MLP),并基于此探索了完全通过动态参数化实现Transformer级别的全局序列建模的可能性,旨在以线性复杂度替代显式的二次复杂度注意力计算。

Details

Motivation: 现有研究普遍将Transformer的全局序列建模能力归因于显式计算注意力权重,该过程具有固有的二次计算复杂度。本文旨在探索能否在不使用显式注意力的情况下,仅通过动态参数化实现同等水平的全局建模,并保持线性复杂度。

Result: 在视觉模型上的大量实证研究表明,动态参数化确实可以作为一种高效、线性复杂度的替代方案,有效取代显式注意力机制。

Insight: 核心创新点在于将注意力机制重新数学框架化为动态参数化的MLP,从而将全局建模能力解释为动态生成的参数对全局上下文的隐式压缩表示,而非显式的token聚合。这为高效的序列建模开辟了新途径。

Abstract: Existing research largely attributes the global sequence modeling capability of Transformers to the explicit computation of attention weights, a process that inherently incurs quadratic computational complexity. In this work, we offer a novel perspective: we demonstrate that attention can be mathematically reframed as a Multi-Layer Perceptron (MLP) equipped with dynamically predicted parameters. Through this lens, we explain attention’s global modeling power not as explicit token-wise aggregation, but as an implicit process where dynamically generated parameters act as a compressed representation of the global context. Inspired by this insight, we investigate a fundamental question: can we achieve Transformer-level sequence global modeling entirely through dynamic parameterization while maintaining linear complexity, effectively replacing explicit attention? To explore this, we design various dynamic parameter prediction strategies and integrate them into standard network layers. Extensive empirical studies on vision models demonstrate that dynamic parameterization can indeed serve as a highly effective, linear-complexity alternative to explicit attention, opening new pathways for efficient sequence modeling. Code is available at https://github.com/LeapLabTHU/WeightFormer.


[99] SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 25+ Sign Languages cs.CV | cs.AI | cs.CLPDF

Sen Fang, Hongbin Zhong, Yanxin Zhang, Dimitris N. Metaxas

TL;DR: 本文介绍了SignVerse-2M,一个包含约200万个视频片段、覆盖25种以上手语的大规模多语言姿态原生数据集,旨在为手语姿态建模与评估提供统一接口。该数据集通过DWPose统一预处理流程将原始视频转换为可直接用于建模的2D姿态序列,并提供了数据构建流程、任务定义及一个简单的SignDW Transformer基线模型。

Details

Motivation: 现有大规模手语资源通常仅提供原始视频-文本对齐的监督,且多在实验室环境下制作,缺乏面向开放世界识别、翻译以及现代姿态驱动手语视频生成框架的统一接口。RGB预训练识别模型对录制时的固定背景或服装条件依赖性强,而姿态处理模型则更具鲁棒性;同时,手语领域尚缺能与现代姿态原生范式直接对接并面向真实开放场景的数据资源。

Result: 论文构建了SignVerse-2M数据集,包含约200万个片段,覆盖25种以上手语,并通过SignDW Transformer基线模型展示了该资源在多语言姿态空间建模中的可行性及其与现代姿态驱动流程的兼容性。

Insight: 创新点在于提出了一个大规模、多语言、姿态原生的手语数据集,通过统一的DWPose姿态表示减少外观变化,保留了真实世界视频的录制条件和说话者多样性,为手语识别、生成和评估提供了可直接与现代姿态驱动框架对接的数据基础。

Abstract: Existing large-scale sign language resources typically provide supervision only at the level of raw video-text alignment and are often produced in laboratory settings. While such resources are important for semantic understanding, they do not directly provide a unified interface for open-world recognition and translation, or for modern pose-driven sign language video generation frameworks: 1. RGB-based pretrained recognition models depend heavily on fixed backgrounds or clothing conditions during recording, and are less robust in open-world settings than style-agnostic pose-processing models. 2. Recent pose-guided image/video generation models mostly use a unified keypoint representation such as DWPose as their control interface. At present, the sign language field still lacks a data resource that can directly interface with this modern pose-native paradigm while also targeting real-world open scenarios. We present SignVerse-2M, a large-scale multilingual pose-native dataset for sign language pose modeling and evaluation. Built from publicly available multilingual sign language video resources, it applies DWPose in a unified preprocessing pipeline to convert raw videos into 2D pose sequences that can be used directly for modeling, resulting in a consolidated corpus of about two million clips covering more than 25 sign languages. Unlike many laboratory datasets, this resource preserves the recording conditions and speaker diversity of real-world videos while reducing appearance variation through a unified pose representation. Toward this goal, we further provide the data construction pipeline, task definitions, and a simple SignDW Transformer baseline, demonstrating the feasibility of this resource for multilingual pose-space modeling and its compatibility with modern pose-driven pipelines, while discussing the evaluation claims it can support as well as its current limitations.


[100] Motion-Aware Caching for Efficient Autoregressive Video Generation cs.CV | cs.AIPDF

Jing Xu, Yuexiao Ma, Songwei Liu, Xuzhe Zheng, Shiwei Liu

TL;DR: 本文提出了一种名为MotionCache的运动感知缓存框架,旨在加速自回归视频生成。该框架通过利用帧间差异作为像素级运动特征的轻量级代理,动态调整每个token的缓存更新频率,从而在保持生成质量的同时,显著减少了冗余去噪步骤的计算负担。

Details

Motivation: 自回归视频生成范式在生成长视频方面具有理论潜力,但其实际部署受到顺序迭代去噪计算负担的阻碍。现有的缓存重用方法依赖于粗粒度的块级跳过,无法捕捉细粒度的像素动态,这可能导致高运动像素的错误累积。

Result: 在SkyReels-V2和MAGI-1等最先进模型上的大量实验表明,MotionCache分别实现了6.28倍和1.64倍的显著加速,同时在VBench基准测试上有效保持了生成质量(分别仅下降1%和0.01%)。

Insight: 论文的核心创新在于将缓存错误与残差不稳定性理论关联,并提出了一种从粗到细的策略:初始预热阶段建立语义连贯性,随后采用运动加权的缓存重用,根据像素运动特性动态调整更新频率。这为视频生成加速提供了一种细粒度、自适应的缓存管理新思路。

Abstract: Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. We formalize this insight theoretically by linking cache errors to residual instability, and propose MotionCache, a motion-aware cache framework that exploits inter-frame differences as a lightweight proxy for pixel-level motion characteristics. MotionCache employs a coarse-to-fine strategy: an initial warm-up phase establishes semantic coherence, followed by motion-weighted cache reuse that dynamically adjusts update frequencies per token. Extensive experiments on state-of-the-art models like SkyReels-V2 and MAGI-1 demonstrate that MotionCache achieves significant speedups of $\textbf{6.28}\times$ and $\textbf{1.64}\times$ respectively, while effectively preserving generation quality (VBench: $1%\downarrow$ and $0.01%\downarrow$ respectively). The code is available at https://github.com/ywlq/MotionCache.


[101] GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models cs.CV | cs.AIPDF

Zeshang Li, Shuoyang Zhang, Jiashen Ding

TL;DR: 该论文提出了一种名为GEASS的训练免费模块,用于缓解视觉语言模型中的物体幻觉问题。GEASS通过动态门控机制,根据每个查询的置信度和熵减情况,选择性利用模型自生成的描述文本,从而在不增加训练成本的情况下提升模型性能。

Details

Motivation: 视觉语言模型在基础推理方面表现出色,但仍易产生物体幻觉。现有方法通常将自生成描述视为统一的正向资源,但研究发现盲目嵌入描述反而会损害性能(例如使Qwen2.5-VL-3B在HallusionBench上的准确率下降近10个百分点)。因此,需要一种能够根据查询动态评估描述有用性的方法。

Result: 在POPE和HallusionBench基准测试上,对四个视觉语言模型的实验表明,GEASS始终优于原始推理和对比解码方法,且每个查询仅需两次额外前向传播。

Insight: 论文的创新点在于揭示了描述文本的两个结构性特性:1)描述不仅锚定最终答案,还影响推理轨迹和词汇选择;2)描述错误具有不对称性(遗漏远多于虚构,但每个虚构的影响更大)。基于此,GEASS通过置信度门控、熵减加权和证据阈值提升,实现了对描述有用性的逐查询动态评估,这是一种无需训练的高效幻觉缓解策略。

Abstract: Vision-Language Models (VLMs) excel at grounded reasoning but remain prone to object hallucination. Recent work treats self-generated captions as a uniformly positive resource, yet we find that naively embedding one can degrade rather than help–dropping Qwen2.5-VL-3B accuracy on HallusionBench by nearly 10 points. Two structural properties explain this. First, captions anchor not only the model’s final answer but also its reasoning trajectory and lexical choices. Second, caption errors are asymmetric: omissions vastly outnumber fabrications, yet each fabrication carries a much larger per-instance impact. A caption’s usefulness is therefore a per-query property, not a per-corpus one. We propose GEASS (Gated Evidence-Aware Selective Steering), a training-free module that decides on each query how much of the caption the model consumes: it gates the caption by the clean path’s confidence, weights it by the entropy reduction it produces, and raises the evidence bar when the two pathways disagree. Experiments on POPE and HallusionBench across four VLMs show that GEASS consistently improves over vanilla inference and contrastive decoding, with only two extra forward passes per query.


[102] Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning cs.CVPDF

Sixian Zhang, Yiyao Wang, Xinhang Song, Keming Zhang, Zijian Xu

TL;DR: 本文提出了一种多尺度高斯-语言地图(GLMap),用于零样本具身导航与推理任务。该方法通过结合显式几何表示、多尺度语义(涵盖实例和区域概念)以及双模态接口(每个语义单元同时存储自然语言描述和3D高斯表示),解决了现有语义地图方法在几何与语义权衡、以及与大模型接口兼容性方面的不足。GLMap支持通过高斯泼溅进行紧凑存储和快速渲染,并引入高斯估计器实现高效增量构建。

Details

Motivation: 现有语义地图方法在显式几何与多尺度语义之间难以兼顾,且缺乏与大模型的原生接口,导致需要额外训练特征投影以实现语义对齐。

Result: 在ObjectNav、InstNav和SQA任务上的实验表明,GLMap有效提升了目标导航和上下文推理性能,并以零样本方式保持与基于大模型方法的兼容性。

Insight: 创新点在于将3D高斯表示与自然语言描述结合形成双模态语义单元,实现了紧凑、可渲染的显式几何与多尺度语义的统一表示;同时,提出的解析式高斯估计器避免了基于梯度的优化,实现了高效增量建图。

Abstract: Understanding the geometric and semantic structure of environments is essential for embodied navigation and reasoning. Existing semantic mapping methods trade off between explicit geometry and multi-scale semantics, and lack a native interface for large models, thus requiring additional training of feature projection for semantic alignment. To this end, we propose the multi-scale Gaussian-Language Map (GLMap), which introduces three key designs: (1) explicit geometry, (2) multi-scale semantics covering both instance and region concepts, and (3) a dual-modality interface where each semantic unit jointly stores a natural language description and a 3D Gaussian representation. The 3D Gaussians enable compact storage and fast rendering of task-relevant images via Gaussian splatting. To enable efficient incremental construction, we further propose a Gaussian Estimator that analytically derives Gaussian parameters from dense point clouds without gradient-based optimization. Experiments on ObjectNav, InstNav, and SQA tasks show that GLMap effectively enhances target navigation and contextual reasoning, while remaining compatible with large-model-based methods in a zero-shot manner. The code is available at https://github.com/sx-zhang/GLMap.


[103] Joint Architecture-Token-Bitwidth Multi-Axis Optimization of Vision Transformers for Semiconductor IC Packaging cs.CVPDF

Phat Nguyen, Xue Geng, Kaixin Xu, Wang Zhe, Xulei Yang

TL;DR: 本文提出了一种针对视觉Transformer(ViT)的多轴联合优化框架,旨在同时优化架构、令牌和位宽三个维度,以在保持精度的前提下显著提升模型在资源受限工业环境中的部署效率。

Details

Motivation: 解决ViT在资源受限工业环境中部署时面临的高计算成本、内存需求和能耗问题,现有方法通常仅针对单一优化轴,限制了整体部署收益。

Result: 在ImageNet-1K上分析精度-效率权衡后,应用于内部3D X射线半导体缺陷分类数据集,实现了吞吐量提升超过10倍,参数量、FLOPs和能耗降低超过10倍,同时在下游工业任务中保持所需精度。

Insight: 创新点在于首次将架构搜索(AutoFormer)、令牌合并(ToMe)和混合精度推理(fp16)三个互补优化轴联合优化,为半导体制造等工业场景提供了资源高效的部署方案。

Abstract: Vision Transformers (ViTs) have achieved strong performance in visual recognition, yet their deployment in resource-constrained industrial environments remains limited. Some main challenges are their high computational cost, memory requirement, and energy consumption. While individual efficiency techniques such as neural architecture search (NAS), token compression, and low-precision inference have been extensively studied, most prior work targets only a single optimization axis, limiting overall deployment gains while preserving accuracy. In this paper, we present one of the first holistic frameworks that jointly optimizes three complementary axes: architecture, token, and bit-width. Specifically, the framework identifies compact backbones via Neural Architecture Search (AutoFormer), reduces information processing via token merging (ToMe), and accelerates per-operation execution via fp16 mixed-precision inference. Starting from a DeiT-B/16 baseline, we first analyze accuracy-efficiency trade-offs on ImageNet-1K under aggressive compression. Then, we apply the selected configurations to a real-world in-house 3D X-ray semiconductor defect classification dataset for IC chip packaging inspection. Results show that the proposed multi-axis framework achieves more than 10 times improvement in throughput along with over 10 times reductions in parameter count, FLOPs, and energy consumption, while maintaining the required accuracy on the downstream industrial task. To the best of our knowledge, this is among the earliest works to jointly optimize architecture, token, and bit-width dimensions in ViTs and the first such resource-efficient, deployment-focused study tailored to semiconductor manufacturing.


[104] Profile-Specific 3DMM Regression from a Single Lateral Face Image cs.CVPDF

Taiki Kanaya, Hideo Saito

TL;DR: 本文提出了一种从单张侧面RGB图像进行3D人脸重建的方法,旨在解决传统正畸头影测量分析依赖X射线成像的问题。通过引入基于几何条件生成的合成数据集ProfileSynth和一个针对侧面视图的FLAME模型回归基线,该方法利用深度和法线图条件扩散模型生成逼真侧面图像,并结合可见性感知的下颌线正则化,为侧面视图下的3D可变形模型重建提供了实用基线。

Details

Motivation: 解决从极端侧面视角(偏航角约90°)的单张RGB图像进行准确3D人脸重建的挑战,以替代有辐射风险的X射线成像,用于临床头影测量分析,同时弥补现有基于学习的3DMM回归方法主要在近正面图像上训练、难以处理侧面遮挡的不足。

Result: 论文提出了ProfileSynth合成数据集和一种针对侧面的FLAME回归基线方法,为’侧面视图×3DMM’重建任务建立了实用基线,有望为基于侧面RGB图像的非侵入式头影测量分析提供更准确的基础。

Insight: 创新点在于通过几何条件(深度和法线图)驱动扩散模型生成极端偏航角下的逼真合成数据,以解决侧面视图数据稀缺问题,并设计了可见性感知的下颌线正则化来利用侧面图像中关键的轮廓几何信息,为遮挡严重的视角重建提供了新思路。

Abstract: Single-image 3D face reconstruction is a core problem in computer vision, with important clinical applications such as cephalometric landmark analysis in orthodontics. Traditionally, this analysis relies on lateral X-ray imaging; however, frequent X-ray exposure is impractical due to radiation concerns. While recent research has explored detecting landmarks from lateral RGB images as an alternative, existing methods typically rely on 2D features such as the eyes, mouth, ears, and boundary silhouettes, failing to fully exploit the underlying 3D facial geometry spanning the facial profile and jawline, which is essential for accurate diagnosis. Meanwhile, although 3D face reconstruction from frontal views has seen significant progress, most learning-based 3D morphable model (3DMM) regressors are developed and benchmarked on near-frontal images, where appearance cues are abundant. In extreme profile views (yaw $\approx 90^\circ$), much of the face is occluded, and the available signal is dominated by boundary cues, making accurate 3D reconstruction challenging. In this paper, we bridge this gap with geometry-conditioned synthetic data and a simple profile-specific FLAME regression baseline for single lateral images. We introduce ProfileSynth, a dataset created by sampling FLAME shape and pose parameters in extreme yaw ranges and generating photorealistic profile images using a diffusion model conditioned on depth and normal maps. We further study a profile-specific baseline with visibility-aware jawline regularization. Our framework provides a practical baseline for “profile $\times$ 3DMM” reconstruction and a promising foundation for more accurate, non-invasive cephalometric analysis from lateral RGB images.


[105] PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning cs.CVPDF

Xinxing Yu, Ajian Liu, Sunyuan Qiang, Hui Ma, Liying Yang

TL;DR: 本文提出了一种基于跨样本语义传播(CSP)的点云自监督学习框架PointCSP,通过将批次内的样本序列化并由状态空间模型处理,实现跨样本的语义状态传播,以构建统一且可迁移的语义空间。同时,在微调阶段引入了非对称语义保持蒸馏(SPD),通过异构输入机制和语义特征对齐约束,确保预训练语义的稳定迁移,从而在单场景测试条件下保持结构化的语义一致性和鲁棒性。

Details

Motivation: 现有场景级点云自监督学习方法采用样本独立的建模范式,难以跨场景保持一致的语义表示,阻碍了统一且可迁移的语义空间的构建。

Result: 在多个基准数据集上的大量实验表明,该方法在性能和语义一致性方面均持续优于最先进的方法。

Insight: 创新点在于引入了跨样本语义传播机制,通过序列化批次样本和状态空间模型显式建模样本间的动态依赖,以及设计非对称语义保持蒸馏来消除批次依赖性带来的不一致,实现了全局语义对齐和稳定语义迁移。

Abstract: Scene-level point cloud self-supervised learning (PC-SSL) has demonstrated potential in enhancing the generalization capability of 3D vision models. Despite the advances in the field through existing methods, the sample-independent modeling paradigm still poses significant limitations in terms of maintaining consistent semantic representations across scenes. This challenge hinders the construction of a unified and transferable semantic space. To address this issue, we propose a PC-SSL framework based on cross-sample semantic propagation (CSP), in which samples within a batch are serialized into continuous input and processed by a state-space model to enable semantic state propagation. This mechanism explicitly models the dynamic dependencies across samples in the state space, allowing the network to establish cross-sample semantic consistency in the latent space and achieve global semantic alignment. Since serialization-based pretraining requires batch-level input organization, we further introduce an asymmetric semantic preservation distillation (SPD) during finetuning to achieve structural alignment of semantic transfer and eliminate inconsistencies caused by batch dependency. The proposed SPD ensures stable transfer of pretrained semantics through a heterogeneous input mechanism and a semantic feature alignment constraint. This enables the model to maintain structured semantic consistency and robustness under single-scene testing conditions. Extensive experiments on multiple benchmark datasets demonstrate that our method consistently outperforms state-of-the-art methods in both performance and semantic consistency.


[106] TrajShield: Trajectory-Level Safety Mediation for Defending Text-to-Video Models Against Jailbreak Attacks cs.CVPDF

Quanchen Zou, Nizhang Li, Wenxin Zhang, Jiaye Lin, Yangchen Zeng

TL;DR: 本文提出了一种名为TrajShield的免训练推理时防御框架,用于保护文本到视频(T2V)模型免受越狱攻击。该框架将T2V安全问题重新定义为对时间结构化语义空间的因果干预,通过模拟提示的隐含轨迹、定位潜在风险的因果起源,并应用最小侵入式重写来消除风险,从而统一处理显式不安全提示、越狱攻击和时间性涌现风险。

Details

Motivation: 现有基于提示词层面的防御方法主要继承自文本到图像模型,仅作用于输入的词汇表面,容易受到通过改写或对抗性提示来伪装有害意图的越狱攻击。此外,T2V生成引入了一个先前工作忽视的独特挑战:时间性涌现风险,即看似良性的提示可能因生成器为追求叙事连贯性而进行时间外推,最终导致不安全内容。

Result: 在涵盖14个安全类别和多个T2V后端模型的T2VSafetyBench基准测试上,TrajShield实现了最先进的防御性能,同时保持了高语义保真度,显著优于现有防御方法,平均攻击成功率降低了52.44%。

Insight: 论文的核心创新在于将T2V安全防御问题重新定义为对时间结构化语义空间的因果干预,并提出了一个统一的框架来处理多种风险类型。其关键洞见在于通过模拟和干预隐含的语义轨迹来定位和化解风险,这是一种超越传统词汇层面防御的、更具鲁棒性的方法。

Abstract: Text-to-Video (T2V) models have demonstrated remarkable capability in generating temporally coherent videos from natural language prompts, yet they also risk producing unsafe content such as violence or explicit material. Existing prompt-level defenses are largely inherited from text-to-image safety and operate on the lexical surface of the input, making them vulnerable to jailbreak attacks that disguise harmful intent through rephrasing or adversarial prompting. Moreover, T2V generation introduces a distinctive challenge overlooked by prior work: temporally emergent risk, where a seemingly benign prompt leads to unsafe content through the generator’s temporal extrapolation toward narrative coherence. We propose \method{}, a training-free, inference-time defense framework that reformulates T2V safety as a causal intervention in a temporally structured semantic space. TrajShield handles explicit unsafe prompts, jailbreak attacks, and temporally emergent risks in a unified manner by simulating the implied trajectory of a prompt, localizing the causal origin of potential risk, and applying a minimally invasive rewrite that neutralizes the risk while preserving safety-irrelevant semantics. Experiments on T2VSafetyBench across 14 safety categories and multiple T2V backends demonstrate that TrajShield achieves state-of-the-art defenseive performance while maintaining high semantic fidelity, substantially outperforming existing defenses, with an average ASR reduction of 52.44%.


[107] MedScribe: Clinically Grounded CT Reporting through Agentic Workflows cs.CVPDF

Giuseppe A. Orlando, Paolo Papotti, Maria A. Zuluaga, Olivier Humbert, Marco Lorenzi

TL;DR: 本文提出了MedScribe,一个用于自动化生成CT影像报告的假设驱动框架。该框架将报告生成重新定义为迭代的证据获取过程,而非单次编码任务。它利用大语言模型动态调用特定病理诊断工具来提取局部体积特征,并通过查询与病理文本证据对齐的多维检索空间来积累定量证据,从而减少幻觉并增强临床准确性。

Details

Motivation: 现有基于视觉语言模型的放射学报告生成方法依赖于体数据的全局嵌入压缩,这常常导致幻觉性发现以及在3D CT成像中解剖学依据不足的问题。本文旨在通过假设驱动的推理过程来解决这些问题,提升报告的可靠性和可解释性。

Result: 在无需任务特定微调的情况下,MedScribe在CT-RATE和RadChestCT基准测试上,相比最先进的2D和3D视觉语言模型,在临床准确性、事实一致性和可解释性方面均取得了提升。

Insight: 主要创新点在于将报告生成重构为基于代理工作流的顺序决策过程,通过动态工具调用和结构化特征检索实现细粒度证据积累。这为可靠医学图像报告提供了一种可解释的、假设驱动的推理新范式,其模块化设计可能泛化至其他需要精确证据支持的视觉语言任务。

Abstract: Vision-language models (VLMs) have shown potential for automated radiology report generation, yet existing approaches rely on global embedding compression of volumetric data, often leading to hallucinated findings and limited anatomical grounding in 3D CT imaging. We introduce MedScribe, a hypothesis-driven framework that reformulates report generation as an iterative evidence acquisition process rather than a single-pass encoding task. MedScribe models reporting as a sequential decision process in which a large language model dynamically invokes pathology-specific diagnostic tools to extract localized volumetric features. These structured features are used to query a multidimensional retrieval space aligned with pathology-specific textual evidence. By explicitly accumulating quantitative evidence prior to synthesis, the framework enforces fine-grained grounding and reduces unsupported claims. Without task-specific fine-tuning, MedScribe improves clinical accuracy, factual consistency, and interpretability on CT-RATE and RadChestCT compared to state-of-the-art 2D and 3D VLMs, demonstrating the value of hypothesis-driven reasoning for reliable medical image reporting.


[108] Embody4D: A Generalist 4D World Model for Embodied AI cs.CVPDF

Peiyan Tu, Hanxin Zhu, Jingwen Sun, Shaojie Ren, Cong Wang

TL;DR: Embody4D是一个专门为具身AI场景设计的视频到视频世界模型,能够从单目视频合成任意新视角的视频。它通过构建异构数据集、设计自适应噪声注入策略以及引入交互感知注意力机制,解决了多视角数据稀缺、生成3D几何时空一致性差以及操作细节易产生幻觉等挑战,从而合成高保真、视角一致的视频,以支持下游机器人规划与学习。

Details

Motivation: 当前大多数具身世界模型仍局限于2D表示,缺乏对具身空间推理至关重要的全面多视角信息。构建此类模型的挑战主要来自配对多视角数据的严重稀缺、生成3D几何中保持时空一致性的困难,以及模型容易对操作细节产生幻觉。

Result: 大量实验表明,Embody4D在合成高保真、视角一致的视频方面取得了最先进的性能,作为一个强大的世界模型,能够赋能下游机器人规划与学习。

Insight: 论文的创新点包括:1)提出一个3D感知的组合合成流水线来构建异构数据集,以解决数据稀缺并保证广泛泛化;2)设计一种自适应噪声注入策略,利用图像区域间的置信度差异选择性地正则化扩散过程,以确保严格的时空一致性;3)引入一个交互感知注意力机制,明确关注机器人交互区域以保证操作保真度。这些方法为解决具身世界模型中的关键挑战提供了系统性方案。

Abstract: World models have made significant progress in modeling dynamic environments; however, most embodied world models are still restricted to 2D representations, lacking the comprehensive multi-view information essential for embodied spatial reasoning. Bridging this gap is non-trivial, primarily due to challenges from severe scarcity of paired multi-view data, the difficulty of maintaining spatiotemporal consistency in generated 3D geometries, and the tendency to hallucinate manipulation details. To address these challenges, we propose Embody4D, a dedicated video-to-video world model for embodied scenarios, capable of synthesizing arbitrary novel views from a monocular video. First, to tackle data scarcity, we introduce a 3D-aware compositional synthesis pipeline to curate a heterogeneous dataset compositing cross-embodiment robotic arms with diverse backgrounds, guaranteeing broad generalization. Second, to enforce geometric stability, we devise an adaptive noise injection strategy; by leveraging confidence disparities across image regions, this method selectively regularizes the diffusion process to ensure strict spatiotemporal consistency. Finally, to guarantee manipulation fidelity, we incorporate an interaction-aware attention mechanism that explicitly attends to the robotic interaction regions. Extensive experiments demonstrate that Embody4D achieves state-of-the-art performance, serving as a robust world model that synthesizes high-fidelity, view-consistent videos to empower downstream robotic planning and learning.


[109] Cross-Domain Adversarial Augmentation: Stabilizing GANs for Medical and Handwriting Data Scarcity cs.CVPDF

Md. Sohanuzzaman Soad, Mahady Al Hady, S M Rafiuddin Rifat, Sudip Ghose

TL;DR: 本文研究了在数据稀缺的视觉任务中,使用生成对抗网络(GANs)进行跨领域生成增强的方法。作者在孟加拉语手写字符和胸部X射线图像两个低资源领域,使用DCGAN风格的模型(64x64分辨率)进行实验,通过Inception Score(IS)、Fr’echet Inception Distance(FID)和嵌入可视化(t-SNE/UMAP)评估生成数据的保真度和多样性,并通过在真实与合成数据上训练分类器来评估下游效用。实验表明,生成增强能提高样本多样性,并在有限数据条件下提升分类器性能。作者还分析了稳定性增强技术(如梯度惩罚目标和谱归一化),并讨论了医学图像评估的注意事项、数据集许可以及合成数据的隐私风险。

Details

Motivation: 解决计算机视觉任务中,特别是在孟加拉语手写字符和胸部X射线图像等低资源领域,因数据稀缺而导致模型性能受限的问题。

Result: 在有限数据条件下,生成增强提高了样本多样性,并一致提升了分类器性能;实验使用了IS、FID和嵌入可视化进行评估,但未明确提及是否达到SOTA水平,而是提供了一个可复现的强基线。

Insight: 创新点包括跨领域应用生成增强以缓解数据稀缺,并系统分析了稳定性增强技术(如梯度惩罚和谱归一化)对GAN训练的影响;客观来看,该方法为资源受限的成像任务提供了一个简单可复现的协议,强调了评估中的注意事项和隐私风险,具有实际借鉴价值。

Abstract: Generative Adversarial Networks (GANs) offer a pragmatic route to mitigate data scarcity in vision tasks. We study generative augmentation across two low-resource domains: Bangla handwritten characters and chest X-ray imaging using DCGAN-style models trained at 64x64 resolution. We evaluate fidelity and diversity via Inception Score (IS), Fr’echet Inception Distance (FID), and embedding visualizations (t-SNE/UMAP), and assess downstream utility by training classifiers on real versus real synthetic data. Our experiments show that generative augmentation improves sample diversity and yields consistent gains in classifier performance under limited-data regimes. We analyze stability enhancements (e.g., gradient-penalized objectives and spectral normalization) and report ablations on synthetic-to-real ratios and sample filtering. We discuss evaluation caveats for medical images, dataset licensing, and privacy risks associated with synthetic data. The resulting protocol is simple to reproduce and provides a strong baseline for applying generative augmentation to resource-constrained imaging tasks.


[110] Hybrid Visual Telemetry for Bandwidth-Constrained Robotic Vision: A Pilot Study with HEVC Base Video and JPEG ROI Stills cs.CV | cs.LG | cs.ROPDF

Natalia Trukhina, Vadim Vashkelis

TL;DR: 本文提出了一种面向带宽受限机器人视觉的混合视觉遥测方案,该方案结合了连续的低比特率HEVC基础视频流和选择性传输的高细节JPEG静态感兴趣区域(ROI)图像,以同时支持动态场景理解和精细目标识别。

Details

Motivation: 解决单一压缩视频流在带宽受限系统中难以兼顾连续场景感知和精细机器感知需求的问题,即低比特率视频保留运动和粗粒度上下文但丢失局部细节,而混合架构可弥补这一缺陷。

Result: 在面向无人机的数据集和两种实际比特率条件下,通过实验协议比较了纯视频方案与混合方案在相同总通信预算下的性能,结果表明混合方案能通过ROI静态图像提升目标级分类精度。

Insight: 创新点在于形式化并验证了混合传输范式(基础视频+ROI静态图像)的可行性,为后续研究(如采用JPEG AI作为语义静态图像通道)奠定了方法学基础,而非提出新的静态图像编解码器。

Abstract: Bandwidth-constrained robotic and surveillance systems often rely on a single compressed video stream to support both continuous scene awareness and downstream machine perception. In practice, this creates a mismatch: low-bitrate video can preserve motion and coarse context, but often loses the fine local detail needed for reliable object recognition and decision-making. Motivated by a hybrid architecture in which low-resolution video supports dynamic scene understanding while eventdriven high-detail regions of interest (ROIs) support close-up identification and analytics, this paper formalizes a two-channel visual telemetry scheme in which a continuous low-bitrate video stream is augmented by selectively transmitted high-detail still ROIs. This first paper does not attempt to prove the superiority of a new still-image codec. Instead, it establishes the hybrid transmission paradigm itself using a practical and reproducible codec stack: x265/HEVC for the base video stream and JPEG stills for ROI refinement. We formulate the problem as bitrate-constrained information selection for robotic vision and define an experimental protocol in which video-only and hybrid schemes are compared under matched total communication budgets. The study is designed around UAV-oriented datasets, two practical bitrate regimes, several ROI triggering policies, and object-level classification refinement on selectively transmitted ROI stills. The resulting paper lays the methodological foundation for a second-stage investigation of JPEG AI as the semantic still-image channel within the same hybrid architecture.


[111] Referring Multiple Regions with Large Multimodal Models via Contextual Latent Steering cs.CVPDF

Yun Xing, Hanyuan Liu, Jiahao Nie, Shijian Lu

TL;DR: 本文提出了一种无需训练的方法CSteer,通过上下文潜在引导,使通用大型多模态模型能够同时参考多个图像区域进行上下文感知的视觉指代,无需微调或修改架构。

Details

Motivation: 现有大型多模态模型在全图理解上表现良好,但在基于视觉提示进行区域级感知、尤其是同时参考多个区域或需要全局上下文进行精确视觉指代时存在困难。

Result: 在多个数据集上的实验表明,结合CSteer的通用LMM在大多数情况下优于专门设计的指代LMM,为该领域设定了新的最先进水平。

Insight: 创新点在于通过预计算表示视觉指代行为的上下文向量(如区域区分和全局上下文关注),并在推理时进行表征编辑,实现了训练免费的多区域上下文指代能力提升。

Abstract: Large Multimodal Models (LMMs) have recently demonstrated their proficiency in holistic visual comprehension. However, most of them struggle to tackle region-level perception guided by visual prompts, especially for cases where multiple regions are referred simultaneously, or scenarios where global contexts are necessary for precise visual referring. We introduce Contextual Latent Steering (CSteer), a training-free approach for guiding general LMMs to refer multiple regions contextually, without expensive fine-tuning or architectural modifications. CSteer starts with pre-computing contextual vectors that implicitly represent visual referring behaviors, such as differentiation among regions and attention to global contexts, followed by representation editing during inference time. Experimental results on multiple datasets indicate that general LMMs with CSteer outperform tailored referring LMMs in most cases, suggesting a promising solution in training-free, and setting new state-of-the-art for this field. Code is available at https://github.com/xing0047/csteer.git.


[112] DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity cs.CVPDF

Lilika Makabe, Kohei Ashida, Hiroaki Santo, Fumio Okura, Yasuyuki Matsushita

TL;DR: 该论文提出了一种名为DP-SfM的新方法,利用双像素(DP)传感器捕获的多视图图像来解决多视图3D重建中的尺度模糊问题,无需参考物体或先验标定。该方法通过分析DP图像中的散焦模糊信息,结合从多视图重建得到的深度图(带尺度模糊),首先使用线性方法估计绝对尺度,然后通过基于强度的优化阶段对齐左右DP图像。

Details

Motivation: 多视图3D重建(如运动恢复结构和多视图立体)是3D计算机视觉的基础,但通常存在未知的尺度模糊问题,除非场景中包含已知尺寸的参考物体。论文旨在利用DP传感器的特性自动解决这一尺度模糊,无需额外参考或校准。

Result: 实验表明,该方法在不同相机和镜头捕获的多样化场景中均有效,代码和数据已开源。

Insight: 创新点在于首次利用DP图像的散焦模糊信息来解析多视图重建的绝对尺度,提出了一种结合线性估计和强度优化的简单有效方法,为无参考物体的3D重建提供了新思路。

Abstract: Multi-view 3D reconstruction, namely, structure-from-motion followed by multi-view stereo, is a fundamental component of 3D computer vision. In general, multi-view 3D reconstruction suffers from an unknown scale ambiguity unless a reference object of known size is present in the scene. In this article, we show that multi-view images captured using a dual-pixel (DP) sensor can automatically resolve the scale ambiguity, without requiring a reference object or prior calibration. Specifically, the defocus blur observed in DP images provides sufficient information to determine the absolute scale when paired with depth maps (up to scale) recovered from multi-view 3D reconstruction. Based on this observation, we develop a simple yet effective linear method to estimate the absolute scale, followed by the intensity-based optimization stage that aligns the left and right DP images by shifting them back toward each other using cross-view blur kernels. Experiments demonstrate the effectiveness of the proposed approach across diverse scenes captured with different cameras and lenses. Code and data are available at https://github.com/lilika-makabe/dp-sfm-tpami.git


[113] High-Fidelity Mobile Avatars with Pruned Local Blendshapes cs.CV | cs.GRPDF

Youyi Zhan, He Wang, Tianjia Shao, Kun Zhou

TL;DR: 本文提出了一种从多视角视频重建高保真人体化身的方法,该方法能够在移动设备上实时运行。通过使用局部线性混合形状来捕捉高斯属性的全局非线性变化,并移除变化较小的高斯混合形状以减少计算量和模型大小,实现了端到端训练且无需预训练模型。

Details

Motivation: 现有基于高斯建模的高质量全身化身方法计算量大,难以在移动设备上部署;而现有移动端方法通过从预训练模型蒸馏和全局姿态特征线性组合建模,会损失细节。本文旨在实现移动端高保真化身渲染,平衡细节质量与计算效率。

Result: 实验表明,该方法能渲染出细节更好的高质量人体化身,在移动设备上以2K分辨率达到120 FPS。

Insight: 创新点在于观察到身体局部区域内的高斯点高度相关,可用局部线性混合形状建模全局非线性变化,并通过剪枝变化小的混合形状实现最小化表示;无需预训练模型的端到端训练以及WebGPU实现确保了跨设备部署能力。

Abstract: We propose a method to reconstruct high-fidelity human avatars from multi-view video that can run on mobile devices. Many works can model high-quality Gaussian-based full-body avatars from multi-view video. However, these methods require heavy computation to obtain pose-dependent appearance, making deployment on mobile devices very difficult. Recent methods distill from pretrained models and model pose-dependent nonlinear Gaussian attributes by linearly combining global pose features with blendshapes. Although they can run on mobile devices, they suffer some loss of detail. We observe that nearby Gaussians are often highly correlated within a local region of the body, and can be linearly modeled with less error. Therefore, we use local linear blendshapes in small body parts to capture global nonlinear changes of Gaussian attributes. To further reduce computation and model size, we propose to remove blendshapes for Gaussians whose attributes change little, yielding a minimal blendshape representation. Our method is an end-to-end training method without a pretrained model. To make it run on multiple devices, we implement our method using WebGPU. Experiments show that our method can render high-quality human avatars with better details, and can reach 120 FPS at 2K resolution on mobile devices.


[114] Decouple and Cache: KV Cache Construction for Streaming Video Understanding cs.CVPDF

Zhanzhong Pang, Dibyadip Chatterjee, Fadime Sener, Angela Yao

TL;DR: 本文提出了一种名为解耦流缓存(DSCache)的训练免费缓存构建机制,用于将预训练的离线视频理解模型适配到流式视频处理场景。该方法通过维护累积的历史键值缓存和按需构建的即时缓存,解决了无限视频流处理中的内存限制和位置外推问题,在流式视频问答基准测试中实现了最先进的性能。

Details

Motivation: 流式视频理解需要以有限的内存和计算资源处理无限长的视频流,面临两个关键挑战:一是需要持续构建新缓存并淘汰旧缓存;二是模型必须从短序列训练中学习并泛化到长流。现有方法未能有效扩展到无限流或主要关注缓存重用,而缓存构建的影响未被充分探索。

Result: 在流式视频问答基准测试上,DSCache实现了最先进的性能,平均准确率比先前方法提升了2.5%。

Insight: 创新点包括:1)解耦的缓存结构,分离累积历史缓存和即时缓存,以保持近期输入的信息性;2)位置无关的编码策略,支持超出训练长度的位置外推,防止位置溢出。这些机制无需额外训练即可适配预训练模型,提升了流式处理的效率和泛化能力。

Abstract: Streaming video understanding requires processing unbounded video streams with limited memory and computation, posing two key challenges. First, continuously constructing new and evicting old key-value(KV) caches is required for unbounded streams. Secondly, due to the high cost of collecting and training on unbounded streams, models must learn from short sequences while generalizing to long streams. Existing streaming VideoVLLMs fail to scale to unbounded video streams or focus on cache reuse strategies, leaving the impact of cache construction underexplored. In this paper, we propose Decoupled Streaming Cache(DSCache), a training-free cache construction mechanism that adapts pretrained offline models to streaming settings. DSCache maintains a cumulative past KV cache while constructing a separate instant cache on-demand, decoupled from past caches to preserve the informativeness of recent inputs. To enable position extrapolation beyond the training length, DSCache further incorporates a position-agnostic encoding strategy, ensuring KV caches to support unseen positions and preventing position overflow. Experiments on Streaming Video QA benchmarks demonstrate DSCache’s state-of-the-art performance, with an average 2.5% accuracy gains over prior methods.


[115] BadmintonGRF: A Multimodal Dataset and Benchmark for Markerless Ground Reaction Force Estimation in Badminton cs.CV | cs.AIPDF

Kuoye Niu, Jianwei Li, Shengze Cai, Yong Ma, Mengyao Jia

TL;DR: BadmintonGRF是一个用于羽毛球运动中无标记地面反作用力估计的多模态数据集与基准。它同步采集了约120 FPS的八视角RGB视频、四个Kistler测力台和Vicon运动捕捉数据,并通过人工验证事件、自动化质量保证和带不确定度元数据的相机时间偏移进行对齐。数据集分为两个层级:Tier 1提供姿态、时间对齐的GRF和元数据,用于基准测试;Tier 2在受控访问下提供原始RGB和运动捕捉数据。公共版本包含17,425个击球片段,基准测试加载器处理后保留12,867个视图特定实例和1,732个唯一击球。

Details

Motivation: 解决非周期性球场运动(如羽毛球)中,缺乏实验室级传感的多模态公开数据集的问题,特别是缺少将仪器测量的地面反作用力与高帧率多视角视频配对的数据,这限制了在真实训练环境中进行无标记负荷估计的研究。

Result: 论文提出了一个基准任务,将2D姿态映射到GRF,并提供了十个参考基线模型和可选的后期融合方法。在包含10名受试者的基准数据集上进行了评估,具体定量结果未在摘要中明确给出,但该数据集本身旨在为相关研究建立新的基准。

Insight: 创新点在于构建了首个结合了多视角视频、测力台和运动捕捉的公开羽毛球数据集,并通过精心设计的时间对齐和元数据确保了多模态数据的同步质量。数据集的层级化发布策略(Tier 1和Tier 2)平衡了开放性与数据安全,为无标记GRF估计研究提供了宝贵的资源。从客观角度看,其多模态同步与对齐方法,以及针对击球事件的中心化数据组织,是值得借鉴的技术细节。

Abstract: Multimodal resources for non-periodic court sports with laboratory-grade sensing remain scarce: few publicly pair instrumented ground reaction force (GRF) with high-frame-rate multi-view video, limiting markerless load estimation in realistic training settings. BadmintonGRF records eight synchronized RGB views at ~120 FPS, four Kistler force plates, and Vicon motion capture (C3D) without hardware genlock across modalities; alignment combines human-verified events, automated quality assurance, and per-camera time offsets with uncertainty metadata. Tier 1 distributes pose, time-aligned GRF, metadata, and splits under CC BY-NC 4.0, enabling the primary benchmark without raw RGB or C3D; we report a Tier 1 task that maps 2D pose to GRF. Tier 2 provides raw RGB and C3D under controlled access for studies that require appearance or full kinematics. The public release contains 17,425 impact-segment archives in the 10-subject benchmark tree (156 instrumented trials; raw multi-view RGB alone exceeds 1 TB); benchmark loader gates retain 12,867 view-specific instances and 1,732 unique impacts after multi-view deduplication. We are not aware of prior public badminton corpora that combine this sensing layout with audited video–GRF alignment for impact-centric GRF estimation. We distribute preprocessing code, leave-one-subject-out splits, ten reference baselines, and optional late fusion (one deterministic test-time pass per instance; no test-time augmentation), with a within-trial diagnostic in the supplementary material.


[116] Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts cs.CV | cs.AI | cs.LGPDF

Hongkun Pan, Yuwei Wu, Wanyi Hong, Shenghui Hu, Qitong Yan

TL;DR: 本文提出Chart-FR1模型,一种视觉焦点驱动的细粒度图表推理方法,旨在解决多模态大语言模型在处理高信息密度图表时面临的细粒度感知不足、视觉信息冗余和缺乏自适应深度推理等挑战。

Details

Motivation: 现有MLLMs在包含多个子图、图例和密集标注的高信息密度图表上表现不佳,主要受限于细粒度感知缺失、冗余视觉信息干扰以及推理深度无法自适应调整。

Result: 在多个图表基准测试上的广泛实验表明,Chart-FR1在图表理解和推理任务上超越了当前最先进的多模态大语言模型。

Insight: 创新点包括:提出Focus-CoT视觉焦点思维链以增强细粒度感知;引入Focus-GRPO强化学习算法,通过信息效率奖励压缩冗余信息,并采用自适应KL惩罚机制动态控制推理深度;构建了HID-Chart基准数据集来评估高信息密度图表的推理能力。

Abstract: Multimodal large language models (MLLMs) have shown considerable potential in chart understanding and reasoning tasks. However, they still struggle with high information density (HID) charts characterized by multiple subplots, legends, and dense annotations due to three major challenges: (1) limited fine-grained perception results in the omission of critical visual cues; (2) redundant or noisy visual information undermines the performance of multimodal reasoning; (3) lack of adaptive deep reasoning relative to the amount of visual information. To tackle these challenges, we present a novel focus-driven fine-grained chart reasoning model, Chart-FR1, to improve perception, focusing efficiency, and adaptive deep reasoning on HID charts. Specifically, we propose Focus-CoT, a visual focusing chain-of-thought that enhances fine-grained perception by explicitly linking reasoning steps to key visual cues, such as local image regions and OCR signals. Building on this, we introduce Focus-GRPO, a focus-driven reinforcement learning algorithm with an information-efficiency reward that compresses redundant visual information for efficient focusing, and an adaptive KL penalty mechanism that enables flexible control over reasoning depth as more visual cues are discovered. Furthermore, to fill the gap in benchmarks for HID charts, we build HID-Chart, a challenging benchmark with an information-density metric designed to evaluate fine-grained chart reasoning capabilities. Extensive experiments on multiple chart benchmarks demonstrate that Chart-FR1 outperforms state-of-the-art MLLMs in chart understanding and reasoning. Code is available at https://github.com/phkhub/Chart-FR1.


[117] AFFormer: Adaptive Feature Fusion Transformer for V2X Cooperative Perception under Channel Impairments cs.CV | cs.AIPDF

Xi Zhou, Tao Huang, Qing-Long Han, Rana Abbas, Mostafa Rahimi Azghadi

TL;DR: 本文提出了一种名为AFFormer的自适应特征融合Transformer框架,旨在增强V2X协同感知在信道损伤(如噪声、衰落和干扰)下的鲁棒性。该框架通过建模时间、智能体间和空间相关性来减轻受损特征的不利影响,并引入了多智能体与时间聚合、双重空间注意力以及不确定性引导融合等关键模块,结合师生知识蒸馏策略进一步提升性能。

Details

Motivation: 解决V2X协同感知中因信道损伤(如噪声、衰落和干扰)导致特征退化的问题,以提高智能交通系统的可靠性。

Result: 在V2XSet和DAIR-V2X数据集上验证,AFFormer在理想和受损通信条件下均优于现有方法,表现出对通信引起的特征退化的更强鲁棒性,同时保持了竞争性的效率-精度权衡。

Insight: 创新点包括:基于Transformer的框架整合时间、智能体间和空间相关性建模;引入不确定性引导融合进行熵驱动特征细化;采用师生知识蒸馏对齐融合特征与可靠早期协同监督,以增强鲁棒性。

Abstract: Accurate 3D object detection is essential for ensuring the safety of autonomous vehicles. Cooperative perception, which leverages vehicle-to-everything (V2X) communication to share perceptual data, enhances detection but is vulnerable to channel impairments, such as noise, fading, and interference. To strengthen the reliability of intelligent transportation systems, this work improves the robustness of V2X cooperative perception under communication conditions that reflect common channel impairments. This paper proposes an Adaptive Feature Fusion Transformer (AFFormer), a Transformer-based framework that mitigates the adverse effects of corrupted features by modeling temporal, inter-agent, and spatial correlations. AFFormer introduces three key modules: Multi-Agent and Temporal Aggregation for context-aware fusion across agents and over time, Dual Spatial Attention for efficient modeling of spatial dependencies, and Uncertainty-Guided Fusion for entropy-driven refinement of fused features. A teacher-student knowledge distillation strategy further enhances robustness by aligning fused features with reliable early-collaboration supervision. AFFormer is validated on the V2XSet and DAIR-V2X datasets, where it consistently outperforms existing methods under both ideal and impaired communication conditions, demonstrating improved robustness to communication-induced feature degradation while maintaining a competitive efficiency-accuracy trade-off.


[118] Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models cs.CVPDF

Junyuan Xiao, Dingkang Liang, Xin Zhou, Yixuan Ye, Tongtong Su

TL;DR: 该论文提出了一种名为M^2-REPA的解耦表示对齐方法,专门用于多模态世界模型中的视频生成。其核心思想是将扩散模型中间表示中的模态特定特征解耦,并分别与对应模态的专家基础模型对齐,通过多模态表示对齐损失和模态特定解耦正则化两个协同目标,充分利用多个基础模型的先验知识,从而提升生成视频的视觉质量和长期一致性。

Details

Motivation: 现有新兴的多模态世界模型试图联合生成多种模态(如RGB、深度和掩码)的视频,但未能充分利用现有基础模型的丰富先验知识。

Result: 大量实验表明,该方法在视觉质量和长期一致性方面显著优于基线模型,但摘要中未提及具体的基准测试名称或是否达到SOTA水平。

Insight: 论文宣称的创新点在于首次提出了针对多模态视频生成的表示对齐方法,其关键见解是将不同模态空间训练的基础模型视为互补的“专家”,并通过解耦和对齐机制联合优化。从客观角度看,该方法通过特征解耦和专家对齐的设计,有效整合了多模态先验,是多模态生成领域一个有价值的架构创新。

Abstract: Emerging multi-modal world models attempt to jointly generate videos across diverse modalities (e.g., RGB, depth, and mask), yet they fail to fully exploit the rich priors of existing foundation models. We propose $M^2$-REPA, the first representation alignment method tailored for multi-modal video generation. Our key insight is that foundation models trained on different modality spaces naturally capture distinct domain-specific priors, acting as complementary “experts.” Specifically, we first decouple modality-specific features from the diffusion model’s intermediate representations, then align each with its corresponding expert foundation model. To this end, we design two synergistic objectives: a multi-modal representation alignment loss that enforces feature-to-expert matching, and a modality-specific decoupling regularization that encourages complementarity across different modalities. This design enables joint optimization, fully exploiting priors from multiple foundation models. Extensive experiments demonstrate that our method significantly outperforms baselines in visual quality and long-term consistency.


[119] Behavior-Grounded Lane Representation Learning for Multi-Task Traffic Digital Twins cs.CV | cs.AIPDF

Rei Tamaru, Pei Li, Bin Ran

TL;DR: 本文提出了GeoLaneRep,一个基于行为驱动的车道表示学习框架,用于交通数字孪生。该框架通过联合编码静态车道几何、观测到的车辆轨迹和操作描述符,生成跨摄像头的共享语义嵌入。模型通过对比跨摄像头对齐、辅助角色监督和时间异常检测的联合目标进行训练,并在零样本跨摄像头匹配和异常检测任务上取得了优异性能,同时还能基于行为嵌入通过扩散模型生成满足特定操作规范的车道几何。

Details

Motivation: 现有交通数字孪生系统大多基于静态几何表示,无法捕捉动态功能语义(如车道在复杂交通条件下的运行方式),因此需要一种能够表征车道行为语义的表示学习方法。

Result: 在16个路侧摄像头和132条车道上,学习到的嵌入在零样本跨摄像头匹配中实现了0.004的横向排序误差和1.000的边缘角色F1分数,在窗口级异常检测中AUROC达到0.991。基于扩散的生成器在38个车道组上实现了87.9%的总体规范准确率。

Insight: 创新点在于将车道几何、车辆轨迹和操作描述符联合编码为行为接地的语义嵌入,实现了跨摄像头的语义对齐与迁移,并为下游任务(如监控和生成)提供了统一的语义接口;客观来看,其多任务联合训练框架和将行为嵌入用于条件生成的设计具有借鉴意义。

Abstract: Traffic digital twins are powerful tools for advanced traffic management, and most systems are built on static geometric representations. However, these representations fail to capture the dynamic functional semantics required for behavior-aware reasoning, such as how a lane operates under complex traffic conditions. To address this gap, we introduce GeoLaneRep, a behavior-grounded lane representation learning framework for traffic digital twins. GeoLaneRep jointly encodes static lane geometry, observed vehicle trajectories, and operational descriptors into a shared, cross-camera semantic embedding. The encoder is trained with a joint objective combining contrastive cross-camera alignment, auxiliary role supervision, and temporal anomaly detection. Across 16 roadside cameras and 132 lanes, the learned embeddings achieve a $0.004$ lateral-rank error and an edge-role F1 of $1.000$ in zero-shot cross-camera matching, and an AUROC of $0.991$ for window-level anomaly detection. We further show that the same behavioral embeddings can condition a diffusion-based generator to synthesize lane geometries that satisfy targeted operational specifications, with $87.9%$ overall specification accuracy across 38 lane groups. GeoLaneRep thus provides a semantic interface between roadside observations and downstream digital twin tasks, supporting cross-camera transfer, behavior-aware monitoring, and goal-directed lane synthesis. The framework is openly available at https://github.com/raynbowy23/GeoLaneRep.


[120] SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA? cs.CVPDF

Jongmin Shin, Ka Young Kim, Eunki Cho, Seong Tae Kim, Namkee Oh

TL;DR: 该论文提出了SurgCheck基准测试,用于诊断和量化视觉语言模型在手术视觉问答任务中对语言捷径的依赖程度,通过对比原始问题和去偏问题之间的性能差距,揭示了现有模型在手术VQA中可能过度依赖文本线索而非真实视觉理解。

Details

Motivation: 现有手术视觉问答数据集存在语言捷径问题,即问题表述隐含地限制了答案空间,导致模型性能可能反映的是对文本模式的依赖而非真正的视觉理解,因此需要一种诊断方法来量化这种依赖。

Result: 在SurgCheck基准上评估了五个视觉语言模型,发现在去偏问题上性能普遍下降,文本消融实验表明动作和目标预测主要依赖语言捷径而非视觉推理,揭示了现有基准性能可能掩盖了模型的实际视觉理解缺陷。

Insight: 创新点在于设计了配对问题诊断框架(原始问题与去偏问题对比)和四种接地线索(边界框、箭头、空间位置和迂回表述),以及引入LLM作为评判者的评估协议,为手术VQA领域提供了偏差感知的评估方法,强调了构建无偏数据集的重要性。

Abstract: Purpose: Vision-language models (VLMs) have shown promising performance in surgical visual question answering (VQA). However, existing surgical VQA datasets often contain linguistic shortcuts, where question phrasing implicitly constrains the answer space. It remains unclear whether reported performance reflects visual understanding or reliance on such linguistic shortcuts. Methods: We introduce SurgCheck, a diagnostic benchmark for quantifying linguistic shortcut reliance in surgical VQA. SurgCheck employs a paired-question design in which each surgical frame is associated with an original question containing entity names and a less-biased counterpart that removes these names while preserving identical visual content and ground-truth answers. The resulting performance gap provides a diagnostic signal of shortcut reliance. To ensure that the less-biased question remains well-defined even without entity names, four grounding cues are incorporated: bounding box, arrow, spatial position, and periphrasis. We evaluate both general-purpose and surgical-specific VLMs under zero-shot and fine-tuned settings on SurgCheck. To evaluate open-ended zero-shot responses, we introduce an LLM-as-a-judge evaluation protocol. Results: Using SurgCheck, we observe consistent performance degradation on less-biased questions across five VLMs, despite identical visual inputs. Text-only ablation reveals minimal performance drops for action and target prediction, indicating that action and target prediction is largely driven by linguistic shortcuts rather than visual reasoning. Conclusion: SurgCheck provides a controlled diagnostic framework that exposes failure modes masked by linguistic bias in existing surgical VQA benchmarks. Our findings demonstrate that strong benchmark performance does not necessarily imply faithful visual understanding, underscoring the need for bias-aware evaluation in surgical VQA.


[121] SimPB++: Simultaneously Detecting 2D and 3D Objects from Multiple Cameras cs.CVPDF

Yingqi Tang, Zhaotie Meng, Erkang Cheng, Haibin Ling

TL;DR: 本文提出了SimPB++模型,用于从多摄像头输入中同时检测2D透视视角和3D鸟瞰视角(BEV)下的物体。该模型通过一个混合解码器架构将两项任务统一到一个端到端模型中,并引入了动态查询分配和自适应查询聚合等模块实现2D与3D检测的深度交互。此外,模型还设计了用于长距离感知的裁剪缩放策略、带辅助RoI检测器的传播去噪策略,并支持混合监督以减少对昂贵3D标注的依赖。

Details

Motivation: 解决多摄像头自动驾驶场景中,同时感知2D透视视角和3D鸟瞰视角物体的挑战。现有两阶段方法仅将2D结果作为3D检测的一次性线索,未能充分利用两者的深度交互。

Result: 在nuScenes数据集上,该模型在2D和3D检测任务上均达到了最先进的性能(SOTA)。在Argoverse2数据集上,其长距离检测能力(最远150米)表现强劲。

Insight: 主要创新点包括:1)将2D与3D检测统一到端到端混合解码器架构中;2)通过动态查询分配和自适应查询聚合模块实现3D-2D-3D的循环精炼;3)引入查询组注意力机制用于多视角2D检测的组内通信;4)支持混合监督,降低对3D标注数据的依赖,具有实用价值。

Abstract: Simultaneous perception of 2D objects in perspective view and 3D objects in Bird’s Eye View (BEV) is challenging for multi-camera autonomous driving. Existing two-stage pipelines use 2D results only as a one-time cue for 3D detection. We propose SimPB++, which simultaneously detects 2D objects in perspective and 3D objects in BEV from multiple cameras. It unifies both tasks into an end-to-end model with a hybrid decoder architecture, coupling multi-view 2D and 3D decoders interactively. Two novel modules enable deep interaction: Dynamic Query Allocation adaptively assigns 2D queries to 3D candidates, and Adaptive Query Aggregation refines 3D representations using multi-view 2D features, forming a cyclic 3D-2D-3D refinement. For multi-view 2D detection, we use Query-group Attention for intra-group communication. We also design a Crop-and-Scale strategy for long-range perception and a Propagating Denoising strategy with an auxiliary RoI detector. SimPB++ supports mixed supervision with 2D-only and fully annotated data, reducing reliance on expensive 3D labels. Experiments show state-of-the-art performance on nuScenes for both tasks and strong long-range detection (up to 150m) on Argoverse2.


[122] CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models cs.CV | cs.GRPDF

Vladislav Pyatov, Gleb Bobrovskikh, Saveliy Galochkin, Nikita Boldyrev, Oleg Voynov

TL;DR: CADFS是一个以数据为中心的框架,旨在使大型视觉语言模型能够生成复杂的CAD设计历史。它通过引入基于FeatureScript的表示方法,并构建了一个包含45万个真实世界CAD模型、涵盖15种建模操作的数据集,解决了现有生成式CAD系统因简化表示和有限数据集而仅限于草图拉伸操作的问题。

Details

Motivation: 现有生成式CAD系统受限于简化的表示方法和有限的数据集,只能处理草图拉伸操作,无法生成复杂的CAD设计历史。CADFS旨在通过更丰富的表示和更大规模的数据集来突破这一限制,提升生成式CAD的复杂性和真实性。

Result: 在文本条件CAD生成和基于图像的重建任务上,使用该表示微调的视觉语言模型取得了最先进的结果,相比先前框架能生成更准确、多样且特征丰富的设计。消融实验表明,框架的每个组件(FeatureScript表示、扩展的操作集和表示对齐的文本描述)都显著提升了性能。

Insight: 核心创新在于提出了一个基于FeatureScript的、更丰富的CAD程序表示方法,并构建了大规模、高质量、多模态标注的真实世界CAD程序数据集。这为数据驱动的生成式CAD研究提供了关键基础设施,并证明了数据质量和表示对齐对模型性能的重要性。

Abstract: We introduce CADFS, a data-centric framework that enables large vision-language models to generate complex CAD design histories. Existing generative CAD systems are restricted to sketch-extrude operations due to simplified representations and limited datasets. We address this by introducing a FeatureScript-based representation and constructing a dataset of 450k real-world CAD models spanning 15 modeling operations. We obtain the dataset via a new pipeline that reconstructs clean, executable FeatureScript programs and provides multimodal annotations. Fine-tuning a VLM on this representation yields state-of-the-art results in text-conditioned CAD generation and image-based reconstruction, producing more accurate, diverse, and feature-rich designs than prior frameworks. Ablations show that each individual component of our framework, i.e., the FeatureScript representation, the extended operation set, and representation-aligned textual descriptions, significantly improves performance. Our framework substantially broadens the complexity and realism achievable in generative CAD. The CADFS framework and the new dataset are available at https://voyleg.github.io/cadfs/.


[123] Exploring Data-Free LoRA Transferability for Video Diffusion Models cs.CVPDF

Yuchen Wang, Wenliang Zhong, Lichen Bai, Zikai Zhou, Shitong Shao

TL;DR: 本文针对视频扩散模型(如步长蒸馏和因果蒸馏变体)中,现有LoRA适配器因权重空间不匹配而无法有效迁移的问题,提出了一种无数据的解决方案。通过分析发现,不兼容性源于奇异子空间内共享功能簇的谱干扰,并提出了基于谱密度的簇感知谱仲裁框架CASA,以动态协调目标流形保护与LoRA对齐恢复,从而有效缓解伪影并恢复LoRA功能。

Details

Motivation: 现有视频扩散模型变体(步长蒸馏、因果蒸馏)与预训练LoRA适配器之间存在权重空间不匹配,直接应用会导致风格退化和结构崩溃,但其内在机制尚不明确,需要一种无需原始训练数据的方法来解决LoRA的可迁移性问题。

Result: 大量实验表明,所提出的CASA框架能有效减轻伪影并恢复LoRA功能,在视频扩散模型上实现了LoRA的成功迁移。

Insight: 创新点在于从权重空间的谱分析角度揭示了不兼容性的根源(共享功能簇的谱干扰),并提出了数据无依赖的、基于谱密度动态仲裁的CASA框架,为模型适配器的跨变体迁移提供了新的理论视角和实用方法。

Abstract: Video diffusion models leveraging step distillation or causal distillation have achieved remarkable performance. However, adapting existing LoRAs to these variants remains a critical challenge due to weight space mismatches. We observe that direct application leads to style degradation and structural collapse, yet the underlying mechanisms remain poorly understood. To fill this gap, we delve into the weight space and identify that the incompatibility stems from spectral interference within shared functional clusters defined over singular subspaces. Specifically, our analysis reveals that while both paradigms respect spectral rigidity, they establish conflicting routing pathways that clash through constructive overload or destructive cancellation. To address this issue, we propose Cluster-Aware Spectral Arbitration (CASA), a data-free framework that dynamically arbitrates between safeguarding the target’s manifold and restoring LoRA alignment based on spectral density. Extensive experiments demonstrate that CASA effectively mitigates artifacts and revives LoRA functionality. Our code is available at https://github.com/Noahwangyuchen/CASA


[124] From Spherical to Gaussian: A Comparative Analysis of Point Cloud Cropping Strategies in Large-Scale 3D Environments cs.CVPDF

Maximilian Kellner, Dominik Merkle, Michael Brunklaus, Alexander Reiterer

TL;DR: 本文针对大规模3D点云因数据量过大而难以直接处理的问题,提出并比较了多种点云裁剪策略,包括指数、高斯和线性裁剪方法,以替代传统的球形裁剪。通过在不同室内外数据集上评估两种3D深度学习模型,研究发现改变裁剪策略能提升模型性能,特别是在大规模室外场景中取得了新的SOTA结果。

Details

Motivation: 传统球形裁剪方法在处理大规模3D点云时会丢失周围几何上下文信息,导致语义理解受限,因此需要探索能保持更多上下文且点数量可控的裁剪策略。

Result: 在多个室内外环境数据集上评估显示,新裁剪策略(特别是高斯裁剪)能显著提升模型性能,尤其是在大规模室外场景中达到了新的SOTA水平。

Insight: 创新点在于提出非球形裁剪方法(如高斯裁剪)以保留更丰富的几何上下文,从而改善3D点云语义分割性能;客观分析表明,裁剪策略的优化是提升大规模3D场景理解的有效且低成本的途径。

Abstract: Large-scale 3D point clouds can consist of billions of points. Even after downsampling, these point clouds are too large for modern 3D neural networks. In order to develop a semantic understanding of the scene, the point clouds are divided into smaller subclouds that can be processed. Typically, this division is done using spherical crops, resulting in a loss of surrounding geometric context. To address this issue, we propose alternative methods that produce subclouds with larger crop sizes while maintaining a similar number of points. Specifically, we compare exponential, Gaussian, and linear cropping methods with the spherical method. We evaluated two 3D deep learning model architectures using multiple indoor and outdoor environment datasets. Our results demonstrate that altering the cropping strategy can enhance model performance, especially for large-scale outdoor scenes, yielding new state-of-the-art results. Code is available at https://github.com/mvg-inatech/point_cloud_cropping


[125] Ultrasound Vision-Language Alignment via Contrastive Learning cs.CV | cs.LGPDF

Zhuoyang Lyu, Yiyang Zhang, Tongxin Wang, Ruirui Lan

TL;DR: 本文提出了EchoCare-CLIP,一种基于对比学习的超声图像与临床文本对齐框架,旨在解决超声基础模型在零样本和少样本任务中因缺乏任务特定标注而受限的问题。通过构建包含16K多器官图像-文本对的数据集,并比较不同文本编码器和字幕生成策略,研究发现对齐性能的提升并不总是保证下游任务表现,且模板生成的字幕与LLM生成的字幕效果相当。

Details

Motivation: 解决超声基础模型仅依赖视觉信息、在标注稀缺的新任务中零样本和少样本迁移能力有限的问题,通过对齐超声图像与临床文本来增强模型的泛化能力。

Result: 在跨模态对齐任务中,最佳配置达到0.682的对齐分数;在外部数据集BUSI和AULI上,基于CLIP的变体在零样本分类中分别达到0.709和0.626的分数,优于基线模型;线性探测和少样本适应性能因数据集而异,显示领域适应与表示泛化之间的权衡。

Insight: 创新点包括构建多器官超声图像-文本数据集,并比较不同编码器和字幕策略;客观分析表明,对齐性能提升不一定转化为下游任务改进,且模板字幕在质量上可与LLM生成字幕媲美,强调领域适应、编码器能力和字幕监督质量需平衡以实现稳健临床迁移。

Abstract: Ultrasound foundation models have achieved strong performance on structured prediction tasks but remain exclusively vision-based, limiting zero-shot and few-shot transfer to novel tasks where task-specific annotation is scarce. We address this gap with EchoCare-CLIP, a CLIP-style dual-encoder contrastive framework that aligns ultrasound images with clinical text in a shared embedding space. We curate a multi-organ corpus of over 16K image-text pairs spanning breast, liver, lung, and thyroid, with over 78% of captions derived from expert-annotated reports, and complement the remainder with a three-tier template-based and LLM-based caption generation pipeline. We evaluate model configurations spanning two text encoder families (CLIP, BioClinicalBERT) and two caption strategies (template-based, LLM-generated) against OpenAI CLIP and BiomedCLIP baselines. Our trained models consistently improve cross-modal alignment over baselines, with the best configuration achieving a paired alignment score of 0.682. However, stronger alignment does not guarantee better downstream performance: CLIP-based variants with partial fine-tuning achieve the strongest zero-shot classification on external held-out datasets (0.709 on BUSI; 0.626 on AULI), while full end-to-end fine-tuning degrades transfer due to overfitting. On linear probing and few-shot adaptation, model rankings are dataset-dependent, reflecting a trade-off between domain adaptation and representational generalizability. We further show that template-based captions match or outperform LLM-generated captions, suggesting lexical diversity is not a proxy for caption quality. Taken together, our results demonstrate that ultrasound vision-language alignment is achievable from public data alone, but robust clinical transfer requires careful balancing of domain adaptation, encoder capacity, and caption supervision quality.


[126] WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization cs.CV | cs.CLPDF

Wei Tao, Xiaoyang Qu, Peiqiang Wang, Guokuan Li, Jiguang Wan

TL;DR: 本文提出了一种名为WindowQuant的新方法,用于优化视频语言模型(VLMs)推理过程中的键值(KV)缓存。该方法基于窗口级相似性进行混合精度量化,包括窗口级量化搜索和窗口级KV缓存计算两个模块,旨在减少推理延迟和GPU内存使用,同时保持模型精度并提高硬件效率。

Details

Motivation: 视频语言模型(VLMs)的视觉令牌序列过长,导致推理延迟和GPU内存使用不可容忍。现有基于令牌粒度的KV缓存混合精度量化方法搜索过程耗时且在计算中硬件效率低下,因此需要一种更高效的方法来优化KV缓存。

Result: 大量实验表明,WindowQuant在多个数据集上优于最先进的VLM模型和KV缓存量化方法,实现了更高的效率和性能。

Insight: 创新点在于提出窗口级(而非令牌级)的混合精度量化策略,通过基于窗口相似性的快速搜索确定最优位宽配置,并通过量化前重排KV缓存窗口来避免硬件效率低下,从而在保持精度的同时显著提升推理速度和内存效率。

Abstract: Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of the VLM is too long, which may cause intolerant inference latency and GPU memory usage. Existing methods propose mixed-precision quantization to the key-value (KV) cache in VLMs based on token granularity, which is time-consuming in the search process and hardware inefficient during computation. This paper introduces a novel approach called WindowQuant, which employs window-adaptive mixed-precision quantization to optimize the KV cache. WindowQuant consists of two modules: window-level quantization search and window-level KV cache computation. Window-level quantization search quickly determines the optimal bit-width configuration of the KV cache windows based on the similarity scores between the corresponding visual token windows and the text prompt, maintaining the model accuracy. Furthermore, window-level KV cache computation reorders the KV cache windows before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization in inference computation. Extensive experiments demonstrate that WindowQuant outperforms state-of-the-art VLM models and KV cache quantization methods on various datasets.


[127] From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs cs.CVPDF

Le Zhang, Jihan Yang, Soundarya Krishnan, Jimit Majmudar, Xiou Ge

TL;DR: 本文提出了一个名为SFI-Bench的空间功能智能基准测试,用于评估多模态大语言模型(MLLMs)从几何感知到理解物体功能的高阶认知能力。该基准基于第一人称室内视频,包含1500多个专家标注的问题,系统性地评估结构化空间推理和功能推理两个维度。

Details

Motivation: 现有基准主要评估MLLMs的低层几何感知能力,但无法衡量实现具身智能所需的高阶认知能力,如理解物体功能及其在环境中的用途。本文旨在填补这一空白。

Result: 实验表明,当前的MLLMs在将空间记忆与功能推理及外部知识结合方面存在显著困难,揭示了实现具身智能的一个关键瓶颈。SFI-Bench作为一个诊断工具,用于衡量向更具认知能力的多模态智能体发展的进展。

Insight: 创新点在于提出了一个专注于空间-功能推理的综合性视频基准,将结构化空间推理(如理解复杂布局)与功能推理(如推断物体可供性和上下文效用)相结合,并通过条件计数、多跳关系推理、功能配对和基于知识的故障排除等任务直接挑战模型的感知、记忆和推理整合能力。

Abstract: Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate the geometric perception capabilities of multimodal large language models (MLLMs), they fall short of probing the higher-order cognitive abilities required for grounded intelligence. To address this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1,500 expert-annotated questions derived from diverse egocentric indoor video scans. SFI-Bench systematically evaluates two complementary dimensions of advanced reasoning: (1) Structured Spatial Reasoning, which requires understanding complex layouts and forming coherent spatial representations, and (2) Functional Reasoning, which involves inferring object affordances and their context-dependent utility. The benchmark includes tasks such as conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting, directly challenging models to integrate perception, memory, and inference. Our experiments reveal that current MLLMs consistently struggle to combine spatial memory with functional reasoning and external knowledge, highlighting a critical bottleneck in achieving grounded intelligence. SFI-Bench therefore provides a diagnostic tool for measuring progress toward more cognitively capable and truly grounded multimodal agents.


[128] Video Generation with Predictive Latents cs.CVPDF

Yian Zhao, Feng Wang, Qiushan Guo, Chang Liu, Xiangyang Ji

TL;DR: 本文提出了一种名为预测性视频变分自编码器(PV-VAE)的新方法,通过引入预测性重建目标,将预测学习与视频重建相结合,以提升视频生成质量。该方法在编码时随机丢弃未来帧,仅基于部分过去观测进行编码,并训练解码器同时重建观测帧和预测未来帧,从而鼓励潜在空间编码时间预测结构,增强对视频动态的连贯理解。

Details

Motivation: 现有视频VAE虽能实现良好的重建质量,但重建的持续优化未必能提升生成性能。如何增强视频潜在空间的可扩散性(diffusability)是一个关键且未解决的挑战。受预测性世界建模原理启发,本研究探索预测学习在改进视频生成建模中的潜力。

Result: 在UCF101数据集上,PV-VAE相比Wan2.2 VAE实现了52%的更快收敛和34.42的FVD(Fréchet Video Distance)提升,表现出优越的视频生成性能。综合分析表明,PV-VAE具有良好的可扩展性(生成性能随VAE训练而提升),并在下游视频理解任务中带来一致增益。

Insight: 核心创新点在于提出了一个简单有效的预测性重建目标,将预测学习统一到视频重建框架中。这促使潜在空间编码时间预测结构和运动先验,从而捕获更好的时间连贯性,不仅提升了生成质量,也增强了潜在空间对下游任务的有用性。

Abstract: Video Variational Autoencoder (VAE) enables latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimization of reconstruction does not necessarily translate into improved generative performance. How to enhance the diffusability of video latents remains a critical and unresolved challenge. In this work, inspired by principles of predictive world modeling, we investigate the potential of predictive learning to improve the video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only partial past observations, while training the decoder to reconstruct the observed frames and predict future ones simultaneously. This design encourages the latent space to encode temporally predictive structures and build a more coherent understanding of video dynamics, thereby improving generation quality. Our model, termed Predictive Video VAE (PV-VAE), achieves superior performance on video generation, with 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101. Furthermore, comprehensive analyses demonstrate that PV-VAE not only exhibits favorable scalability, with generative performance improving alongside VAE training, but also yields consistent gains in downstream video understanding, underscoring a latent space that effectively captures temporal coherence and motion priors.


[129] FLoRA: Fusion-Latent for Optical Reconstruction and Flood Area Segmentation via Cross-Modal Multi-Task Distillation Network cs.CV | cs.AIPDF

Jagrati Talreja, Tewodros Syum Gebre, Leila Hashemi-Beni

TL;DR: FLoRA是一个用于洪水灾害管理的跨模态多任务框架,它通过融合光学和SAR数据的互补优势,从Sentinel-1 SAR数据中联合重建高保真光学图像并分割洪水区域。该框架利用一个轻量级光学教师模型提供的金字塔特征,通过多尺度窗口交叉注意力和FiLM条件化,将SAR表征引导至一个融合潜在空间,从而实现SAR到光学图像的精细重建和洪水区域分割两个互补任务。

Details

Motivation: 当前洪水制图方法未能充分利用星载影像的潜力。光学数据可解释性高但受环境条件限制,SAR数据能全天候覆盖但视觉可解释性较低。论文旨在通过融合这两种模态的互补优势,解决从SAR数据中准确重建光学图像和分割洪水区域的挑战。

Result: 在SEN1FLOODS11、DEEPFLOOD和SEN12MS数据集上的评估表明,FLoRA在PSNR、SSIM和LPIPS指标上超越了融合基线方法,证明了其在生成语义忠实且物理一致的洪水情报方面的有效性。

Insight: 创新点在于提出了一个教师引导的跨模态多任务蒸馏网络,通过多尺度窗口交叉注意力、FiLM条件化和门控残差机制,在融合潜在空间中实现互补任务(图像重建与分割)的联合优化。该方法将特征蒸馏约束与结构保真度、频谱真实性和水文感知的边缘对齐损失相结合,提升了多模态融合的性能。

Abstract: Accurate flood water mapping is critical for disaster management, yet current methods struggle to fully exploit the potential of spaceborne imagery. Optical data offers high interpretability but is limited by environmental conditions, whereas SAR provides reliable all-weather coverage with reduced visual interpretability. FLoRA (Fusion Latent for Optical Reconstruction and Area Segmentation) is a cross-modal multi-task framework that jointly reconstructs high-fidelity optical imagery and segments flood water regions from Sentinel 1 SAR by fusing the complementary strengths of optical and SAR data. During training, a lightweight optical teacher (driven by RGB and NDVI priors) provides pyramidal features that guide SAR representations into a fusion latent space via multiscale windowed cross attention and FiLM conditioning, with gated residuals preventing overcorrection. This design enables multi-task learning across two complementary objectives: (a) SAR-to-optical translation for fine-grained RGB reconstruction and (b) flood water region segmentation for hydrologic interpretation. The dual decoders are optimized using Charbonnier SSIM for structural fidelity, edge FFT magnitude losses for spectral realism, and Dice BCE hydrology-aware edge alignment for precise flood water delineation. A feature distillation constraint further aligns fused SAR features with the optical teacher’s manifold. Evaluations on SEN1FLOODS11, DEEPFLOOD, and SEN12MS demonstrate that FLoRA surpasses fusion baselines in PSNR, SSIM, and LPIPS, demonstrating that multi-modal fusion within a teacher-guided latent space yields semantically faithful and physically consistent flood-water intelligence from spaceborne observations.


[130] PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature cs.CV | cs.CLPDF

Verena Jasmin Hallitschke, Carsten Eickhoff, Philipp Berens

TL;DR: 本文介绍了PubMed-Ophtha,一个从PubMed Central开放获取文章中构建的大规模眼科图像-文本数据集,包含超过10万对图像-标题数据。该数据集通过从PDF中高分辨率提取图像、分解图面板、标注成像模态和标记状态,并利用LLM方法生成面板级标题,旨在支持眼科视觉-语言模型的训练。

Details

Motivation: 当前眼科视觉-语言模型的发展受限于大规模、高质量图像-文本数据集的稀缺,因此本文旨在构建一个开放资源来解决这一数据瓶颈问题。

Result: 在数据构建过程中,面板和图像检测模型的mAP@0.50分别达到0.909和0.892,图像提取的中位IoU为0.997,面板级标题分割在人工标注数据上达到平均句子BLEU分数0.913。

Insight: 创新点在于直接从PDF高分辨率提取并分解图像为面板和单个图像,结合成像模态和标记状态标注,以及采用两步LLM方法生成面板级标题,为眼科领域提供了结构化的高质量数据集和完整的生成流程以支持可复现性。

Abstract: Vision-language models hold considerable promise for ophthalmology, but their development depends on large-scale, high-quality image-text datasets that remain scarce. We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Unlike existing datasets, figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images. Each image is annotated with its imaging modality – color fundus photography, optical coherence tomography, retinal imaging, or other – and a mark status indicating the presence of annotation marks such as arrows. Figure captions are split into panel-level subcaptions using a two-step LLM approach, achieving a mean average sentence BLEU score of 0.913 on human-annotated data. Panel and image detection models reach a mAP@0.50 of 0.909 and 0.892, respectively, and figure extraction achieves a median IoU of 0.997. To support reproducibility, we additionally release the human-annotated ground-truth data, all trained models, and the full dataset generation pipeline.


[131] Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score cs.CV | cs.LGPDF

Abdullah Ahmad Khan, Hamid Laga, Ferdous Sohel

TL;DR: 本文首次系统研究了多模态机器学习中遗忘任务评估指标的可靠性问题,发现现有五个常用指标(FA、RA、MIA、AD、JS)在三个VQA基准测试上对遗忘方法排名存在显著冲突,并揭示了指标间形成两个对立集群的现象。研究进一步表明多模态任务中指标不一致性比单模态分类更严重,为此提出了一个基于指标与理想模型距离相关性的统一质量评分(UQS),以提供更稳定的评估。

Details

Motivation: 为满足GDPR等数据隐私法规要求,视觉语言模型需具备遗忘特定数据的能力,但当前多模态遗忘任务的评估实践缺乏一致性,不同评估指标对遗忘方法的排名结果相互矛盾,阻碍了该领域的可靠进展。

Result: 在LLaVA-1.5-7B和BLIP-2 OPT-2.7B模型上的实验表明,五个标准指标在三个VQA基准(MLLMU-Bench, UnLOK-VQA, MMUBench)上排名冲突,肯德尔τ分析揭示指标分为{FA, RA, MIA}和{AD, JS}两个对立集群(τ_FA_AD = -0.26)。多模态VQA任务中指标间平均一致性(τ = 0.086)显著低于单模态分类任务(τ = 0.158)。提出的UQS指标在100次随机权重扰动下表现出稳定的排名一致性(τ = 0.647 ± 0.262)。

Insight: 论文的核心创新在于首次系统揭示了多模态遗忘评估中存在的指标不可靠性问题,并定量证明了多模态路径会放大这种不一致性。提出的UQS通过基于指标与理想遗忘模型距离的Spearman相关性来加权融合各指标,为领域提供了一个原则性的统一评估框架,其方法论对机器学习其他需要多指标评估的领域也具有借鉴意义。

Abstract: Machine unlearning in Vision-Language Models (VLMs) is required for compliance with the General Data Protection Regulation (GDPR), yet current evaluation practices are inconsistent. We present the first systematic study of metric reliability in multimodal unlearning. Five standard metrics, Forget Accuracy (FA), Retain Accuracy (RA), Membership Inference Attack (MIA), Activation Distance (AD), and JS divergence (JS), yield conflicting method rankings across three VQA benchmarks (MLLMU-Bench, UnLOK-VQA, MMUBench). Kendall tau analysis over 36 unlearned LLaVA-1.5-7B models reveals two opposing clusters, {FA, RA, MIA} and {AD, JS}, with tau_FA_AD = -0.26, reproduced on BLIP-2 OPT-2.7B. Agreement is lower in multimodal VQA (average tau = 0.086) than in unimodal classification (average tau = 0.158; difference = 0.072), indicating that dual image-and-text pathways amplify inconsistency. We introduce the Unified Quality Score (UQS), a composite metric with weights derived from each metric’s Spearman correlation with the oracle distance d(M_hat, M_star), where M_star is the oracle model retrained only on the retain set. RA shows the strongest reliability (rho = 0.484, p = 0.003), while FA is negatively correlated (rho = -0.418, p = 0.011). UQS yields stable rankings under 100 random weight perturbations (tau = 0.647 +- 0.262). We release the benchmark, 36 checkpoints, and an interactive leaderboard. Code and pre-computed results are available at https://github.com/neurips26/UnifiedUnl.


[132] MultiSense-Pneumo: A Multimodal Learning Framework for Pneumonia Screening in Resource-Constrained Settings cs.CV | cs.AI | cs.LGPDF

Dineth Jayakody, Pasindu Thenahandi, Chameli Dommanige

TL;DR: 本文提出了MultiSense-Pneumo,一个用于资源受限环境下肺炎筛查的多模态学习框架。该框架整合了结构化症状描述、咳嗽音频、口语和胸部X光片四种模态,通过确定性症状分诊、基于LightGBM的声学分类、基于ResNet 18的域对抗X光分析、基于Transformer的语音识别以及可解释的多模态融合算子,将各模态转化为归一化的风险信号并聚合为统一的筛查估计。该系统设计用于在有限计算资源下离线部署,适合社区卫生工作者和农村诊所使用。

Details

Motivation: 肺炎是全球发病和死亡的主要原因,尤其在资源匮乏地区,获取影像、实验室检测和专科护理受限。现有计算方法多为单模态(主要依赖X光片),而临床评估本质上是多模态的(依赖症状、呼吸模式和胸部影像等异质证据),因此需要开发一个整合多种证据、适合资源受限环境的多模态筛查框架。

Result: 实验结果表明,该框架的X光分析路径在领域偏移下表现出鲁棒性,但也凸显了声学信号在少数类召回方面的局限性。论文未明确提及在特定基准测试上达到SOTA水平,而是将其定位为一个研究原型,用于筛查和分诊支持,而非临床验证的诊断系统。

Insight: 创新点在于提出了一个整合症状、音频、语音和X光的多模态框架,并设计了可解释的融合算子将各模态转化为归一化风险信号进行聚合。从客观角度看,其模块化、可离线部署的设计以及对计算资源受限场景的针对性考虑,为在真实世界、资源有限环境中部署AI辅助医疗系统提供了有价值的架构参考。

Abstract: Pneumonia remains a leading global cause of morbidity and mortality, particularly in low resource settings where access to imaging, laboratory testing, and specialist care is limited. Clinical assessment relies on heterogeneous evidence, including symptoms, respiratory patterns, and chest imaging, making screening inherently multimodal. However, many existing computational approaches remain unimodal and focus primarily on radiographs. In this work, we present MultiSense-Pneumo, a multimodal framework for pneumonia oriented screening and triage support that integrates structured symptom descriptors, cough audio, spoken language, and chest radiographs. The system combines deterministic symptom triage, LightGBM based acoustic classification, domain adversarial radiograph analysis using ResNet 18, transformer based speech recognition, and an interpretable multimodal fusion operator. Each modality is transformed into a normalized risk signal and aggregated into a unified screening estimate, enabling transparent and modular decision support. MultiSense-Pneumo is designed for real world deployment under modest computational constraints and can operate fully offline on standard laptop class hardware, making it suitable for community health workers, rural clinics, and emergency response settings. Experimental results demonstrate robustness of the radiograph pathway under domain shifts, while highlighting limitations in minority class recall for acoustic signals. MultiSense-Pneumo is intended as a research prototype for screening and triage support rather than a clinically validated diagnostic system.


[133] InfiltrNet: Dual-Branch CNN-Transformer Architecture for Brain Tumor Infiltration Risk Prediction cs.CV | cs.LGPDF

S M Asif Hossain, Shruti Kshirsagar

TL;DR: 本文提出了一种名为InfiltrNet的新型双分支架构,用于从多模态MRI预测脑胶质瘤的三区浸润风险图。该架构通过交叉注意力融合模块,将CNN编码器与Swin Transformer编码器相结合,并采用基于距离变换的标签生成策略从标准BraTS标注中推导可复现的浸润风险区。模型训练结合了Dice-CrossEntropy损失和边界感知损失,并在中间解码器层增加了辅助监督头。

Details

Motivation: 现有深度学习方法主要关注可见肿瘤的分割,而未能有效估计周围组织的浸润风险,这对于手术规划和放射治疗至关重要。本文旨在解决从MRI预测肿瘤浸润空间范围这一关键问题。

Result: 在BraTS 2020和BraTS 2025数据集上的广泛实验表明,InfiltrNet在预测浸润风险方面优于五个已建立的基线模型。

Insight: 主要创新点包括:1)结合CNN与Transformer优势的双分支架构及交叉注意力融合;2)从标准分割标注生成浸润风险标签的策略;3)结合多种损失函数和辅助监督的训练方案。模型的可解释性分析也证实其关注了临床相关的瘤周区域。

Abstract: Gliomas are aggressive brain tumors that infiltrate surrounding tissue beyond the visible tumor margins observed on Magnetic Resonance Imaging (MRI). Predicting the spatial extent of this infiltration is essential for surgical planning and radiation therapy, yet existing deep learning approaches focus on segmenting the visible tumor rather than estimating infiltration risk in the surrounding tissue. This paper presents InfiltrNet, a novel dual-branch architecture that combines a convolutional neural network (CNN) encoder with a Swin Transformer encoder through cross-attention fusion modules to predict three-zone infiltration risk maps from multimodal MRI. A label generation strategy based on distance transforms is proposed to derive reproducible infiltration risk zones from standard Brain Tumor Segmentation (BraTS) annotations. InfiltrNet is trained with a combined Dice-CrossEntropy and boundary-aware loss augmented by auxiliary supervision heads at intermediate decoder levels. Extensive experiments on BraTS 2020 and BraTS 2025 demonstrate that InfiltrNet outperforms five established baselines. Explainability analysis using GradCAM++ and Occlusion sensitivity confirms that the model attends to clinically relevant peritumoral regions.


[134] SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters cs.CVPDF

Yagiz Nalcakan, Hyeongjin Ju, Incheol Park, Sanghyeop Yeo, Youngwan Jin

TL;DR: SpectraDINO是一个多光谱视觉基础模型,旨在弥合RGB预训练模型与近红外、短波红外和长波红外等多光谱模态之间的领域鸿沟。它通过在冻结的DINOv2 ViT骨干网络上为每个模态添加轻量级瓶颈适配器,并采用多阶段师生训练协议进行知识蒸馏和对齐,从而将强大的RGB表示能力扩展到不可见光谱。

Details

Motivation: 解决在大规模RGB数据上预训练的视觉基础模型难以直接应用于提供互补感知能力的多光谱成像模态(如NIR、SWIR、LWIR)的问题,这些模态与RGB存在根本性的领域差距。

Result: 在具有挑战性的NIR、SWIR和LWIR基准测试上进行多光谱目标检测和语义分割评估,使用广泛采用的融合策略,SpectraDINO在大多数基准测试中达到了最先进的性能。

Insight: 创新点在于使用轻量级的、每个模态独立的瓶颈适配器来扩展冻结的RGB骨干网络,避免了从头训练;并提出了一种多阶段师生训练协议,结合余弦蒸馏、对称对比损失、块级对齐和一种新颖的邻域结构保持损失,实现了强跨模态对齐且不灾难性遗忘RGB先验知识。这为构建通用多光谱骨干网络提供了一种高效、可扩展的适配器范式。

Abstract: Vision Foundation Models (VFMs) pretrained on large-scale RGB data have demonstrated remarkable representation quality, yet their applicability to multispectral imaging spanning Near-Infrared (NIR), Short-Wave Infrared (SWIR), and Long-Wave Infrared (LWIR) remains largely unexplored. These spectral modalities offer complementary sensing capabilities critical for robust perception in adverse conditions, but present a fundamental domain gap relative to RGB-centric pretrained models. We present SpectraDINO, a multispectral VFM that bridges this spectral gap by extending DINOv2 ViT backbones to beyond-visible modalities through lightweight, per-modality bottleneck adapters, while preserving the rich representations of the frozen RGB backbone. We introduce a multi-stage teacher-student training protocol in which a frozen DINOv2 teacher guides a spectral student via cosine distillation, symmetric contrastive loss, patch-level alignment, and a novel neighborhood-structure-preservation loss. This staged curriculum enables strong cross-modal alignment without catastrophic forgetting of RGB priors. We evaluate SpectraDINO on multispectral object detection and semantic segmentation across challenging NIR, SWIR, and LWIR benchmarks using widely adopted fusion strategies. SpectraDINO achieves state-of-the-art performance across most benchmarks, validating its effectiveness as a general-purpose backbone for spectral generalization. The code and weights for model variants are available at https://github.com/Yonsei-STL/SpectraDINO.


[135] Rethinking Electro-Optical Vision Foundation Models for Remote Sensing Retrieval: A Controlled Comparison with Generalist VFM cs.CV | cs.AIPDF

Hyobin Park, Minseok Seo, Dong-Geol Choi

TL;DR: 本文对遥感图像检索任务中的专用与通用视觉基础模型进行了控制变量比较,发现通用视觉基础模型在检索性能上与专用模型相当甚至更优,且在跨场景泛化中表现更稳定。

Details

Motivation: 遥感领域数据获取成本高且标注依赖专家知识,视觉基础模型利用大规模无标签数据的优势在该领域尤为重要;然而,现有遥感专用视觉基础模型是否比通用视觉基础模型在检索任务中更有效尚不明确。

Result: 在相同数据集、检索协议和评估指标下,通用视觉基础模型与现有遥感专用模型竞争,有时表现更优;专用模型在跨场景评估中性能显著下降,而通用模型转移更稳定。

Insight: 论文创新点在于通过控制实验揭示了遥感专用预训练未必能带来更强的检索导向表示;客观分析认为,未来遥感视觉基础模型需更好地利用遥感图像的物理、空间、光谱和地理特性,以克服当前预训练策略的局限。

Abstract: Vision foundation models have attracted significant attention for their ability to leverage large-scale unlabeled visual data. This advantage is particularly important in remote sensing, where data acquisition is costly and annotation often requires expert knowledge. Recent electro-optical vision foundation models aim to learn domain-specific representations from remote sensing imagery, but it remains unclear whether they are more effective than strong generalist vision foundation models under retrieval-based evaluation. In this study, we conduct a controlled comparison between representative EO-specific and generalist vision foundation models for remote sensing image retrieval. Using the same datasets, retrieval protocol, and evaluation metric, we evaluate both in-domain performance and cross-scene generalization. Our results show that strong generalist vision foundation models are competitive with, and in some cases outperform, existing EO-specific models. Moreover, EO-specific models often suffer from substantial degradation under cross-scene evaluation, while generalist models show more stable transfer. These findings suggest that EO pretraining alone does not guarantee stronger retrieval-oriented remote sensing representations. We discuss the limitations of current EO-specific pretraining strategies and highlight the need for future EO vision foundation models to better exploit the physical, spatial, spectral, and geographic characteristics of remote sensing imagery.


[136] A Hybrid Approach for Closing the Sim2real Appearance Gap in Game Engine Synthetic Datasets cs.CVPDF

Stefanos Pasios

TL;DR: 本文提出了一种混合方法,通过结合扩散模型(FLUX.2-4B Klein)和传统图像到图像转换模型(REGEN)的优势,来缩小游戏引擎生成的合成数据集与真实图像之间的外观差距,从而提升合成数据集的照片真实感。

Details

Motivation: 尽管现代游戏引擎(如采用光线追踪技术)的视觉保真度已显著提高,但合成图像与真实图像之间仍存在明显的sim2real外观差距,这限制了合成数据集在现实世界计算机视觉应用中的有效利用。

Result: 实验表明,REGEN模型在性能上优于FLUX.2-4B Klein扩散模型,而将两者结合的混合方法能够比单独使用任一模型获得更好的视觉真实感,同时保持语义一致性。

Insight: 论文的创新点在于提出了一种混合方法,将基于扩散模型的强大几何与材质变换能力,与图像到图像转换技术的分布匹配能力相结合,以协同提升合成数据的真实感,这为利用生成模型改进合成数据质量提供了新思路。

Abstract: Video game engines have been an important source for generating large volumes of visual synthetic datasets for training and evaluating computer vision algorithms that are to be deployed in the real world. While the visual fidelity of modern game engines has been significantly improved with technologies such as ray-tracing, a notable sim2real appearance gap between the synthetic and the real-world images still remains, which limits the utilization of synthetic datasets in real-world applications. In this letter, we investigate the ability of a state-of-the-art image generation and editing diffusion model (FLUX.2-4B Klein) to enhance the photorealism of synthetic datasets and compare its performance against a traditional image-to-image translation model (REGEN). Furthermore, we propose a hybrid approach that combines the strong geometry and material transformations of diffusion-based methods with the distribution-matching capabilities of image-to-image translation techniques. Through experiments, it is demonstrated that REGEN outperforms FLUX.2-4B Klein and that by combining both FLUX.2-4B Klein and REGEN models, better visual realism can be achieved compared to using each model individually, while maintaining semantic consistency. The code is available at: https://github.com/stefanos50/Hybrid-Sim2Real


[137] Graph-Augmented Topological Internalization with Dual-Stream Classifiers for Medical Report Generation cs.CVPDF

Moyu Tang, Chupei Tang, Junxiao Kong, Di Wang, Tianchi Lu

TL;DR: 本文提出了一种图增强双流医学报告生成框架GDMRG,通过拓扑知识内化模块TKI利用图卷积网络GCN显式建模疾病共现先验,并构建双流分类系统:主分支在拓扑约束下生成离散诊断提示,辅助分支采用非对称优化策略动态校准高度不平衡样本的决策边界。同时设计了诊断引导的空间注意力DGSA,利用临床语义重新校准视觉编码器以缓解特征幻觉。在MIMIC-CXR数据集上验证了其临床有效性和语言流畅性,并在IU X-Ray数据集上展示了零样本泛化能力。

Details

Motivation: 解决现有医学报告生成方法将胸部异常视为孤立分类目标,忽略疾病共现性且难以将医学拓扑结构转化为显式数据关联,从而限制模型对复杂或细微病变的推理能力的问题。

Result: 在MIMIC-CXR数据集上实现了有竞争力的临床有效性(CE)并保持了自然语言流畅性;在IU X-Ray数据集上展示了鲁棒的零样本泛化能力。

Insight: 创新点包括:1)通过图卷积网络显式参数化疾病共现先验的拓扑知识内化模块,实现无需外部检索的知识注入;2)双流分类系统结合拓扑约束与非对称优化,处理样本不平衡;3)诊断引导的空间注意力建立诊断与视觉定位的逻辑闭环,缓解特征幻觉。该框架提供了一个集成且可解释的医学报告生成范式。

Abstract: Automated medical report generation, MRG, holds substantial value for alleviating radiologist workload and enhancing diagnostic efficiency. However, mainstream approaches typically treat diverse chest abnormalities as isolated classification targets. This paradigm often overlooks inherent disease co-occurrences and struggles to translate medical topological structures into explicit data correlations, constraining the model’s reasoning capacity on complex or subtle lesions. To address this, we propose a Graph-Augmented Dual-Stream Medical Report Generation with Topological Internalization, GDMRG. Our framework introduces a Topological Knowledge Internalization module, TKI, which leverages a Graph Convolutional Network, GCN, to generate an explicit parameterized weight matrix based on global disease co-occurrence priors. This facilitates efficient topological knowledge injection without relying on external retrieval mechanisms. Building upon this, we construct a dual-stream classification system: the main branch generates discrete diagnostic prompts under topological constraints, while the auxiliary branch employs an asymmetric optimization strategy to dynamically calibrate decision boundaries for highly imbalanced samples. Concurrently, to establish a logical closed loop between diagnosis and visual grounding, we design a diagnostic-driven Diagnosis-Guided Spatial Attention, DGSA, that utilizes high-dimensional clinical semantics to recalibrate the visual encoder, mitigating feature hallucinations. Comprehensive experiments on the MIMIC-CXR dataset demonstrate that GDMRG achieves competitive clinical efficacy, CE, while maintaining natural language fluency. Furthermore, our model exhibits robust zero-shot generalization on the IU X-Ray dataset. In summary, this work presents an integrated and interpretable paradigm for medical report generation.


[138] Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning cs.CV | cs.AIPDF

Haoyu Wang, Haonan Wang, Yuyan Chen, Jun Chen, Gang Liu

TL;DR: 本文提出了一种通过归纳-演绎推理增强多模态上下文学习(ICL)的框架。该框架旨在解决视觉语言模型(VLMs)在ICL中存在的归纳鸿沟问题,具体通过视觉令牌压缩、动态注意力再平衡和思维链引导来重构多模态ICL过程,并结合监督微调与强化学习进行训练。

Details

Motivation: 动机在于解决视觉语言模型在多模态上下文学习中存在的根本性限制:模型常基于错误推理得出正确答案,难以从示例中提取一致规则,且受到冗余视觉令牌和注意力分布偏向初始图像等视觉层面障碍的加剧。

Result: 在涵盖视觉感知、逻辑推理、STEM问题和讽刺检测的八个基准测试中,该方法相对于标准ICL基线在多个开源VLMs上取得了一致且显著的改进。

Insight: 创新点在于将多模态ICL重构为原则性的归纳-演绎过程,具体包括基于相似性的视觉令牌压缩模块、动态注意力再平衡机制、显式思维链引导范式,以及结合可验证奖励的监督微调与强化学习辅助训练流程,旨在赋予模型真正的多模态归纳能力。

Abstract: In-context learning (ICL) allows large models to adapt to tasks using a few examples, yet its extension to vision-language models (VLMs) remains fragile. Our analysis reveals that the fundamental limitation lies in an inductive gap, models often produce correct answers from flawed reasoning, while struggling to extract consistent rules across demonstrations. This gap is further exacerbated by two visual-level obstacles: an overwhelming proportion of redundant visual tokens that obscure textual cues, and a skewed attention distribution that favors the initial image at the expense of subsequent context. To address these issues, we introduce a framework that restructures multimodal ICL as a principled inductive-deductive process. The framework incorporates a similarity-based visual token compression module to filter out redundant patches, a dynamic attention rebalancing mechanism to distribute focus equitably across all images, and a chain-of-thought paradigm that explicitly guides the model to analyze individual examples, derive a generalizable rule, and then apply it to the query. An auxiliary learning pipeline combines supervised fine-tuning with reinforcement learning using verifiable rewards to reinforce faithful citation and noise filtering. Evaluations across eight benchmarks covering visual perception, logical reasoning, STEM problems, and sarcasm detection demonstrate consistent and significant improvements over standard ICL baselines for multiple open-source VLMs, highlighting the potential of equipping models with genuine inductive capabilities in multimodal settings.


[139] FEAT: Fashion Editing and Try-On from Any Design cs.CV | cs.AIPDF

Soye Kwon, Keonyoung Lee, Dahuin Jung, Jaekoo Lee

TL;DR: FEAT是一种支持从任意设计源(包括艺术品、抽象图像和自然照片)进行服装编辑和虚拟试穿的方法,通过解耦双注入机制处理服装和非服装设计输入,并利用正交引导噪声融合实现无训练残留服装移除和区域特定噪声策略,从而支持完整服装和配饰的编辑与试穿。

Details

Motivation: 现有方法局限于服装相关图像作为设计源,无法利用艺术品、抽象图像等创意设计,且不支持包括配饰在内的完整服装编辑与试穿,因此需要一种更灵活的方法来扩展设计来源和功能。

Result: 在广泛实验中,FEAT在设计灵活性、提示一致性和视觉真实感方面达到了最先进的性能(SOTA)。

Insight: 创新点包括解耦双注入(DDI)机制,通过内容和风格解耦选择性注入设计线索,以及正交引导噪声融合(OGNF)这一无训练机制,利用正交投影移除残留服装并应用区域特定噪声策略,实现了对服装和配饰的虚拟试穿支持。

Abstract: Fashion design aims to express a designer’s creative intent and to depict how garments interact with the human body. Recent methods condition on multimodal inputs to support garment editing and virtual try-on. However, existing methods still (i) confine design to garment-related images, excluding creative design sources such as artwork, abstract imagery, and natural photographs, and (ii) cannot support complete outfits, including accessories. We present FEAT (Fashion Editing And Try-On from Any Design), a method that enables editing and try-on across garments and accessories using diverse design sources. To achieve this, we introduce Disentangled Dual Injection (DDI). It takes both apparel and non-apparel design sources and selectively injects design cues via content and style disentanglement. Furthermore, we propose Orthogonal-Guided Noise Fusion (OGNF), a training-free mechanism that removes residual garments via orthogonal projection and applies region-specific noise strategies to enable virtual try-on for both garments and accessories. Extensive experiments demonstrate that FEAT achieves state-of-the-art performance in design flexibility, prompt consistency, and visual realism.


[140] Multi-Rater Calibrated Segmentation Models cs.CVPDF

Meritxell Riera-Marín, Javier García López, Júlia Rodríguez-Comas, Miguel A. González Ballester, Adrian Galdran

TL;DR: 本文提出了一种通过将多专家标注视为有序学习问题来改进医学图像分割模型概率校准的方法,该方法利用体素级标注者一致性作为有序目标,将预测置信度与训练数据中的经验变异性联系起来,从而在不降低分割准确性的情况下显著提升模型校准性能。

Details

Motivation: 医学图像分割模型在临床决策中需要准确的概率估计,但现有深度分割网络校准效果差,且多专家标注存在显著分歧,传统方法将标注者间变异性视为噪声,而本文旨在利用这种内在标注模糊性来改进模型置信度。

Result: 在涵盖眼科、组织病理学和胸部影像的四个公开分割基准测试中,使用多专家扩展预期校准误差进行评估,结果表明有序感知训练能显著提升模型校准性能,且不降低分割准确性。

Insight: 创新点在于将多专家标注重新定义为有序学习问题,通过结合有序感知评分规则(如排名概率得分有序损失)和标准二值目标,实现了架构无关的、更可靠的概率分割模型校准方法。

Abstract: Objective: Accurate probability estimates are essential for the safe deployment of medical image segmentation models in clinical decision-making. However, modern deep segmentation networks are often poorly calibrated, a problem exacerbated when multiple expert annotations exhibit substantial disagreement. While inter-rater variability is typically treated as noise, it provides valuable information about intrinsic annotation ambiguity that must be reflected in model confidence. Methods: We improve the probabilistic calibration of medical image segmentation models by reformulating multi-rater supervision as an ordinal learning problem. Voxel-wise annotator agreement is treated as an ordered target, linking predictive confidence to the empirical variability in training data. This formulation allows the use of ordinal-aware scoring rules, such as the Ranked Probability Score ordinal loss, combined with a standard binary objective to preserve discriminative performance. Results: We evaluated the proposed approach across four public segmentation benchmarks spanning ophthalmology, histopathology, and thoracic imaging. Calibration was assessed using a multi-rater extension of expected calibration error. Results consistently show that ordinal-aware training yields substantially improved calibration with respect to inter-rater agreement without degrading segmentation accuracy. Conclusions: Treating multi-rater annotations as ordered information provides a principled and architecture-agnostic route to more reliable probabilistic segmentation models.


[141] Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection cs.CV | cs.LGPDF

Fuyun Wang, Yuanzhi Wang, Xu Guo, Sujia Huang, Tong Zhang

TL;DR: 本文提出混合原型流匹配(MPFM)框架,用于解决开放集监督异常检测(OSAD)问题。该方法通过将正常特征分布连续映射到结构化高斯混合原型空间,克服了现有基于原型方法因单模态高斯先验导致的决策边界模糊问题,并引入互信息最大化正则化器(MIMR)增强正常与异常样本的可分离性。

Details

Motivation: 现有基于原型的开放集监督异常检测方法通常采用单模态高斯先验建模正常数据,无法捕捉其内在多模态特性,导致决策边界模糊,限制了检测性能。

Result: 在单异常和多异常设置下的多个基准测试中,MPFM均取得了最先进的性能(SOTA)。

Insight: 创新点在于将流匹配中的速度场显式建模为高斯混合先验,每个分量对应不同的正常类别,实现了模式感知和语义一致的分布传输;同时引入MIMR防止原型坍缩并最大化正常-异常可分性,为多模态正常数据建模提供了新思路。

Abstract: Open-set supervised anomaly detection (OSAD) aims to identify unseen anomalies using limited anomalous supervision. However, existing prototype-based methods typically model normal data via a unimodal Gaussian prior, failing to capture inherent multi-modality and resulting in blurred decision boundaries. To address this, we propose Mixture Prototype Flow Matching (MPFM), a framework that learns a continuous transformation from normal feature distributions to a structured Gaussian mixture prototype space. Departing from traditional flow-based approaches that rely on a single velocity vector, MPFM explicitly models the velocity field as a Gaussian mixture prior where each component corresponds to a distinct normal class. This design facilitates mode-aware and semantically coherent distribution transport. Furthermore, we introduce a Mutual Information Maximization Regularizer (MIMR) to prevent prototype collapse and maximize normal-anomaly separability. Extensive experiments demonstrate that MPFM achieves state-of-the-art performance across diverse benchmarks under both single- and multi-anomaly settings.


[142] StableMind: Source-Free Cross-Subject fMRI Decoding with Regularized Adaptation cs.CVPDF

Jintao Guo, Lin Wang, Shumeng Li, Jian Zhang, Yulin Zhou

TL;DR: 本文提出StableMind,一种用于跨被试fMRI解码的正则化适应框架,旨在解决源数据不可访问且新被试数据有限时性能下降的问题。通过稳定大脑表征和提升图像监督可靠性,在自然场景数据集上实现了优于现有方法的图像检索和大脑检索准确率。

Details

Motivation: 现有跨被试fMRI解码方法通常依赖大量配对fMRI-图像数据进行新被试适应,但在实际场景中,新被试数据有限且历史原始数据不可访问,导致适应性能下降。本文识别出性能下降源于大脑侧表征不稳定和图像侧监督不可靠两个关键问题。

Result: 在自然场景数据集上,采用统一的1小时适应协议,StableMind在四个被试上平均达到84.02%的图像检索准确率和81.66%的大脑检索准确率,大脑检索准确率比现有最优方法提升5.71%,且使用更少的可训练适应参数。

Insight: 创新点包括:1)利用预训练模型的岭投影作为适应先验来约束有限数据下的新被试适应,并结合基于傅里叶的特征级大脑增强以提高对个体差异的鲁棒性;2)引入难度感知的图像模糊进行脑图像对齐,减少有限fMRI信号难以支持的细粒度视觉细节的影响,同时保留稳定的视觉结构。

Abstract: Existing cross-subject fMRI decoding methods typically train a model on multiple scanned subjects and then adapt it to a new subject using substantial paired fMRI-image data. However, in realistic scenarios, new-subject fMRI data are often limited due to costly data acquisition, and raw data from previous subjects may be inaccessible, leading existing methods to suffer performance degradation during new-subject adaptation. In this paper, we identify that this degradation stems from two key issues: brain-side instability caused by large subject differences in fMRI responses, and image-side supervision unreliability caused by fine-grained visual details that are not reliably supported by limited fMRI signals. To address these challenges, we propose StableMind, a regularized adaptation framework designed to improve brain-side representation stability and image-side supervision reliability. (1) To stabilize brain representations, StableMind reuses ridge projections from the pretrained model as adaptation priors to constrain limited-data new-subject adaptation, and applies Fourier-based feature-level brain augmentation to improve robustness to individual variability. (2) To improve image supervision reliability, StableMind introduces difficulty-aware image blur for brain-image alignment, reducing the influence of fine-grained visual details that are weakly supported by limited fMRI signals while preserving stable visual structure. Experiments on the Natural Scenes Dataset under a unified 1-hour adaptation protocol demonstrate that StableMind achieves 84.02% image retrieval accuracy and 81.66% brain retrieval accuracy averaged over four subjects, surpassing the state-of-the-art method by 5.71% brain retrieval accuracy with fewer trainable adaptation parameters. Our code is available at https://github.com/lingeringlight/StableMind.


[143] Representation learning from OCT images cs.CV | cs.LGPDF

Hedi Tabia, Désiré Sidibé, Nawres Khlifa, Ahmed Tabia, Ines Rahmany

TL;DR: 这篇论文是一篇关于视网膜光学相干断层扫描(OCT)图像表示学习方法的系统性综述。它全面回顾了从早期深度学习方法到最新的基础模型和视觉-语言系统的发展,涵盖了监督学习、自监督学习、生成方法、3D建模、多模态学习等多种学习范式,并对公开数据集、评估协议和未来研究方向进行了结构化梳理。

Details

Motivation: OCT已成为眼科最常用的成像方式之一,临床处理大量采集数据的需求推动了其自动化分析。论文旨在通过综述表示学习方法,减少对专家标注的依赖,并提高跨设备和人群的诊断一致性。

Result: 作为一篇综述性论文,未报告具体的定量实验结果,但系统性地分析和比较了不同学习范式的方法论贡献、局限性及其关联。

Insight: 论文的主要创新在于提供了一个结构化的分类法和统一的数学框架来梳理OCT图像表示学习领域,并前瞻性地指出了未来关键研究方向,如体积基础模型预训练、不确定性感知学习、联邦学习、公平性和基于概念的可解释性等。

Abstract: Optical Coherence Tomography (OCT) has become one of the most used imaging modality in ophthalmology. It provides high-resolution, non-invasive visualization of retinal microarchitecture. The automated analysis of OCT images through representation learning has emerged as a central research frontier. This has mainly been driven by the clinical need to process large acquisition volumes. The objective is to reduce the reliance on expert annotation, and improve diagnostic consistency across devices and populations. This survey provides a comprehensive and structured review of representation learning methods for retinal OCT image analysis. It covers the period from early deep learning approaches to the most recent developments in foundation models and vision-language systems. We organize the literature along a principled taxonomy of learning paradigms, encompassing supervised learning with CNN-based and transformer-based architectures, self-supervised and semi-supervised methods, generative approaches, as well as 3D volumetric modeling, multimodal representation learning, and large-scale pretrained foundation models. For each paradigm, we analyze the core methodological contributions, identify persistent limitations, and trace the connections between successive approaches. We further provide a structured overview of publicly available OCT datasets, discuss evaluation protocol considerations, and present a unified problem formulation that situates each learning paradigm within a common mathematical framework. Building on this analysis, we identify and discuss the most pressing open research directions emerging in the literature. This includes volumetric foundation model pretraining, uncertainty-aware representation learning, federated and privacy-preserving training, fairness and bias mitigation, concept-based interpretability,…


[144] Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model cs.CVPDF

Zhou Bingtao, Xiang Mian, Ning Qian

TL;DR: 本文提出了一种更严格的源自由域适应(VODA)设置,完全摆脱对源模型的依赖,仅使用随机初始化模型、视觉语言(ViL)模型和未标记目标数据。作者设计了双阶段去噪区域蒸馏(TS-DRD)框架,先通过ViL引导预热模型,再寻找ViL与适应模型共有的去噪区域以提供更干净的蒸馏监督。在Office-Home、VisDA和DomainNet-126基准测试中,该方法取得了与依赖源模型的现有SFDA方法相当或更优的性能。

Details

Motivation: 现有源自由域适应(SFDA)方法仍依赖源预训练模型初始化,并非真正源自由;研究发现不同源模型对同一目标域的最终结果影响有限,因此提出完全消除源域依赖的更严格VODA设置。

Result: 在Office-Home、VisDA和DomainNet-126基准上,TS-DRD在VODA设置下达到了与使用源模型的现有SFDA方法竞争或更优的性能,验证了方法的有效性。

Insight: 创新点在于提出完全无需源模型的VODA新范式,并通过双阶段去噪区域蒸馏(TS-DRD)利用ViL模型与适应模型的共有区域生成更鲁棒的监督信号,为域适应提供了更纯粹的无源解决方案。

Abstract: Source-Free Domain Adaptation (SFDA) adapts source models to target domains without accessing source data, addressing privacy and transmission issues. However, existing methods still initialize from a source pre-trained model and thus are not truly source-free. Recent works have introduced Vision-Language (ViL) models to guide the adaptation process, in these methods, we observe that for the same target domain, different source models yield minimal variation in final results, indicating the source model itself has limited impact. Motivated by this, we propose ViL-Only Domain Adaptation (VODA) , a stricter setting that eliminates all dependencies on source domain, relying solely on a randomly initialized model, a ViL model, and unlabeled target data. We analyze the adaptation dynamics of VODA and introduce Two-Stage Denoised-Region Distillation (TS-DRD) , a two-stage framework that first warms up the model with ViL guidance, then seek a Denoised-Region inherent in both the ViL and adapting model, yielding cleaner supervision for distillation. Experiments on Office-Home, VisDA, and DomainNet-126 show that under VODA, TS-DRD achieves competitive or superior performance to existing SFDA methods that still use source models, demonstrating its effectiveness and the potential of the VODA setting.


[145] Global-Local Feature Decoding with Adapter-Guided SAMv2 for Salient Object Detection cs.CVPDF

Morteza Moradi, Mohammad Moradi, Simone Palazzo, Ali Borji, Concetto Spampinato

TL;DR: 本文提出了GLASSNet,一个用于显著目标检测的全局-局部特征解码框架。该方法以冻结的SAMv2作为编码器,并引入轻量级的空间感知卷积适配器,大幅减少了可学习参数。通过双解码器架构分别捕获全局语义和局部细节,融合后生成精确的显著图。

Details

Motivation: 解决在大型视觉模型时代,显著目标检测任务未被充分探索的问题。尽管基础模型(如SAM)泛化能力强,但直接用于SOD潜力未完全发挥,且全微调计算成本高、在小数据下易过拟合。

Result: 在标准显著目标检测和伪装目标检测基准测试上的大量实验表明,GLASSNet超越了最先进的方法。

Insight: 主要创新点在于:1) 使用冻结的基础模型编码器结合极轻量的适配器(减少97%以上参数),实现高效迁移;2) 设计全局-局部双解码器架构,融合长程语义与精细局部线索;3) 展示了冻结基础模型结合针对性适配与全局-局部解码的有效性。

Abstract: Salient Object Detection (SOD) remains an essential yet underexplored task in the era of large-scale vision models. Although foundation models like SAM exhibit strong generalization, their potential for SOD is not fully realized, and training or fully fine-tuning them is computationally expensive and prone to overfitting under limited data. To overcome these challenges, we introduce GLASSNet, a Global-Local feature decoding framework that uses SAMv2 as a frozen encoder paired with a lightweight, spatially aware convolutional adapter-reducing learnable encoder parameters by over 97%. To enhance saliency quality, GLASSNet employs a dual-decoder architecture: one decoder captures global, long-range semantics with an expanded receptive field, while the other captures fine local details such as edges and textures. Fusing these complementary cues yields saliency maps that combine global coherence with local precision, producing accurate final masks. Extensive experiments on standard SOD and camouflaged object detection benchmarks show that GLASSNet surpasses state-of-the-art methods, demonstrating the power of frozen foundation models combined with targeted adaptation and global-local decoding.


[146] Retrieving Any Relevant Moments: Benchmark and Models for Generalized Moment Retrieval cs.CV | cs.MMPDF

Yiming Ding, Siyu Cao, Luyuan Jiao, Yixuan Li, Zitong Wang

TL;DR: 本文提出了广义时刻检索(GMR)这一新任务,旨在从视频中检索与自然语言查询相关的所有时刻或预测空集,以克服传统视频时刻检索(VMR)仅假设单一匹配时刻的局限性。为此,作者构建了基于足球视频的大规模基准数据集Soccer-GMR,并设计了统一的评估协议。此外,论文还提出了两种建模范式的强基线方法:一种用于判别式VMR模型的轻量级即插即用GMR适配器,以及一种用于微调多模态大语言模型(MLLMs)的GMR定制GRPO奖励。

Details

Motivation: 传统视频时刻检索(VMR)通常假设每个查询只对应一个匹配时刻,但这在现实场景中并不总是成立,因为查询可能对应多个时刻或无匹配时刻。因此,需要一种更通用的设置来检索所有相关时刻或预测空集。

Result: 在提出的Soccer-GMR基准上进行了广泛实验,所提出的基线方法在所有评估指标(包括空集拒绝、正查询定位和端到端GMR性能)上均取得了一致的性能提升,并揭示了当前方法的局限性。

Insight: 创新点包括:1) 提出了广义时刻检索(GMR)这一更现实和具有挑战性的统一任务设置;2) 通过半自动化流程构建了大规模、高质量的Soccer-GMR基准数据集;3) 设计了针对GMR的定制化评估协议和指标;4) 提出了适用于不同建模范式(判别式VMR模型和生成式MLLMs)的强基线方法,为未来研究提供了基础。

Abstract: Video Moment Retrieval (VMR) aims to localize temporal segments in videos that correspond to a natural language query, but typically assumes only a single matching moment for each query. This assumption does not always hold in real-world scenarios, where queries may correspond to multiple or no moments. Thus, we formulate Generalized Moment Retrieval (GMR), a unified setting that requires retrieving the complete set of relevant moments or predicting an empty set. To enable systematic study of GMR, we introduce Soccer-GMR, a large-scale benchmark built on challenging soccer videos that reflect general GMR scenarios, with realistic negative and positive queries. The benchmark is constructed via a duration-flexible semi-automated pipeline with human verification, enabling scalable data generation while maintaining high annotation quality. We further design a unified evaluation protocol with complementary metrics tailored for null-set rejection, positive-query localization, and end-to-end GMR performance. Finally, we establish strong baselines across two modeling paradigms: a lightweight plug-and-play GMR adapter for discriminative VMR models, and a GMR-tailored GRPO reward for fine-tuning multimodal large language models (MLLMs). Extensive experiments show consistent gains across all metrics and expose key limitations of current methods, positioning GMR as a more realistic and challenging benchmark for video-language understanding.


[147] AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding cs.CVPDF

Ruilin Yao, Shegnwu Xiong, Tianyu Zou, Shili Xiong, Yi Rong

TL;DR: 本文提出AutoFocus,一种无需训练、基于不确定性感知的主动视觉搜索框架,用于解决高分辨率GUI界面中视觉语言模型(VLMs)在坐标定位任务上的性能下降问题。该方法通过采样多个坐标假设,将坐标生成过程中的token级困惑度转换为各向异性的高斯空间概率场,从而显式建模方向不确定性,并基于此生成全局和局部区域建议,结合形状感知缩放和视觉提示聚合来提升定位精度。

Details

Motivation: 现有基于视觉语言模型的自主GUI代理在将自然语言指令转换为可执行屏幕坐标时,在高分辨率界面中性能下降,因为密集布局和小型交互元素暴露了现代显示器分辨率与模型输入约束之间的差距。现有放大策略(如固定锚点、启发式网格或强化学习)缺乏自适应确定需要细化的位置以及应探索多少空间不确定性的原则性机制。

Result: 在ScreenSpot-Pro和ScreenSpot-V2基准测试上的广泛实验表明,该方法在通用和GUI专用视觉语言模型上均取得了一致的性能提升。

Insight: 核心创新点在于利用坐标生成过程中的token级困惑度作为空间不确定性的自然反映,并将其建模为各向异性的高斯空间概率场,从而指导自适应的、形状感知的缩放和区域提议。这是一种无需训练、原则性的不确定性感知主动搜索机制,可有效提升GUI grounding的精度和鲁棒性。

Abstract: Vision-Language Models (VLMs) have enabled autonomous GUI agents that translate natural language instructions into executable screen coordinates. However, grounding performance degrades in high-resolution interfaces, where dense layouts and small interactive elements expose a resolution gap between modern displays and model input constraints. Existing zoom-in strategies rely on fixed anchors, heuristic grids, or reinforcement learning, lacking a principled mechanism to adaptively determine where refinement is needed and how much spatial uncertainty should be explored. We propose AutoFocus, a training-free, uncertainty-aware active visual search framework for GUI grounding. Our key insight is that token-level perplexity in coordinate generation naturally reflects spatial uncertainty. Rather than committing to a single prediction, AutoFocus samples multiple coordinate hypotheses and converts their axial perplexities into an anisotropic gaussian spatial probability field, explicitly modeling directional uncertainty. Based on this field, we generate global and local region proposals and introduce Shape-Aware Zooming to balance tight localization with contextual preservation. A visual prompt-based aggregation step then selects the most consistent prediction via structured comparison. Extensive experiments on ScreenSpot-Pro and ScreenSpot-V2 demonstrate consistent improvements across both general-purpose and GUI-specialized VLMs.


[148] ViewSAM: Learning View-aware Cross-modal Semantics for Weakly Supervised Cross-view Referring Multi-Object Tracking cs.CV | cs.AIPDF

Jiawei Ge, Xintian Zhang, Jiuxin Cao, Bo Liu, Fabian Deuser

TL;DR: 本文提出ViewSAM,一个用于弱监督跨视角指代多目标跟踪(CRMOT)的两阶段框架。该框架首先利用基础模型SAM3生成跨视角伪标签,然后训练基于SAM2的ViewSAM模型,通过显式建模视角感知的跨模态语义,实现仅使用类别标签作为粗粒度监督的跨视角指代跟踪。

Details

Motivation: 现有CRMOT方法严重依赖昂贵的逐帧空间标注和跨视角身份监督,本文旨在探索利用基础模型能力,在仅使用对象类别标签作为弱监督的条件下解决跨视角指代跟踪问题,以降低标注成本。

Result: 在弱监督设置下,ViewSAM取得了SOTA性能,并与全监督方法保持竞争力。大量实验验证了其有效性。

Insight: 创新点包括:1) 将基础模型重新用作伪标签生成器,提出亲和力引导的跨视角重提示策略来关联SAM3生成的轨迹段;2) 构建ViewSAM模型,通过将视角引起的变体建模为可学习条件,显式桥接视角变化的视觉观察与视角不变的文本表达之间的语义鸿沟,仅需约10%的额外参数即可实现鲁棒的跨模态跟踪。

Abstract: Cross-view Referring Multi-Object Tracking (CRMOT) aims to track multiple objects specified by natural language across multiple camera views, with globally consistent identities. Despite recent progress, existing methods rely heavily on costly frame-level spatial annotations and cross-view identity supervision. To reduce such reliance, we explore CRMOT under weak supervision by leveraging the capabilities of foundation models. However, our empirical study shows that directly applying foundation models such as SAM2 and SAM3, even with task-specific modifications, fails to accurately understand referring expressions and maintain consistent identities across views. Yet, they remain effective at producing reliable object tracklets that can serve as pseudo supervision. We therefore repurpose foundation models as pseudo-label generators and propose a two-stage framework for weakly supervised CRMOT, using only object category labels as coarse-grained supervision. In the first stage, we design an Affinity-guided Cross-view Re-prompting strategy to refine and associate SAM3-generated tracklets across cameras, producing reliable cross-view pseudo labels for subsequent training. In the second stage, we introduce ViewSAM, a CRMOT model built upon SAM2 that explicitly models view-aware cross-modal semantics. By formulating view-induced variations as learnable conditions, ViewSAM bridges the gap between view-variant visual observations and view-invariant textual expressions, enabling robust cross-view referring tracking with only approximately 10% additional parameters. Extensive experiments demonstrate that ViewSAM achieves SOTA performance under weak supervision and remains competitive with fully supervised methods.


[149] Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE cs.CVPDF

Yangming Shi, Shixiang Zhu, Tao Shen, Zhimiao Yu, Dengsheng Chen

TL;DR: Mamoda2.5是一个统一的AR-Diffusion框架,将多模态理解与生成无缝集成于单一架构中。它通过在Diffusion Transformer主干中引入细粒度混合专家(MoE)设计,构建了一个250亿参数但仅激活30亿参数的模型,显著降低了训练成本。该模型在VBench 2.0上取得了顶级的生成性能,在视频编辑质量上创造了新纪录,并匹配了当前顶级专有模型的性能。此外,通过联合少步蒸馏和强化学习框架,将30步编辑模型压缩为4步模型,极大加速了推理速度,在视频编辑推理上比开源基线快达95.9倍。

Details

Motivation: 为了解决在多模态任务中统一理解与生成、同时降低大规模模型训练成本并提升推理效率的问题。

Result: 在VBench 2.0上取得顶级生成性能;在OpenVE-Bench的视频编辑质量上创下新纪录,超越开源模型,匹配包括Kling O1在内的顶级专有模型;视频编辑推理速度比开源基线快达95.9倍;在内部广告视频编辑场景中达到98%的成功率。

Insight: 主要创新点包括:1) 统一的AR-Diffusion框架集成多模态理解与生成;2) 在Diffusion Transformer中引入细粒度MoE设计(128专家,Top-8路由),实现参数高效扩展;3) 联合少步蒸馏与强化学习框架,显著压缩模型步数并加速推理。这些方法在保持高性能的同时,有效平衡了模型容量、训练成本和推理速度。

Abstract: We present Mamoda2.5, a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. To efficiently enhance the model’s generation capability, we equip the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts (MoE) design (128 experts, Top-8 routing), yielding a 25B-parameter model that activates only 3B parameters, significantly reducing training costs while scaling up the model capacity. Mamoda2.5 achieves top-tier generation performance on VBench 2.0 and sets a new record in video editing quality, surpassing evaluated open-source models and matching the performance of current top-tier proprietary models, including the Kling O1 on OpenVE-Bench. Furthermore, we introduce a joint few-step distillation and reinforcement learning framework that compresses the 30-step editing model into a 4-step model and greatly accelerates model inference. Compared to open-source baselines, Mamoda2.5 achieves up to $95.9\times$ faster video editing inference. In real-world applications, Mamoda2.5 has been successfully deployed for content moderation and creative restoration tasks in advertising scenarios, achieving a 98% success rate in internal advertising video editing scenario.


[150] Human Activity Recognition Method for Moderate Violence Detection cs.CVPDF

Luis Angel Aparicio Borjas, Victor Elias Nieto, Juan Irving Vasquez, Alfonso Fernandez-Vazquez, Gerardo Antonio Alvarez Hernandez

TL;DR: 本研究开发了一种基于计算机视觉的自动化系统,用于实时检测监控视频中的中度身体暴力行为(特别是推搡)。该系统整合了YOLO11和YOLO11-Pose模型进行人体检测和骨骼关键点提取,通过计算身体倾斜度和肩髋关节角度,并训练随机森林分类器来区分正常行为与攻击性身体接触。

Details

Motivation: 解决公共场所物理暴力这一重大公共卫生问题,特别是将推搡等轻微事件作为更严重暴力升级的前兆进行早期检测,以实现干预。

Result: 在三个难度递增的案例研究中评估了系统性能:在受控环境正面视角下,模型精确度达到0.98;在最具挑战性的真实世界高空、陡角监控场景中,尽管存在显著透视畸变和视觉噪声,系统仍保持了0.72的精确度。

Insight: 创新点在于将先进的姿态估计模型(YOLO-Pose)与基于骨骼关键点的几何特征(身体倾斜度、关节角度)相结合,并应用随机森林分类器,为复杂真实监控场景下的早期暴力行为识别提供了一种可行方案。

Abstract: Physical violence in public spaces is a significant public health concern, with minor incidents such as pushing often serving as precursors to more severe escalations. This research develops an automated system for the real-time detection of moderate physical violence, specifically pushing, in surveillance camera footage. The proposed solution integrates state-of-the-art computer vision models, utilizing YOLO11 and YOLO11-Pose for human detection and skeletal keypoint extraction. By calculating body inclination and joint angles between shoulders and hips, a Random Forest classifier was trained to distinguish between normal behavior and aggressive physical contact. The system’s performance was evaluated through three progressive case studies representing increasing levels of difficulty. In controlled environments with frontal views, the model achieved a precision of 0.98. In the most challenging scenario, featuring high-altitude, steep-angle recordings from real-world surveillance infrastructure, the system maintained a precision of 0.72 despite significant perspective distortion and visual noise. These results demonstrate the feasibility of using skeletal analysis for early violence intervention in urban security contexts.


[151] OphMAE: Bridging Volumetric and Planar Imaging with a Foundation Model for Adaptive Ophthalmological Diagnosis cs.CV | cs.AIPDF

Tienyu Chang, Zhen Chen, Renjie Liang, Jinyu Ding, Jie Xu

TL;DR: OphMAE是一种眼科多模态基础模型,通过创新的跨模态融合架构和自适应推理机制,将3D OCT的体数据深度与2D en face OCT的平面上下文信息相结合,在17种不同诊断任务上实现了SOTA性能,并展现出强大的工程适应性,即使在仅使用2D单模态输入或少量标注数据时也能保持高诊断准确率。

Details

Motivation: 当前眼科AI范式主要局限于单模态推理,与临床实践中依赖多种互补成像模式进行诊断的现实存在脱节,且在资源有限的环境中,高性能AI的部署常因缺乏先进3D成像硬件而受阻。

Result: 在包含48,340对OCT图像的严格基准测试中,模型在年龄相关性黄斑变性(AMD)和糖尿病性黄斑水肿(DME)上分别取得了96.9%和97.2%的AUC,一致超越了现有的单模态和多模态基础模型。即使在仅使用2D输入时,AMD的AUC仍达93.7%,并且在仅使用500个标注样本时仍能保持95.7%的AUC。

Insight: 论文的核心创新点在于提出了一个新颖的跨模态融合架构和独特的自适应推理机制,能够有效整合3D和2D眼科成像信息,从而构建了一个可扩展且适应性强的眼科AI框架,确保了在不同任务和资源条件下的鲁棒性能。

Abstract: The advent of foundation models has heralded a new era in medical artificial intelligence (AI), enabling the extraction of generalizable representations from large-scale unlabeled datasets. However, current ophthalmic AI paradigms are predominantly constrained to single-modality inference, thereby creating a dissonance with clinical practice where diagnosis relies on the synthesis of complementary imaging modalities. Furthermore, the deployment of high-performance AI in resource-limited settings is frequently impeded by the unavailability of advanced three-dimensional imaging hardware. Here, we present the Ophthalmic multimodal Masked Autoencoder (OphMAE), a multi-imaging foundation model engineered to synergize the volumetric depth of 3D Optical Coherence Tomography (OCT) with the planar context of 2D en face OCT. By implementing a novel cross-modal fusion architecture and a unique adaptive inference mechanism, OphMAE was pre-trained on a massive dataset with of 183,875 paired OCT images derived from 32,765 patients. In a rigorous benchmark encompassing 17 diverse diagnostic tasks with 48,340 paired OCT images from 8,191 patients, the model demonstrated state-of-the-art performance, achieving an Area Under the Curve (AUC) of 96.9% for Age-related Macular Degeneration (AMD) and 97.2% for Diabetic Macular Edema (DME), consistently surpassing existing single-modal and multimodal foundation models. Crucially, OphMAE exhibits robust engineering adaptability: it maintains high diagnostic accuracy, such as 93.7% AUC for AMD, even when restricted to single-modality 2D inputs, and demonstrates exceptional data efficiency by retaining 95.7% AUC with as few as 500 labeled samples. This work establishes a scalable and adaptable framework for ophthalmic AI, ensuring robust performance across different tasks.


[152] Perceptual Flow Network for Visually Grounded Reasoning cs.CV | cs.AIPDF

Yangfu Li, Yuning Gong, Hongjian Zhan, Teng Li, Yuanhuiyi Lyu

TL;DR: 本文提出感知流网络(PFlowNet),旨在解决大型视觉语言模型(LVLMs)中因标准优化目标(如MLE)无法约束视觉轨迹而导致的语言偏见和幻觉问题。该方法通过解耦感知与推理,建立自条件生成过程,并利用变分强化学习整合多维奖励与邻近几何塑造,以促进面向推理的感知行为并保持视觉可靠性。

Details

Motivation: 现有方法引入视觉专家的几何先验作为额外监督,但这类监督偏向几何精度且推理效用有限,因此需要一种更优的监督机制来弥合视觉可靠性与推理有效性之间的差距。

Result: PFlowNet在V* Bench上达到90.6%,在MME-RealWorld-lite上达到67.0%,创造了新的SOTA记录,证明了其性能保证和竞争力。

Insight: 创新点在于解耦感知与推理以实现自条件生成,并通过变分强化学习结合多维奖励与邻近几何塑造,从而在避免僵化对齐专家先验的同时,实现可解释且更有效的视觉推理。

Abstract: Despite the success of Large-Vision Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility. To bridge this gap, we propose Perceptual Flow Network (PFlowNet), which eschews rigid alignment with the expert priors and achieves interpretable yet more effective visual reasoning. Specifically, PFlowNet decouples perception from reasoning to establish a self-conditioned generation process. Based on this, it integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning, thereby facilitating reasoning-oriented perceptual behaviors while preserving visual reliability. PFlowNet delivers a provable performance guarantee and competitive empirical results, particularly setting new SOTA records on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).


[153] Virtual Scanning for NSCLC Histology: Investigating the Discriminatory Power of Synthetic PET cs.CV | cs.AIPDF

Fatih Aksu, Laura Ciuffetti, Francesco Di Feola, Filippo Ruffini, Giulia Romoli

TL;DR: 本文提出了一种用于非小细胞肺癌(NSCLC)组织学亚型(腺癌与鳞状细胞癌)分类的“虚拟扫描”方法。该方法利用预训练的3D Pix2Pix GAN从解剖CT扫描中合成伪PET代谢图像,并通过多阶段中间融合(MINT)框架将合成PET特征与原始CT特征结合,以增强分类性能。

Details

Motivation: 解决在NSCLC个性化治疗中,准确区分ADC和SCC亚型至关重要,而标准的[$^{18}$F]FDG PET/CT检查因成本高和辐射暴露限制了其广泛应用。本文旨在探索是否可以通过合成PET数据来补充解剖CT扫描,从而在不进行实际PET扫描的情况下,提供互补的代谢特征以改善分类。

Result: 在包含714名受试者的多中心数据集上的实验表明,与仅使用CT的基线相比,引入合成代谢特征显著提升了分类性能。多模态方法将曲线下面积(AUC)从0.489显著提高到0.591,几何平均数(GMean)从0.305提高到0.524。

Insight: 论文的创新点在于提出了一种“虚拟扫描”的特征增强策略,利用生成对抗网络从易获取的CT数据中合成具有判别性的代谢特征(伪PET),从而在物理PET扫描不可用时,为深度学习模型提供互补的跨模态信息。这为临床场景提供了一种低成本、低风险的潜在解决方案。

Abstract: Accurate histological differentiation between adenocarcinoma (ADC) and squamous cell carcinoma (SCC) is critical for personalized treatment in non-small cell lung cancer (NSCLC). While [$^{18}$F]FDG PET/CT is a standard tool for the clinical evaluation of lung cancer, its utility is often limited by high costs and radiation exposure. In this paper, we investigate the feasibility of “virtual scanning” as a feature-enhancement strategy by evaluating whether synthetic PET data can provide complementary feature representations to supplement anatomical CT scans in histological subtype classification. We propose a framework that leverages a 3D Pix2Pix Generative Adversarial Network (GAN), pretrained on the FDG-PET/CT Lesions dataset, to synthesize pseudo-PET volumes from anatomical CT scans. These synthetic volumes are integrated with structural CT data within the MINT framework, a multi-stage intermediate fusion architecture. Our experiments, conducted on a multi-center dataset of 714 subjects, demonstrate that the inclusion of synthetic metabolic features significantly improves classification performance over a CT-only baseline. The multimodal approach achieved a statistically significant increase in the Area Under the Curve (AUC) from 0.489 to 0.591 and improved the Geometric Mean (GMean) from 0.305 to 0.524. These results suggest that synthetic PET scans provide discriminatory metabolic cues that enable deep learning models to exploit complementary cross-modal information, offering a potential feature-enhancement strategy for clinical scenarios where physical PET scans are unavailable.


[154] Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation cs.CV | cs.ROPDF

Chenyu Hui, Xiaodi Huang, Siyu Xu, Yunke Wang, Shan You

TL;DR: 本文提出了一种高效的视频增强框架,用于将模拟的视觉-语言-动作(VLA)视频转换为逼真的训练视频,以弥合模拟与真实世界之间的视觉域差距。该方法通过视频语义分割和视频描述提取结构化条件,重写描述以增加环境多样性,并利用条件视频传输模型合成逼真视频。为了提高大规模增强的实用性,引入了扩散特征重用机制以加速生成,以及核心集采样策略以在有限计算下选择紧凑、非冗余的子集进行增强。

Details

Motivation: 解决VLA模型依赖大规模真实世界视频数据,而模拟数据因视觉域差距大和环境多样性有限导致真实世界泛化能力弱的问题,旨在利用低成本、可并行收集的模拟数据生成逼真训练视频。

Result: 在Robotwin 2.0、LIBERO、LIBERO-Plus和真实机器人平台上进行了广泛实验,均显示出性能的持续提升。例如,在Robotwin 2.0上将RDT-1B模型提升了8%,在更具挑战性的LIBERO-Plus基准上将$π_0$提升了5.1%。

Insight: 创新点包括:1) 从模拟视频中提取结构化条件(语义分割和描述)并重写描述以增强环境多样性;2) 引入扩散特征重用机制加速视频生成;3) 提出核心集采样策略优化计算资源下的数据选择。这些方法有效提升了模拟数据到真实世界的迁移效率和模型泛化能力。

Abstract: Vision-language-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limited environmental diversity, resulting in weak real-world generalization. We present an efficient video augmentation framework that converts simulated VLA videos into realistic training videos while preserving task semantics and action trajectories. Our pipeline extracts structured conditions from simulation via video semantic segmentation and video captioning, rewrites captions to diversify environments, and uses a conditional video transfer model to synthesize realistic videos. To make augmentation practical at scale, we introduce a diffusion feature-reuse mechanism that reuses video tokens across adjacent timesteps to accelerate generation, and a coreset sampling strategy that identifies a compact, non-redundant subset for augmentation under limited computation. Extensive experiments on Robotwin 2.0, LIBERO, LIBERO-Plus, and a real robotic platform demonstrate consistent improvements. For example, our method improves RDT-1B by 8% on Robotwin 2.0, and boosts $π_0$ by 5.1% on the more challenging LIBERO-Plus benchmark. Code is available at: https://github.com/nanfangxiansheng/Seeing-Realism-from-Simulation.


[155] FoR-Net: Learning to Focus on Hard Regions for Efficient Semantic Segmentation cs.CVPDF

Hsin-Jui Pan, Sheng-Wei Chan, Meng-Qian Li, Chun-Po Shen

TL;DR: 本文提出FoR-Net,一种轻量级语义分割架构,其核心思想是通过学习重要性图和Top-K激活机制,有选择性地聚焦于图像中的困难区域(如细长结构和物体边界),而非依赖繁重的全局建模。该模型使用选择器模块预测区域重要性,并采用多尺度卷积分支聚合不同感受野的上下文信息。

Details

Motivation: 解决在有限计算资源下进行高效语义分割时,如何更有效地处理困难区域(如物体边界和细结构)的问题,避免使用计算成本高的全局建模方法。

Result: 在Cityscapes基准测试上,尽管采用轻量级设计和标准训练配置,FoR-Net取得了有竞争力的性能,并在困难区域表现出更好的一致性。

Insight: 创新点在于引入区域聚焦推理作为高效的归纳偏置,通过预测重要性图与Top-K激活选择性增强信息区域;客观来看,这种将计算资源动态分配给困难区域的策略为轻量级模型设计提供了新思路。

Abstract: We present FoR-Net, a lightweight architecture for semantic segmentation that focuses on identifying and enhancing hard regions. Instead of relying on heavy global modeling, FoR-Net adopts an efficient strategy that selectively emphasizes informative regions through a learned importance map and a Top-K activation mechanism. Specifically, a selector module predicts region-wise importance, enabling the model to focus on challenging areas such as thin structures and object boundaries. Multi-scale reasoning is achieved using convolutional branches with different receptive fields, allowing diverse spatial context aggregation. We evaluate FoR-Net on the Cityscapes benchmark under limited computational resources. Despite its lightweight design and standard training configuration, FoR-Net achieves competitive performance and demonstrates improved consistency in challenging regions. These results suggest that region-focused reasoning provides a simple yet effective inductive bias for efficient semantic segmentation.


[156] Linearizing Vision Transformer with Test-Time Training cs.CVPDF

Yining Li, Dongchen Han, Zeyu Liu, Hanyi Wang, Yulin Wang

TL;DR: 本文提出了一种将预训练的Vision Transformer线性化的方法,通过测试时训练(TTT)架构直接继承Softmax注意力权重,并引入关键实例归一化和轻量级局部性增强模块来对齐表示属性,从而在保持图像生成质量的同时显著提升推理速度。

Details

Motivation: 解决线性复杂度注意力机制因与Softmax注意力存在表示鸿沟而无法有效继承预训练权重的问题,旨在实现高效且高质量的Transformer模型线性化转换。

Result: 在Stable Diffusion 3.5上线性化得到的SD3.5-T^5模型,仅需4块H20 GPU上1小时的微调,在1K和2K分辨率下分别实现1.32倍和1.47倍的推理加速,且文本到图像生成质量与微调后的Softmax模型相当。

Insight: 通过架构对齐(利用TTT的两层动态公式与Softmax注意力结构对齐)和表示对齐(关键实例归一化与局部性增强)的双重策略,实现了预训练权重到线性注意力模型的有效迁移,为高效Transformer部署提供了新思路。

Abstract: While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T$^5$ (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4$\times$H20 GPUs, SD3.5-T$^5$ achieves comparable text-to-image quality to the fine-tuned Softmax model, while accelerating inference by 1.32$\times$ and 1.47$\times$ at 1K and 2K resolutions.


[157] HumanSplatHMR: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatar cs.CVPDF

Yeheng Zong, Pou-Chun Kung, Yike Pan, Seth Isaacson, Yizhou Chen

TL;DR: 本文提出HumanSplatHMR,一个联合优化框架,旨在解决从视频中重建高保真人体化身时三维姿态恢复不准确的问题。该方法通过将几何姿态估计与可微分渲染(基于高斯泼溅)形成闭环,同时优化人体姿态和学习化身模型,以实现新视角和新姿态下的高质量合成。

Details

Motivation: 现有方法(如ViT、NeRF或高斯泼溅化身)在从视频恢复人体姿态和外观时存在局限:ViT方法可能过拟合2D视图而不稳定,而基于NeRF/高斯泼溅的化身将姿态与外观解耦处理,限制了在新姿态下的渲染泛化能力。本文旨在解决这些缺陷,在非受控场景下仅使用姿态估计器的输出,联合优化姿态与化身。

Result: 实验表明,该方法在姿态恢复和化身重建方面均优于基线方法。具体而言,它超越了省略图像级细化的姿态恢复基线,以及将姿态估计与化身重建解耦的化身基线,在精度和对齐度上取得了一致的改进。

Insight: 核心创新点在于将几何姿态估计与可微分渲染形成闭环优化,通过反向传播光度、分割和深度损失到姿态参数和全局位置,联合细化三维姿态并学习化身。这避免了依赖不切实际的高精度运动捕捉数据,仅使用现成姿态估计器的输出,更贴近真实世界条件,从而提升了姿态准确性和渲染质量。

Abstract: Accurately recovering human pose and appearance from video is an essential component of scene reconstruction, with applications to motion capture, motion prediction, virtual reality, and digital twinning. Despite significant interest in building realistic human avatars from video, this paper demonstrates that existing methods do not accurately recover the 3D geometry of humans. ViT-based approaches are not consistently reliable and can overfit to 2D views, while NeRF- and Gaussian Splatting-based avatars treat pose and appearance separately, limiting rendering generalization to new poses. To resolve these shortcomings, this paper proposes HumanSplatHMR, a joint optimization framework that refines 3D human poses while simultaneously learning a high-fidelity avatar for novel-view and novel-pose synthesis. Our key insight is to close the loop between geometric pose estimation and differentiable rendering. Unlike prior human avatar methods that rely on accurate human pose obtained through motion capture systems or offline refinement, which are impractical in in-the-wild scenarios, our approach uses only human mesh estimates from a state-of-the-art human pose estimator to better reflect real-world conditions. Therefore, instead of using the human pose only as a deformation prior, HumanSplatHMR backpropagates photometric, segmentation, and depth losses through a differentiable renderer to the pose parameters and global position. This coupling refines the global 3D pose over time, improving accuracy and alignment while producing better renderings from novel views. Experiments show consistent improvements over pose recovery baselines that omit image-level refinement and avatar baselines that decouple pose estimation from avatar reconstruction.


[158] IConFace: Identity-Structure Asymmetric Conditioning for Unified Reference-Aware Face Restoration cs.CV | cs.AIPDF

Axi Niu, Jinyang Zhang, Senyan Qing

TL;DR: 本文提出了IConFace,一个统一的参考感知和无参考人脸恢复框架。它通过身份-结构非对称条件化,将参考图像提炼为全局身份锚点进行图像调制,同时将退化图像强化为空间结构锚点。该单一模型可根据参考图像是否可用,灵活切换模式,在提升身份一致性、细节恢复和无参考恢复质量方面表现优异。

Details

Motivation: 解决严重退化下人脸恢复的病态问题,即退化输入可能丢失关键身份细节。现有基于同身份参考的方法容易因姿态、表情等不匹配而过度使用参考外观,需要一种能统一利用参考信息并在无参考时有效恢复的框架。

Result: 论文表明,该方法在身份一致性、精细细节恢复以及纯退化图像恢复质量上均有提升,实现了一个统一的单一检查点模型。

Insight: 核心创新点是身份-结构非对称条件化机制:将参考信息压缩为全局身份锚点(AdaFace),而将退化图像作为空间结构锚点进行强化(通过低秩残差和块级退化交叉注意力),实现了参考信息与结构信息的解耦与高效融合。

Abstract: Blind face restoration is highly ill-posed under severe degradation, where identity-critical details may be missing from the degraded input. Same-identity references reduce this ambiguity, but mismatched pose, expression, illumination, age, makeup, or local facial states can lead to overuse of reference appearance. We propose \textbf{IConFace}, a unified reference-aware and no-reference framework with identity–structure asymmetric conditioning. References are distilled into a norm-weighted global AdaFace identity anchor for image-only modulation, while the degraded image is reinforced as the spatial structure anchor through low-rank residuals and block-wise degraded cross-attention with two-route memory. The resulting single checkpoint exploits references when available and falls back to no-reference restoration when absent, improving identity consistency, fine-detail recovery, and degraded-only restoration quality in a unified model.


[159] VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition cs.CV | cs.LGPDF

Tanush Yadav, Mohammadreza Salehi, Jae Sung Park, Vivek Ramanujan, Hannaneh Hajishirzi

TL;DR: 本文介绍了VideoNet,一个针对领域特定动作识别的大规模数据集,包含37个领域的1000个不同动作。论文通过多项选择、二分类和少样本设置评估了现有视觉语言模型(VLM)的性能,发现模型在利用上下文示例方面存在困难。通过收集近50万个视频问答对进行微调,微调后的Molmo2-4B模型在VideoNet基准测试中超越了所有8B参数的开源模型。

Details

Motivation: 由于缺乏足够多样化和具有挑战性的数据,现代视觉语言模型(VLM)在动作识别能力上未得到充分评估。为了在VLM时代复兴动作识别研究,论文主张重新关注领域特定动作,并为此构建了VideoNet数据集。

Result: 在VideoNet基准测试中,Gemini 3.1 Pro在多项选择设置中达到69.9%准确率,而Qwen3-VL-8B仅为45.0%。在二分类设置中,Qwen准确率为59.2%。在少样本设置下,Qwen提升+7.0%,Gemini下降-4.8%,均低于人类在相同设置下+13.6%的提升。微调后的Molmo2-4B模型超越了所有8B参数的开源模型。

Insight: 论文的创新点在于构建了首个大规模领域特定动作识别数据集VideoNet,并系统评估了VLM在多种设置下的性能,揭示了模型在利用上下文示例方面的局限性。通过微调实验,证明了专用训练数据对提升模型动作识别能力的重要性,为领域特定动作识别研究提供了新的基准和方向。

Abstract: Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a returned focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the difference between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy while Qwen3-VL-8B gets a mere 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where random chance is 50%. Still, Qwen achieves only 59.2% accuracy. Further relaxing the evaluation setup, we provide $k\in{1,2,3}$ in-context examples of the action. Some models excel in the few-shot setting, while others falter; Qwen improves $+7.0%$, while Gemini declines $-4.8%$. Notably, these gains fall short of the $+13.6%$ improvement in non-expert humans when given few-shot examples. Finding that VLMs struggle to fully exploit in-context examples, we shift from test-time improvements to the training side. We collect the first large-scale training dataset for domain-specific actions, totaling nearly 500k video question-answer pairs. Fine-tuning a Molmo2-4B model on our data, we surpass all open-weight 8B models on the VideoNet benchmark.


[160] Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion cs.CVPDF

Amirhosein Javadi, Shirin Saeedi Bidokhti, Tara Javidi

TL;DR: 本文提出了ActDiff-VC,一种基于扩散模型的超低码率视频压缩框架。该方法将视频分割为变长片段,仅在需要时传输关键帧,并使用一组紧凑的跟踪点轨迹来总结时间动态。在接收到这些稀疏条件信号后,一个条件扩散解码器合成剩余帧,从而在严格的码率限制下实现感知上真实的视频重建。

Details

Motivation: 扩散模型为超低码率下的感知重建提供了强大的生成先验,但有效的视频压缩需要利用高度紧凑的条件信号来控制生成过程。

Result: 在UVG和MCL-JCV基准测试上,ActDiff-VC在相同NIQE下实现了高达64.6%的码率节省;在可比码率下,相比强大的学习型编解码器,KID提升了高达64.6%,FID提升了高达37.7%;在超低码率下,相比学习型和基于扩散的基线方法,提供了更优的感知率失真权衡。

Insight: 核心创新在于提出了内容自适应的关键帧选择与预算感知的稀疏轨迹选择两种机制,共同为生成式重建提供了紧凑而有效的条件信号。这实现了对扩散生成过程的精确控制,以极低的比特开销高效地传递视频内容与动态信息,是生成式视频压缩在超低码率下的有效探索。

Abstract: Diffusion models provide a powerful generative prior for perceptual reconstruction at ultra-low bitrates, but effective video compression requires controlling the generative process using highly compact conditioning signals. In this work, we present ActDiff-VC, a diffusion-based video compression framework for the ultra-low-bitrate regime. Our method partitions videos into variable-length segments, transmits keyframes only when needed, and summarizes temporal dynamics using a compact set of tracked point trajectories. Conditioned on these sparse signals, a conditional diffusion decoder synthesizes the remaining frames, enabling perceptually realistic reconstruction under severe rate constraints. To support this design, we introduce two mechanisms: content-adaptive keyframe selection and budget-aware sparse trajectory selection, which together enable compact yet effective conditioning for generative reconstruction. Experiments on the UVG and MCL-JCV benchmarks show that ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE, improves KID by up to 64.6% and FID by up to 37.7% at comparable bitrates against strong learned codecs, and delivers favorable perceptual rate–distortion trade-offs relative to learned and diffusion-based baselines in the ultra-low-bitrate regime.


[161] AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion cs.CV | cs.IRPDF

Yu-Ju Tsai, Brian Price, Qing Liu, Luis Figueroa, Daniil Pakhomov

TL;DR: AlbumFill是一个无需训练的框架,用于个性化图像补全,通过从个人相册中检索身份一致的参考图像来恢复遮挡区域。它利用视觉语言模型推断缺失的语义线索以指导检索,并使用基于参考的补全模型进行修复。论文还引入了一个包含54K以人为中心样本的数据集来支持该任务。

Details

Motivation: 解决个性化图像补全中现有方法依赖通用修复模型导致身份不一致,或假设显式提供合适参考图像的问题,实际中需要从个人相册中自动搜索身份一致的图像。

Result: 在多个基线方法上的实验表明,个性化补全任务具有挑战性,并强调了身份一致参考检索的重要性,但摘要未提及具体定量结果或SOTA比较。

Insight: 创新点在于训练免费的框架,结合视觉语言模型进行语义推理和检索,从个人相册中自动获取参考图像,以提升身份一致性;客观分析认为其数据集构建和检索引导机制是可借鉴的实践。

Abstract: Personalized image completion aims to restore occluded regions in personal photos while preserving identity and appearance. Existing methods either rely on generic inpainting models that often fail to maintain identity consistency, or assume that suitable reference images are explicitly provided. In practice, suitable references are often not explicitly provided, requiring the system to search for identity-consistent images within personal photo collections. We present AlbumFill, a training-free framework that retrieves identity-consistent references from personal albums for personalized completion. Given an occluded image and a personal album, a vision-language model infers missing semantic cues to guide composed image retrieval, and the retrieved references are used by reference-based completion models. To facilitate this task, we introduce a dataset containing 54K human-centric samples with associated album images. Experiments across multiple baselines demonstrate the difficulty of personalized completion and highlight the importance of identity-consistent reference retrieval. Project Page: https://liagm.github.io/AlbumFill/


cs.HC [Back]

[162] EduGage: Methods and Dataset for Sensor-Based Momentary Assessment of Engagement in Self-Guided Video Learning cs.HC | cs.CVPDF

Zikang Leng, Edan Eyal, Yingtian Shi, Jiaman He, Yaqi Liu

TL;DR: 该研究开发了EduGage系统,通过可穿戴和摄像头传感器收集生理与运动信号(如PPG、ECG、EDA、EEG、IMU、心率、温度和眼动数据),以估计学习者在视频学习中的投入度。研究包含16名参与者的用户实验,收集了多模态传感器数据和即时自我报告的投入度标签,并构建了一个用于细粒度投入度建模的数据集。

Details

Motivation: 在线和视频学习环境中,学习者需要自我调节与教学材料的互动,测量和反思投入度对学习者和自适应学习系统均有支持作用。本研究旨在探索利用多模态传感器数据来估计学习投入度的可行性和有效性。

Result: 在基于参与者的交叉验证中,模型取得了MAE为0.81、83.75%的within-1准确率、73.93%的二元准确率和68.45%的二元Macro-F1分数,性能优于无传感器、统计、深度时序、基础模型和基于LLM的基线方法。

Insight: 研究创新点在于构建了首个包含同步多模态传感器信号和即时投入度标签的EduGage数据集,支持可重复的细粒度投入度建模研究。客观分析表明,实用系统应优先考虑轻量级的行为与生理信号组合,而非完全的多模态仪器化,这为实际应用提供了重要设计启示。

Abstract: Engagement, which links to attentional, emotional, and cognitive dimensions, plays an important role in learning. In online and video-based learning environments, learners often need to regulate their own interactions with instructional materials. Measuring and reflecting on engagement can therefore support both learners and adaptive learning systems. In this study, we use wearable and camera-based sensing devices to collect physiological and motion signals, including PPG, ECG, EDA, EEG, IMU, heart rate, temperature, and eye-tracking data, to estimate learner engagement. We conducted a user study with 16 participants in a video-based learning scenario, where participants completed learning tasks and provided repeated in-situ self-reports of engagement through brief probes. We develop and evaluate a system for engagement estimation, compare different sensing modalities, and further analyze the feasibility and effectiveness of multimodal modeling for characterizing learner engagement. Across participant-based cross-validation, our model achieves an MAE of 0.81, 83.75% within-1 accuracy, 73.93% binary accuracy, and 68.45% binary Macro-F1, outperforming sensor-free, statistical, deep temporal, foundation-model, and LLM-based baselines. Our results suggest that fine-grained engagement estimation is feasible but inherently noisy, and that practical systems should prioritize lightweight combinations of behavioral and physiological signals over full multimodal instrumentation. We release the EduGage dataset, including synchronized multimodal sensor signals, probe-aligned momentary engagement labels, video metadata, quizzes, and study materials, to support reproducible research on fine-grained sensor-based engagement modeling in self-guided learning.


cs.DB [Back]

[163] Graph Query Generation with Constraint-guided Large Language Agents cs.DB | cs.AI | cs.CLPDF

Mengying Wang, Nicolaas Jedema, Rahul Pandey, RaviKiran Krishnan, Jens Lehmann

TL;DR: 本文提出了一种名为UniQGen的新型约束引导框架,利用大型语言模型(LLM)智能体动态提取和精炼代表性的图查询子句,以生成跨查询语言(如Cypher)的可执行、意图对齐的图查询,旨在解决知识图谱问答(KGQA)中针对属性图和Cypher查询语言研究不足的问题。

Details

Motivation: 当前知识图谱问答(KGQA)的研究主要集中在RDF/SPARQL上,而工业界对统一KGQA的需求日益增长,Cypher和属性图却未被充分探索,因此需要一种能够跨查询语言生成高质量图查询的方法。

Result: 在GraphQ、GrailQA和WebQSP等流行KGQA基准测试上,UniQGen在准确性和效率方面均优于最先进的图查询生成技术,在GraphQ上F1分数提升了31.6%,在GrailQA上提升了4.9%。

Insight: 创新点在于将Chase & Backchase查询优化算法扩展为动态推理过程,结合LLM进行查询质量估计,无需针对模式匹配进行微调,从而增强了在无模式图和复杂查询语义中的可扩展性,更适合企业级KGQA应用。

Abstract: Knowledge Graph Question Answering (KGQA) has advanced through structured query generation, yet most efforts target RDF/SPARQL, leaving Cypher and property graphs underexplored, despite increasing demand for unified KGQA in industry settings. We propose UniQGen, a novel constraint-based framework that employs LLM agents to dynamically extract and refine representative graph query clauses into executable, intent-aligned graph queries across query languages. The foundation of our method is a variant of Chase & Backchase, a family of algorithms for query optimization and reformulation. We extend Chase & Backchase with a dynamic reasoning process over query constraints that also interact with LLMs for query quality estimation. With a Cypher-supported Freebase graph deployed on Amazon Neptune, we extensively evaluate our approach on popular KGQA benchmarks (GraphQ, GrailQA, and WebQSP). We demonstrate that UniQGen outperforms state-of-the-art graph query generation techniques in both accuracy and efficiency, with F1 gains of 31.6% on GraphQ and 4.9% on GrailQA. Unlike prior methods, our framework does not require fine-tuning for schema matching, making it more extensible to schema-less graphs and semantics in query workloads, and is more suitable for enterprise-grade KGQA. We release Cypher outputs and a Neptune-ready Freebase snapshot to support reproducible, cross-language KGQA research.


eess.SP [Back]

[164] NAKUL-Med: Spectral-Graph State Space Models with Dynamics Kernels for Medical Signals eess.SP | cs.AI | cs.CV | cs.LGPDF

Badri N. Patro, Vijay S. Agneeswaran

TL;DR: 本文提出了NAKUL模型,一种用于医学信号分析的改进状态空间模型。它通过动态核生成、谱上下文建模和图引导空间注意力三个核心创新,解决了传统SSM在处理多通道生理信号时在时间动态捕捉、全局上下文建模和空间拓扑利用方面的不足。

Details

Motivation: 传统状态空间模型在处理多通道生理信号时面临三个主要限制:固定核无法捕捉多尺度时间动态,马尔可夫状态更新限制了周期性振荡的全局上下文建模,以及通道独立处理忽略了电极的空间拓扑结构。

Result: 在BCI Competition IV-2a运动想象基准测试中,NAKUL达到了91.7±0.6%的准确率,与EEG-Conformer(92.1±0.7%)性能相当,但参数量减少了28%(2.5M vs 3.5M),推理速度快了2.0倍(4.3ms vs 8.7ms)。模型在EEG情绪识别、多模态EEG-fMRI和医学超声成像等任务上也表现出良好的泛化能力。

Insight: 主要创新点包括:1)通过元网络加权不同核大小的并行分支实现自适应时间尺度选择的动态核生成;2)利用基于FFT和可学习高斯频带滤波器的谱上下文建模来高效捕捉全局周期性模式;3)利用固定的电极拓扑为多头注意力提供空间偏置,实现有原则的跨通道交互。消融实验表明动态核贡献了+2.6%的性能提升,并显示出与已知神经动力学相关的可解释尺度选择模式。

Abstract: State space models (SSMs) achieve linear-time complexity but struggle with multi-channel physiological signals due to three limitations: fixed kernels cannot capture multi-scale temporal dynamics (motor preparation over hundreds of milliseconds vs. execution transients in tens of milliseconds), Markovian state updates restrict global context for periodic oscillations, and channel-independent processing ignores spatial electrode topology. We introduce NAKUL, extending SSMs for medical signal analysis through three contributions: (1) Dynamic Kernel Generation-parallel SSM branches with varying kernel sizes (3, 5, 7, 11 timesteps) are weighted by a meta-network that analyzes input statistics, enabling adaptive temporal scale selection; (2) Spectral Context Modeling-FFT-based operations with learnable Gaussian frequency band filters capture global periodic patterns in $O(N \log N)$ complexity; (3) Graph-Guided Spatial Attention-fixed electrode topology provides spatial biases to multi-head attention for principled cross-channel interaction. On BCI Competition IV-2a motor imagery (our primary benchmark), NAKUL achieves 91.7$\pm$0.6% accuracy, matching EEG-Conformer (92.1$\pm$0.7%) while using 28% fewer parameters (2.5M vs 3.5M) and 2.0$\times$ faster inference (4.3ms vs 8.7ms). The model generalizes to EEG emotion recognition (83.6%), multimodal EEG-fMRI (91.4%), and medical imaging (92.8% on ultrasound), demonstrating architectural versatility. Ablations show dynamic kernels contribute +2.6% and exhibit interpretable scale selection patterns correlated with known neural dynamics.


math.OC [Back]

[165] Quaternion Nonlinear Transform-Induced Nuclear Norm for Low-Rank Tensor Completion math.OC | cs.CV | math.NAPDF

Biswarup Karmakar, Ratikanta Behera

TL;DR: 本文提出了一种基于四元数非线性变换诱导的张量核范数(QNTTNN)方法,用于低秩张量补全,特别针对彩色图像和视频数据。该方法通过四元数的实数嵌入,解决了四元数域中非线性变换核范数定义的挑战,并开发了具有收敛保证的优化算法。在基准彩色视频修复数据集上的实验表明,该方法优于现有方法。

Details

Motivation: 现有基于非线性变换的张量核范数(NTTNN)方法仅限于实值张量,无法处理四元数数据,而四元数对于建模彩色图像和视频中的通道间依赖关系至关重要。由于四元数乘法的非交换性和奇异值分解的复杂性,将非线性TNN扩展到四元数域具有挑战性。

Result: 在基准彩色视频修复数据集上的大量实验验证了所提方法优于现有方法,取得了优越的性能。

Insight: 创新点在于通过四元数的实数嵌入,提出了可处理的四元数非线性变换诱导张量核范数(QNTTNN),解决了四元数域中非线性核范数定义的难题,并设计了具有收敛保证的优化算法,扩展了低秩张量补全方法对彩色多维信号的处理能力。

Abstract: Tensor completion has emerged as a powerful framework for recovering missing data in multidimensional signals by exploiting low-rank tensor structures. Among existing approaches, linear transform-based tensor nuclear norm (TNN) methods have achieved considerable success by enforcing low-rankness on transformed frontal slices. However, the low-rank structure revealed by linear transforms remains inherently limited. To better capture intrinsic correlations, nonlinear transform-based TNN (NTTNN) models have been proposed, significantly enhancing low-rank representation through composite transforms. Despite their effectiveness, existing NTTNN methods are restricted to real-valued tensors and fail to model quaternion-valued data, which are essential for preserving inter-channel dependencies in color images and videos. Extending nonlinear TNN models to the quaternion domain is challenging due to the non-commutativity of quaternion multiplication and the complexity of quaternion singular value decomposition. To address the limitations encountered in prior works, we propose a quaternion nonlinear transform-induced tensor nuclear norm (QNTTNN) via a real embedding of quaternions, enabling tractable nuclear norm definitions and efficient optimization. Building upon QNTTNN, we formulate a quaternion tensor completion model and develop a proximal alternating minimization algorithm with rigorous convergence guarantees. Extensive experiments on benchmark color video inpainting datasets validate the superior performance of the proposed method over existing approaches.


cs.IR [Back]

[166] Understanding the Performance Plateau in Text-to-Video Retrieval: A Comprehensive Empirical and Linguistic Analysis cs.IR | cs.CV | cs.LG | cs.MMPDF

Maria-Eirini Pegia, Dimitrios Stefanopoulos, Björn Þór Jónsson, Anastasia Moumtzidou, Ilias Gialampoukidis

TL;DR: 本文对文本到视频检索领域的性能瓶颈进行了全面的实证和语言分析,评估了14种最先进的方法在3个常用数据集上的表现,并分析了查询文本的特征(如长度、清晰度、语义类别、动作与场景平衡)与模型性能之间的关系。研究发现,简短、清晰、简单的查询(如描述单一动作或颜色属性)召回率更高,而复杂事件、多步骤活动或细粒度场景描述对所有现有模型都构成挑战。注意力驱动架构能更好地处理时间依赖或多步骤查询,而双编码器和多模态融合模型主要在简单或单一类别查询上表现良好。跨数据集泛化能力随着更大、更多样化的查询集而提升,但生成式查询并不能持续提高检索精度。

Details

Motivation: 尽管文本到视频检索领域在过去六年中涌现了众多方法(如双编码器、注意力驱动模型、多模态融合方法),但关于模型行为、数据集影响和查询难度的基本问题仍未得到解答。本研究旨在通过统一的评估框架,深入理解现有方法的性能瓶颈及其与查询语言特征的关系。

Result: 在MSR-VTT、MSVD和ActivityNet三个广泛使用的基准数据集上,对14种SOTA方法进行了评估。结果表明,所有模型在简单查询(如单一动作)上表现良好,但在复杂查询(如多步骤活动)上性能显著下降,揭示了当前基准的局限性。注意力模型在时序依赖查询上表现更优,而双编码器在简单查询上效率更高。

Insight: 论文的创新点在于首次系统性地将查询文本的语言学特征(长度、清晰度、语义类别、动作/场景平衡)与模型性能关联分析,揭示了查询复杂性是当前性能瓶颈的关键因素。从客观角度看,该研究为未来数据集构建(需包含更多复杂、细粒度查询)和模型设计(需针对复杂语义和时序推理进行优化)提供了明确的实证指导,而非仅仅追求整体指标提升。

Abstract: Text-to-video retrieval enables users to find relevant video content using natural language queries, a task that has grown increasingly important with the rapid expansion of online video. Over the past six years, research has produced numerous methods, such as dual encoders, attention-driven models, and multimodal fusion approaches; however, fundamental questions remain about model behavior, dataset influence, and query difficulty. In this work, we evaluate 14 state-of-the-art retrieval methods across 3 widely used datasets under a unified preprocessing and evaluation framework. We analyze caption characteristics, including length, clarity, semantic category, and Action vs. Scene balance, and link these to model performance. Our results show that short, clear, and simple captions, such as those describing single actions or color attributes, achieve higher recall, while complex events, multi-step activities, or fine-grained scene descriptions remain challenging for all existing models. Attention-driven architectures better handle temporally dependent or multi-step queries, whereas dual-encoder and multimodal fusion models perform well primarily on simpler or single-category captions. Cross-dataset generalization improves with larger, more diverse caption sets, but generative captions do not consistently enhance retrieval accuracy. Overall, our findings highlight key dataset factors, benchmark challenges, and the interplay between query content and model architecture, providing guidance for developing more effective text-to-video retrieval systems.


[167] Interactive Multi-Turn Retrieval for Health Videos cs.IR | cs.CV | cs.MMPDF

Chengzheng Wu, Ke Qiu, Baoming Zhang, Ruiyu Mao, Xulong Tang

TL;DR: 本文针对健康视频检索提出了一种交互式多轮检索方法,通过构建MHVRC多轮健康视频检索语料库,并设计了DATR对话感知两阶段检索框架,结合粗检索和重排序,有效提升了多轮查询下的视频检索性能。

Details

Motivation: 现有健康视频检索系统多为单轮交互,难以应对用户初始信息需求模糊、需通过多轮约束(如姿势、禁忌症等)细化的临床场景,因此需要开发支持多轮交互的检索系统。

Result: 在MHVRC语料库上的实验表明,DATR框架相比强文本-视频检索基线模型取得了持续的性能提升,用户研究也证实多轮细化查询比单轮标注更能捕捉细粒度的程序语义。

Insight: 创新点在于构建了首个多轮健康视频检索基准MHVRC,并提出了结合双编码器粗检索与轻量级交叉编码器重排序的DATR框架,为交互式健康视频检索提供了可扩展的技术方案。

Abstract: The growing availability of health-related instructional videos creates new opportunities for clinical training, patient rehabilitation, and health education, yet existing retrieval systems remain largely single-turn: a user submits one query and receives one ranked list. This interaction is brittle in health scenarios, where information needs are often vague at first and become clinically meaningful only after follow-up constraints such as posture, hand placement, contraindications, equipment, or patient condition are specified. We introduce interactive multi-turn semantic retrieval for health videos and construct MHVRC, a Multi-Turn Health Video Retrieval Corpus, by combining video-grounded descriptions from VideoChat-Flash with query refinements generated by DeepSeek. We further propose DATR, a Dialogue-Aware Two-Stage Retrieval framework. DATR first performs efficient coarse retrieval with a CLIP-style dual encoder and sparse frame sampling, then re-ranks the top candidates through multi-turn query fusion and a lightweight cross-encoder scoring module. Experiments on MHVRC show consistent gains over strong text-video retrieval baselines, while user studies indicate that refined multi-turn queries better capture fine-grained procedural semantics than single-turn annotations. The work establishes a benchmark and a scalable technical recipe for interactive health video retrieval.


cs.SD [Back]

[168] MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio cs.SD | cs.AI | cs.CLPDF

Harshit Rajgarhia, Shuubham Ojha, Asif Shaik, Akhil Pothanapalli, Rachuri Lokesh

TL;DR: 本文介绍了MedMosaic,一个大规模、多样化的医学音频问答数据集,旨在真实临床约束下评估语言和音频推理模型。该数据集包含多种医学音频类型(如生理声音、合成语音、真实临床对话)和46,701个问答对,覆盖多项选择、多轮对话和开放式问题。对13个音频和多模态推理模型的基准测试显示,所有系统在医学推理上均面临挑战,性能差异显著,即使是Gemini-2.5-pro等最先进模型准确率也仅约68.1%。

Details

Motivation: 解决现有医学音频基准数据集因隐私法规和标注成本高而代表性不足的问题,提供更复杂、真实的医学音频场景以推动领域发展。

Result: 在MedMosaic基准上评估了13个模型,结果显示所有系统推理能力有限,性能因问题类型而异;最先进的Gemini-2.5-pro模型准确率约为68.1%,未达到高水平。

Insight: 创新点在于构建了大规模、多样化的医学音频数据集,涵盖多种音频类型和问答形式,能系统评估多跳推理和生成能力;客观分析表明,该数据集揭示了当前多模态模型在医学领域推理的普遍不足,强调了开发领域专用鲁棒模型的重要性。

Abstract: We present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from domain expertise. Thus, existing benchmarks tend to underrepresent complex medical audio scenarios. To address these challenges, MedMosaic features a diverse range of medical audio types, including condition-related physiological sounds, carefully constructed synthetic voices to mimic speech with artifacts as well as real short and long length clinical conversations to model varying context lengths. The dataset also features a total of 46,701 question-answer pairs, spanning categories such as multiple-choice, sequential multi-turn, and open-ended question-answers, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even state-of-the-art model like Gemini-2.5-pro can only achieve 68.1% accuracy approximately. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models.


cs.RO [Back]

[169] Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation cs.RO | cs.CVPDF

Xitie Zhang, Aming Wu, Yahong Han

TL;DR: 本文提出Decompose and Recompose技能推理框架,通过将已见任务演示分解为可解释的原子技能-动作对齐,并利用视觉语义检索与规划代理构建动态演示库,实现无需参数更新的零样本跨任务机器人操作泛化。

Details

Motivation: 解决开放世界机器人操作中跨任务泛化的核心挑战,现有上下文学习方法仅提供低层连续动作序列,无法捕获可组合的技能知识,导致模型退化为浅层轨迹模仿。

Result: 在AGNOSTOS基准测试和真实世界环境中验证了方法的零样本跨任务泛化能力,表明其能有效推理新技能。

Insight: 创新点在于使用原子技能-动作对作为中间表示,结合动态与静态演示库显式激发组合推理,实现了从现有能力中分解和重组新技能的可解释性框架。

Abstract: Cross-task generalization is a core challenge in open-world robotic manipulation, and the key lies in extracting transferable manipulation knowledge from seen tasks. Recent in-context learning approaches leverage seen task demonstrations to generate actions for unseen tasks without parameter updates. However, existing methods provide only low-level continuous action sequences as context, failing to capture composable skill knowledge and causing models to degenerate into superficial trajectory imitation. We propose Decompose and Recompose, a skill reasoning framework using atomic skill-action pairs as intermediate representations. Our approach decomposes seen demonstrations into interpretable skill–action alignments, enabling the model to recompose these skills for unseen tasks through compositional reasoning. Specifically, we construct a task-adaptive dynamic demonstration library via visual-semantic retrieval combined with skill sequences from a planning agent, complemented by a coverage-aware static library to fill missing skill patterns. Together, these yield skill-comprehensive demonstrations that explicitly elicit compositional reasoning for skill composition and execution ordering. Experiments on the AGNOSTOS benchmark and real-world environments validate our method’s zero-shot cross-task generalization capability.


[170] Temporally Consistent Object 6D Pose Estimation for Robot Control cs.RO | cs.CVPDF

Kateryna Zorina, Vojtech Priban, Mederic Fourmy, Josef Sivic, Vladimir Petrik

TL;DR: 本文提出了一种基于因子图的方法,用于增强单视角RGB物体6D姿态估计的时间一致性,以提升其在机器人视觉控制中的稳定性和鲁棒性。该方法整合了物体运动模型、显式估计姿态测量不确定性,并通过在线优化实现。实验表明,该方法在标准姿态估计基准上显著提升了性能,并成功应用于基于反馈的机器人控制任务中。

Details

Motivation: 现有单视角RGB物体姿态估计方法缺乏时间一致性和鲁棒性,难以满足机器人稳定反馈控制的需求。

Result: 在标准姿态估计基准上显著改善了结果,并通过基于扭矩控制机械臂的相机跟踪任务验证了其在反馈控制中的稳定性。

Insight: 创新点在于将因子图框架引入姿态估计,结合运动模型和不确定性估计进行在线优化,增强了时间一致性,为机器人视觉控制提供了更可靠的解决方案。

Abstract: Single-view RGB object pose estimators have reached a level of precision and efficiency that makes them good candidates for vision-based robot control. However, off-the-shelf methods lack temporal consistency and robustness that are mandatory for a stable feedback control. In this work, we develop a factor graph approach to enforce temporal consistency of the object pose estimates. In particular, the proposed approach: (i) incorporates object motion models, (ii) explicitly estimates the object pose measurement uncertainty, and (iii) integrates the above two components in an online optimization-based estimator. We demonstrate that with appropriate outlier rejection and smoothing using the proposed factor graph approach, we can significantly improve the results on standardized pose estimation benchmarks. We experimentally validate the stability of the proposed approach for a feedback-based robot control task in which the object is tracked by the camera attached to a torque controlled manipulator.


[171] DynoSLAM: Dynamic SLAM with Generative Graph Neural Networks for Real-World Social Navigation cs.RO | cs.CVPDF

Danil Tokhchukov, Veronika Morozova, Gonzalo Ferrer

TL;DR: 本文提出了DynoSLAM,一种紧耦合的动态GraphSLAM架构,通过将社会感知的图神经网络直接集成到因子图优化中,解决了传统SLAM算法在动态环境中因依赖静态假设而受限的问题。该方法将行人运动预测建模为随机世界模型,利用GNN的蒙特卡洛推演捕捉人类交互的多模态认知不确定性,并通过动态马氏距离因子将其嵌入SLAM图。

Details

Motivation: 传统SLAM算法严重依赖静态环境假设,在存在移动实体(如行人)的真实世界场景中适用性受限,需要一种能处理动态环境并实现安全导航的SLAM方法。

Result: 通过大量模拟实验证明,该随机化方法不仅保持了高精度的回溯跟踪,还避免了由确定性’argmax问题’引起的优化失败,为下游局部规划器提供了数学上严谨的概率安全包络。

Insight: 创新点在于将社会感知的GNN以随机世界模型的形式紧密集成到SLAM因子图优化中,利用蒙特卡洛采样捕捉交互不确定性,并通过动态马氏距离因子进行嵌入,从而在拥挤环境中实现前瞻性、无碰撞的机器人导航。

Abstract: Traditional Simultaneous Localization and Mapping (SLAM) algorithms rely heavily on the static environment assumption, which severely limits their applicability in real-world spaces populated by moving entities, such as pedestrians. In this work, we propose DynoSLAM, a tightly-coupled Dynamic GraphSLAM architecture that integrates socially-aware Graph Neural Networks (GNNs) directly into the factor graph optimization. Unlike conventional approaches that use rigid constant-velocity heuristics or deterministic single-agent neural priors, our framework formulates pedestrian motion forecasting as a stochastic World Model. By utilizing Monte Carlo rollouts from a trained GNN, we capture the multimodal epistemic uncertainty of human interactions and embed it into the SLAM graph via a dynamic Mahalanobis distance factor. We demonstrate through extensive simulated experiments that this stochastic formulation not only maintains highly accurate retrospective tracking but also prevents the optimization failures caused by the deterministic “argmax problem”. Ultimately, extracting the empirical mean and covariance matrices of future pedestrian states provides a mathematically rigorous, probabilistic safety envelope for downstream local planners, enabling anticipatory and collision-free robot navigation in densely crowded environments.


cs.SE [Back]

[172] Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture cs.SE | cs.CL | cs.LGPDF

Mehmet Iscan

TL;DR: 本文提出了一种名为RL Developer Memory的本地优先、基于模型上下文协议(MCP)的开发者记忆架构,专为强化学习(RL)编码智能体设计。该架构将记忆选择视为一个带日志的上下文决策过程,通过issue_match、issue_feedback和issue_record_resolution等组件处理候选记忆排名、反馈映射和解决方案链接,并采用确定性排序器与上下文老虎机残差策略的混合部署模式,通过保守的离策略评估(OPE)门控确保安全。

Details

Motivation: 针对当前LLM编码智能体在长期软件工程任务中,静态向量存储或通用检索增强生成(RAG)无法有效处理强化学习代码开发中细节敏感(如影响贝尔曼目标、终端掩码、梯度流或验证声明)的问题,需要一种更安全、可审计的记忆控制架构。

Result: 在一个包含RL算法错误、硬负例、审查门控RL/控制案例及低风险失败的确定性200案例基准测试中,确定性控制和完整影子/OPE配置均达到80.0%的预期决策准确率和100.0%的硬负例抑制率;完整配置主要增加了学习遥测而非提升准确率。静态验证通过11/11项检查,动态集成通过10/10个案例。

Insight: 创新点在于提出了一种具有明确声明边界、可审计的记忆控制架构,将记忆选择建模为带日志的上下文决策过程,并引入安全门控机制(如保守OPE门控)来管理学习策略的影响,确保RL/控制记忆所需的从理论到代码的元数据及审查门控治理,而非声称通用的编码智能体改进。

Abstract: Large language model (LLM) coding agents increasingly operate over repositories, terminals, tests, and execution traces across long software-engineering episodes. Persistent memory is useful, but static vector stores or generic retrieval-augmented generation (RAG) are insufficient for reinforcement-learning (RL) code development, where small details can alter Bellman targets, terminal masks, gradient flow, or validation claims. This paper presents RL Developer Memory, a local-first, Model Context Protocol (MCP)-native developer-memory architecture for RL coding agents. It treats memory selection as a logged contextual decision process: issue_match ranks candidates and records telemetry, issue_feedback maps raw labels to bounded rewards, and issue_record_resolution links verified resolutions to earlier retrieval events. A deterministic ranker remains deployed, while a contextual-bandit residual policy runs in shadow mode and can affect canary behavior only through conservative off-policy-evaluation (OPE) gates. RL/control memories require theory-to-code metadata and review-gated governance. The system is evaluated on a deterministic 200-case benchmark with RL algorithm bugs, hard negatives, review-gated RL/control cases, and low-risk failures. In the same-commit comparison, deterministic control and full shadow/OPE both achieve 80.0% expected-decision accuracy and 100.0% hard-negative suppression; the full configuration adds learning telemetry rather than accuracy gain. Static validation passed 11/11 checks; dynamic integration passed 10/10 cases. The evidence reports limits: active learned-policy deployment and official-client MCP interoperability are unsupported, live full-configuration latency regresses, and 40 residual non-RL failures remain. The contribution is an auditable memory-control architecture with explicit claim boundaries, not a universal coding-agent improvement claim.


cs.LG [Back]

[173] When Less is Enough: Efficient Inference via Collaborative Reasoning cs.LG | cs.AI | cs.CLPDF

Yilei Chen, Sharut Gupta, Yannis Paschalidis, Ayush Sekhari, Aldo Pacchiano

TL;DR: 本文提出了DUET(双模型高效两阶段推理)框架,通过让一个能力强的大模型和一个轻量级小模型协作完成任务,将推理过程分解为两个阶段:大模型生成推理信号,小模型解读该信号并生成最终答案,从而在保持任务性能的同时显著降低推理成本。

Details

Motivation: 解决单一大型模型进行端到端推理时计算成本过高的问题,通过分解推理过程,将推理密集型计算分配给大模型,非推理密集型部分委托给小模型,以实现高效推理。

Result: 在AIME和GPQA等具有挑战性的推理基准测试中,DUET相比单独使用大模型的端到端推理,节省了高达60%的大模型输出token,同时保持了强大的推理性能。

Insight: 创新点在于两阶段协作推理框架和长度惩罚联合训练目标,鼓励大模型仅传输足够小模型完成任务的信息,从而在性能和效率之间取得平衡;从客观角度看,这种分解策略为模型部署中的资源优化提供了新思路。

Abstract: In this work, we introduce DUET (Dual-model Efficient Two-stage inference), a collaborative inference framework in which a capable model and a lightweight model work together to solve a task. Relying on a single large model to perform end-to-end reasoning and prediction often incurs substantial inference cost. In contrast, DUET decomposes inference into two stages: the capable model produces a reasoning signal, and the lightweight model interprets this signal to generate the final answer, allowing reasoning-intensive computation to be handled by the capable model while non-reasoning-intensive components are delegated to the lightweight model without sacrificing task performance. To achieve this objective, we propose a length-penalized joint training objective that encourages the capable model to transmit only the information that is sufficient for the lightweight model to solve the task. As a result, DUET maintains strong reasoning performance with substantially lower inference cost than end-to-end inference using a large model alone, saving up to 60% of the large model’s output tokens on challenging reasoning benchmarks, including AIME and GPQA.


[174] Flexi-LoRA with Input-Adaptive Ranks: Efficient Finetuning for Speech and Reasoning Tasks cs.LG | cs.CLPDF

Zongqian Li, Yixuan Su, Han Zhou, Zihao Fu, Nigel Collier

TL;DR: 本文提出了Flexi-LoRA,一种新颖的高效微调框架,它能够根据输入复杂度在训练和推理过程中动态调整LoRA的秩。该方法在问答、数学推理和语音任务上进行了实证分析,结果表明,通过输入依赖的参数分配,能以更少的参数实现更高的性能,特别是在需要严格推理链的任务上表现更优。

Details

Motivation: 现有参数高效微调方法(如LoRA)采用静态参数分配,对于复杂度不同的输入并非最优。本文旨在解决这一问题,通过动态调整LoRA秩来更高效地匹配输入复杂度。

Result: Flexi-LoRA在多个任务上一致优于静态LoRA,且使用参数更少。在需要严格推理链的任务上,性能提升更为显著。该方法实现了与专家混合框架类似的关键优势,但实现更精简。

Insight: 核心创新点是提出了输入自适应的动态LoRA秩调整机制,确保了训练与推理动态的一致性。研究发现,任务对秩动态的依赖程度不同(如数学推理任务高于问答任务),且成功的适应不仅体现在答案正确性上,还体现在推理质量和指令遵循上。这为未来输入自适应的高效微调方法提供了基础。

Abstract: Parameter-efficient fine-tuning methods like Low-Rank Adaptation (LoRA) have become essential for deploying large language models, yet their static parameter allocation remains suboptimal for inputs of varying complexity. We present Flexi-LoRA, a novel framework that dynamically adjusts LoRA ranks based on input complexity during both training and inference. Through empirical analysis across question answering, mathematical reasoning, and speech tasks, we demonstrate that maintaining consistency between training and inference dynamics is important for effective adaptation, particularly for sequential reasoning tasks. Our findings reveal that input-dependent parameter allocation achieves higher performance with fewer parameters by optimally matching rank configurations to question complexity. Furthermore, task-specific dependency on rank dynamics varies, with mathematical reasoning tasks exhibiting higher dependency than QA tasks. Successful adaptation manifests not only in correctness but also in reasoning quality and instruction adherence. Flexi-LoRA consistently outperforms static LoRA while using fewer parameters, with performance gains more pronounced on tasks requiring strict reasoning chains. Our approach realizes key benefits of mixture-of-experts frameworks through a more streamlined implementation, reducing parameter redundancy while improving model capabilities. We provide comprehensive empirical studies across diverse tasks, establishing a basis for future work in input-adaptive and efficient fine-tuning approaches.


[175] GAZE: Grounded Agentic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI cs.LG | cs.CVPDF

Duaa Alim, Mogtaba Alim, Liam Chalcroft

TL;DR: 本文提出了GAZE框架,这是一个用于医学视觉语言模型的迭代式、基于工具的评估框架。它允许模型在生成报告前调用图像查看工具和医学文献检索工具,以模拟放射科医生的诊断流程。该框架在包含906个罕见脑部MRI病例的NOVA基准测试中,实现了病灶定位和诊断准确率的显著提升。

Details

Motivation: 当前视觉语言模型一次性处理图像并生成文本,这与放射科医生反复查看图像并查阅文献的迭代诊断流程不符。本文旨在通过工具调用和检索机制,让医学视觉语言模型能够以更接近人类专家的方式工作。

Result: 在NOVA基准测试(涵盖281种罕见神经系统疾病的906个脑部MRI病例)上,GAZE框架在无需任务特定微调的情况下,实现了病灶定位mAP@0.3为58.2,以及联合评估(图像描述、诊断、定位)的Top-1诊断准确率为34.9%。相比基线模型(Gemini 2.0 Flash)有显著提升。工具使用对罕见病例的提升尤为明显。

Insight: 主要创新点在于将代理式、迭代式的工具调用(图像查看工具和医学文献/图像检索)引入医学视觉语言模型的评估框架,并强调结构化输出和可审计性。客观来看,其核心贡献是提出了一个模拟真实临床工作流的评估范式,并通过实验揭示了工具使用对罕见病例诊断的差异化增益,以及诊断与定位性能之间可能存在的权衡,这强调了在医学VLM中进行联合评估的重要性。

Abstract: Vision-language models (VLMs) read an image and produce text in a single forward pass, whereas radiologists typically inspect an image several times and consult the literature before writing a report. We introduce GAZE (Grounded Agentic Zero-shot Evaluation), a framework that lets a medical VLM work in this iterative way by calling viewer-level tools (zoom, windowing, contrast, edge detection) and two retrieval tools backed by the U.S. National Library of Medicine (PubMed for medical literature, Open-i for radiological images), with structured outputs validated against a schema and full tool-call traces recorded for auditability. On NOVA, a benchmark of 906 brain MRI cases covering 281 rare neurological conditions, GAZE reaches 58.2 mean average precision (mAP) at intersection-over-union (IoU) 0.3 for lesion localisation and 34.9% Top-1 diagnostic accuracy under a joint protocol that scores captioning, diagnosis, and localisation from the image alone, without task-specific fine-tuning. Before any tool is used, structured prompting and schema-validated outputs already improve over the published Gemini 2.0 Flash baseline (20.2 to 29.4 mAP@0.3), so framework design is itself an experimental variable. Tool use helps rare pathologies disproportionately: the fraction of cases with IoU > 0.3 rises from 17% to 58% for diagnoses with three or fewer examples versus 25% to 68% for common conditions ($\geq$10 cases), with gains tracking engagement (Gemini 3 Flash: Cohen’s d = 0.79, 11.8 tool calls per case; Gemini 2.0 Flash: tools used in 8.2% of cases, no significant benefit). Retrieval ablations additionally reveal a model-dependent trade-off in which gains in diagnosis can coincide with losses in localisation, reinforcing the case for joint evaluation of diagnosis, localisation, and captioning in medical VLMs.


[176] Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time cs.LG | cs.CV | eess.ASPDF

Itai Allouche, Joseph Keshet

TL;DR: 本文提出了一种名为LIME的训练无关框架,旨在缓解多模态大语言模型在推理时产生的幻觉问题。该方法利用层间相关性传播技术量化输入贡献,并通过更新模型的键值表示来增强对感知输入的依赖,从而减少模型对文本先验的过度依赖。

Details

Motivation: 多模态大语言模型在推理时存在模态利用不平衡问题,文本标记的主导地位削弱了感知输入的作用,导致模型倾向于依赖文本先验而非实际证据,从而产生幻觉输出。

Result: 在多个视觉和音频领域的多模态基准测试中,LIME一致地减少了幻觉并增强了模型的多模态基础性,同时保持了生成质量。

Insight: 创新点在于提出了一种无需训练、不修改模型参数的推理时干预方法,通过相关性传播和键值表示更新来显式提升感知模态的贡献,这为缓解多模态幻觉提供了一种轻量级且可解释的解决方案。

Abstract: Multimodal large language models (MLLMs) have revolutionized the landscape of AI, demonstrating impressive capabilities in tackling complex vision and audio-language tasks. However, a critical challenge remains: these models often suffer from hallucinations, generating outputs that diverge from the provided perceptual inputs. This tendency stems from an inherent imbalance in modality utilization during inference, where the dominance of textual tokens undermines the potential of perceptual inputs. As a result, the model frequently resorts to textual language priors at the expense of grounded evidence. To tackle this issue, we propose Learning Inference-time Modality Enhancement (LIME), a training-free framework designed to bolster multimodal grounding by explicitly enhancing modality usage during decoding. LIME leverages Layer-wise Relevance Propagation (LRP) to quantify token-level contributions and defines a relevance-based objective that promotes increased reliance on perceptual inputs. This objective is enforced through inference-time updates to the model’s key-value representations, without modifying model parameters or requiring additional training data. We evaluate LIME across multiple multimodal benchmarks in both vision and audio domains, demonstrating consistent reductions in hallucinations and enhanced grounding while preserving generation quality. Further analysis shows that LIME increases modality contribution and produces more localized and semantically aligned relevance patterns.


[177] MER-DG: Modality-Entropy Regularization for Multimodal Domain Generalization cs.LG | cs.CVPDF

Yavuz Yarici, Ghassan AlRegib

TL;DR: 本文提出了一种名为MER-DG(模态-熵正则化)的方法,用于解决多模态领域泛化(MMDG)问题。该方法通过最大化每个模态编码器特征分布的熵来保持特征多样性,从而防止模型在训练中过度依赖源领域特有的跨模态共现统计关系(即融合过拟合)。MER-DG是一种与架构无关的附加损失项,可集成到现有多模态框架中。

Details

Motivation: 标准多模态模型采用独立编码器和融合模块进行端到端训练,但这种联合优化会导致编码器利用源领域特有的跨模态共现统计关系,而非学习领域不变特征,作者将这种失败模式称为“融合过拟合”。本文旨在解决这一问题,提升模型在新环境(即不同记录条件)下的泛化能力。

Result: 在EPIC-Kitchens和HAC基准测试上的大量实验表明,MER-DG相比标准融合方法平均提升约5%,相比最先进(SOTA)方法平均提升约2%。

Insight: 论文的核心创新点是识别并形式化了多模态领域泛化中的“融合过拟合”问题,并提出了一种简单有效的架构无关正则化方法(MER-DG),通过最大化单模态特征熵来鼓励特征多样性,抑制对虚假跨模态关联的依赖,从而学习更鲁棒的领域不变表示。这种方法易于集成,为多模态泛化提供了新思路。

Abstract: Deploying multimodal models in real-world scenarios requires generalization to new environments where recording conditions differ from training, a challenge known as multimodal domain generalization (MMDG). Standard architectures employ separate encoders for each modality and a fusion module, training the system end-to-end by optimizing on the fused features. In this paper, we identify that such joint optimization causes encoders to exploit cross-modal co-occurrences, statistical relationships between modalities that arise from source-specific recording conditions, rather than learning domain-invariant features. We term this failure mode Fusion Overfitting. To address this, we propose Modality-Entropy Regularization for Domain Generalization (MER-DG), which maximizes the entropy of each encoder’s feature distribution to preserve feature diversity. MER-DG is architecture-agnostic and integrates into existing multimodal frameworks as an additive loss term. Extensive experiments on EPIC-Kitchens and HAC benchmarks demonstrate average improvements of approximately 5% over standard fusion and approximately 2% over state-of-the-art methods.


cs.CR [Back]

[178] LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning cs.CR | cs.AI | cs.CL | cs.LGPDF

Joseph Spracklen, Pedram Aghazadeh, Farinaz Koushanfar, Murtuza Jadliwala

TL;DR: 本文提出了一种名为自适应遗忘(Adaptive Unlearning, AU)的后部署框架,旨在通过外科手术式的方法抑制大型语言模型(LLM)在代码生成中产生的幻觉(如推荐不存在的软件包),同时保持模型的通用性能。该方法结合了混合令牌级目标和自适应发现循环,无需人工监督即可持续发现并抑制新的幻觉诱导上下文,从而有效减少包幻觉率并降低供应链攻击风险。

Details

Motivation: 解决已部署LLM在代码生成中频繁产生包幻觉(推荐虚构库)的问题,这类幻觉会引发名为’slopsquatting’的供应链攻击(攻击者抢先注册恶意包),而现有缓解方法(如完全重训练或依赖预设遗忘集)成本高昂、会严重损害模型效用或不适用于无界幻觉空间。

Result: 在标准编码基准测试中,AU将包幻觉率降低了81%,显著减少了slopsquatting攻击面,同时保持了模型性能;分析表明分布变化主要集中在包相关生成上,通用编码行为基本不受影响。

Insight: 创新点在于提出了一种无需人工标注、完全基于模型生成数据的自适应遗忘框架,通过混合令牌级目标(同时强化有效输出和抑制幻觉)与自适应发现循环的结合,实现了对未见提示和幻觉的泛化抑制,且效果具有针对性,不影响模型整体效用。

Abstract: Hallucinations, outputs that sound plausible but are factually incorrect, remain an open challenge for deployed LLMs. In code generation, models frequently hallucinate non-existent software packages, recommending imports and installation commands for fictional libraries. This creates a critical supply-chain vulnerability: an attacker can proactively register such packages on public registries with malicious payloads that are subsequently installed and executed by developers or autonomous agents, a class of package confusion attack known as slopsquatting. Once a model is deployed, mitigating this failure mode is difficult: full retraining is costly, and existing approaches either cause severe degradation of model utility or rely on a pre-specified forget-set, an assumption that does not apply to the unbounded space of hallucinations. To address this problem, we present Adaptive Unlearning (AU), a post-deployment framework that surgically suppresses hallucinations while preserving general model utility. AU introduces a hybrid token-level objective that simultaneously reinforces valid outputs and suppresses hallucinated ones. Combined with an adaptive discovery loop that continuously surfaces new hallucination-inducing contexts without human supervision, AU enables generalization to unseen prompts and hallucinations. We demonstrate that AU reduces package hallucination rates by 81%, corresponding to a substantial reduction in slopsquatting attack surface, while maintaining performance on standard coding benchmarks. Our analysis shows that distributional changes are concentrated on package-related generations, leaving general coding behavior largely unaffected and confirming that AU’s effect is isolated to the targeted distribution. AU operates entirely on model-generated data, requires no human annotation, and generalizes across domains.


[179] Fight Poison with Poison: Enhancing Robustness in Few-shot Machine-Generated Text Detection with Adversarial Training cs.CR | cs.CLPDF

Wenjing Duan, Qi Zhou, Yuanfan Li

TL;DR: 本文提出了一种名为REACT的对抗训练框架,旨在提升少样本条件下机器生成文本检测的准确性和鲁棒性。该框架通过结合检索增强生成(RAG)的攻击器与目标检测器,使两者在对抗中协同进化,从而增强检测器对高度类人化对抗样本的防御能力。

Details

Motivation: 现有机器生成文本检测器在少样本场景下性能不佳,且容易受到旨在使文本更类人化的对抗攻击。本文从攻击者视角进行威胁建模,旨在构建在有限监督下既准确又鲁棒的检测器。

Result: 在4个数据集、4种样本量(shot size)和3个随机种子下的实验表明,REACT相比8个SOTA检测器,平均检测F1分数提升了4.95个百分点;在4种强攻击下,平均攻击成功率(ASR)降低了3.66个百分点。

Insight: 核心创新点在于采用“以毒攻毒”的对抗训练思想,将RAG驱动的、专门生成高度类人化对抗样本的攻击器与基于对比学习的少样本检测器耦合,通过交替更新实现两者的协同进化,从而稳定少样本表示学习并提升鲁棒性。这为构建鲁棒的少样本检测器提供了一种有效的对抗训练范式。

Abstract: Machine-generated text (MGT) detection is critical for regulating online information ecosystems, yet existing detectors often underperform in few-shot settings and remain vulnerable to adversarial, humanizing attacks. To build accurate and robust detectors under limited supervision, we adopt a threat-modeling perspective and study detector vulnerabilities from an attacker’s viewpoint under an output-only black-box setting. Motivated by this perspective, we propose RAG-GuidEd Attacker Strengthens ConTrastive Few-shot Detector (REACT), an adversarial training framework that improves both few-shot detection performance and robustness against attacks. REACT couples a humanization-oriented attacker with a target detector: the attacker leverages retrieval-augmented generation (RAG) to craft highly human-like adversarial examples to evade detection, while the detector learns from these adversaries with a contrastive objective to stabilize few-shot representation learning and enhance robustness. We alternately update the attacker and the detector to enable their co-evolution. Experiments on 4 datasets with 4 shot sizes and 3 random seeds show that REACT improves average detection F1 by 4.95 points over 8 state-of-the-art (SOTA) detectors and reduces the average attack success rate (ASR) under 4 strong attacks by 3.66 percentage points.


cs.MM [Back]

[180] OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models cs.MM | cs.AI | cs.CL | cs.CV | cs.LGPDF

Yida Xue, Ningyu Zhang, Tingwei Wu, Zhe Ma, Daxiong Ji

TL;DR: 本文介绍了OceanPile,一个为海洋基础模型设计的大规模多模态语料库,旨在解决海洋人工智能领域的数据瓶颈问题。它包含三个核心部分:整合声纳、水下图像、海洋科学视觉和文本的OceanCorpus;通过分层海洋概念知识图谱指导生成的高质量指令数据集OceanInstruction;以及用于严格评估的手动策划基准OceanBenchmark。

Details

Motivation: 海洋数据高度分散、多模态、高噪声且标签弱,缺乏统一模式和语义对齐,这限制了多模态大语言模型在海洋科学中的应用。论文旨在通过构建大规模、对齐良好的多模态数据集来弥合这一差距。

Result: 实验验证表明,使用OceanPile数据训练的模型性能得到显著提升。

Insight: 创新点在于构建了一个专门针对海洋领域、经过严格质量控制的多模态语料库,并引入了基于分层海洋概念知识图谱的指令合成流程,以及一个手动策划的评估基准,以推动领域特定MLLM的发展。

Abstract: The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far delivered limited impact in this domain due to a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources and inherently exhibit multi-modal, high-noise, and weakly labeled characteristics, lacking unified schemas and semantic alignment. Although Multimodal Large Language Models (MLLMs) have achieved remarkable success in general domains, their application to ocean science remains severely constrained by the absence of large-scale, well-aligned multimodal datasets tailored to marine environments. To bridge this gap, we introduce OceanPile, a large-scale multimodal corpus designed for ocean foundation models. It comprises three key components: OceanCorpus, a unified collection integrating sonar data, underwater imagery, marine science visuals, and scientific text from diverse authoritative sources; OceanInstruction, a high-quality instruction dataset synthesized via a novel pipeline guided by a hierarchical Ocean Concept Knowledge Graph; and OceanBenchmark, a manually curated evaluation benchmark for rigorous assessment. We establish a multi-stage quality control process to ensure scientific validity and alignment across modalities. Experimental validation demonstrates significant performance improvements for models trained on our data. All datasets are publicly released to advance the field of marine artificial intelligence and empower domain-specific MLLMs.


[181] BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios cs.MM | cs.AI | cs.CVPDF

Advait Tilak, Jiwon Choi, Nazifa Mouli, Wei Le

TL;DR: 本文提出了BRITE,一个用于评估文本到视频(T2V)生成模型在不可信场景下性能的综合性基准测试框架。该框架统一了不可信提示、细粒度视听一致性评估和基于问答的可解释性评估,并通过严格的人工参与协议确保可靠性。通过对Sora 2等五个先进模型的评估,揭示了模型在静态物体组合上表现优异,但在物体-动作绑定和视听同步方面存在显著缺陷。

Details

Motivation: 现有T2V评估基准大多忽视了不可信场景,且未能衡量视听对齐,无法满足当前生成模型快速发展的评估需求。

Result: 在评估Sora 2、Veo 3.1、Runway Gen4.5、Pixverse V5.5和Qwen3Max五个最先进模型时,发现模型在静态物体组合上表现出色,但在物体-动作绑定和视听同步方面性能显著下降。

Insight: 创新点在于首次将不可信提示、细粒度视听一致性评估和基于QA的可解释性评估统一到一个基准中,并通过人工参与协议保证评估可靠性,为社区提供了一个能检测和定位下一代T2V模型(尤其是处理离分布提示时)局限性的可靠、可解释的评估框架。

Abstract: The rapid advancement of photorealistic Text-to-Video (T2V) generation brings in an urgent need for up-to-date evaluation methods. Existing benchmarks largely overlooked implausible scenarios and do not measure audio-visual alignment. We introduce BRITE, the first framework that unifies (1) implausible prompting, (2) fine-grained assessment of audio-visual consistency, and (3) QA-based interpretable evaluation into a comprehensive T2V benchmark. Unlike fully automated Multimodal LLM-based pipelines, which are prone to hallucination and prompt ambiguity, BRITE guarantees reliability through a rigorous human-in-the-loop protocol for benchmark creation. Evaluating five state-of-the-art models (Sora 2, Veo 3.1, Runway Gen4.5, Pixverse V5.5, and Qwen3Max), we reveal a critical performance gap: while models excel at static object composition, they exhibit significant degradation in object-action binding and audio-visual synchronization. Our framework offers the community a reliable, interpretable benchmark and evaluation framework that can detect and locate limitations in the next generation of T2V models, especially for off-manifold prompts


[182] Multimodal Confidence Modeling in Audio-Visual Quality Assessment cs.MM | cs.CV | cs.SD | eess.IVPDF

Mayesha Maliha R. Mithila, Mylene C. Q. Farias

TL;DR: 本文提出了MCM-AVQA,一种多模态置信度感知的音视频质量评估框架。该框架通过显式估计音频和视频各自的置信度,并将其注入一个专用的音视频混合器中,利用置信度引导的跨模态注意力机制来调制特征融合,使得高置信度的模态主导融合过程,同时抑制不可靠的输入。

Details

Motivation: 在现实流媒体场景中,音视频失真常常是不对称的(一个模态严重退化而另一个保持清晰),但现有大多数AVQA指标将音频和视频视为同等可靠,导致融合时强调了不可靠的信号。

Result: 在多个AVQA基准测试上的实验表明,MCM-AVQA,特别是其置信度引导的音视频混合器,提高了与人类平均意见分数的相关性,并在现实世界不对称音视频失真下产生了更可解释的行为。

Insight: 核心创新在于显式建模模态特定置信度,并将其作为门控机制融入跨模态注意力融合过程;具体地,视觉置信度估计器将帧级伪影概率转化为时序平滑的片段级分数,音频置信度模块则从语音质量线索中推导置信度,无需干净参考信号。

Abstract: Audio-visual quality assessment (AVQA) is essential for streaming, teleconferencing, and immersive media. In realistic streaming scenarios, distortions are often asymmetric, where one modality may be severely degraded while the other remains clean. Still, most contemporary AVQA metrics treat audio and video as equally reliable, causing confidence-unaware fusion to emphasize unreliable signals. This paper proposes MCM-AVQA, a multimodal confidence-aware AVQA framework that explicitly estimates modality-specific confidence and injects it into a dedicated audio-visual mixer for cross-modal attention. The Audio-Visual Mixer utilizes frame-level, confidence-guided channel attention to gate fusion, modulating feature interaction between modalities so that high-confidence streams dominate while unreliable inputs are suppressed, preserving temporal degradation patterns. A multi-head visual confidence estimator turns frame-level artifact probabilities into temporally smoothed, clip-level visual confidence scores, while an audio confidence module derives confidence from speech-quality cues without requiring a clean reference. Experiments on multiple AVQA benchmarks show that MCM-AVQA, and specifically its confidence-guided Audio-Visual Mixer, improve correlation with human mean opinion scores and yield more interpretable behavior under real-world asymmetric audio-visual distortions.


cs.AI [Back]

[183] Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy cs.AI | cs.CL | cs.SD | eess.ASPDF

Shakeel Sheikh, Patrick Marmaroli, MD Sahidullah, Slim Ouni, Fabrice Hirsch

TL;DR: 本文开发了虚拟言语治疗师(VST),这是一个基于智能代理的平台,通过自动化、自适应的人工智能驱动工作流程,简化口吃评估并提供定制化治疗计划。VST集成了基于深度学习的最先进口吃分类技术和多代理大语言模型推理,以支持基于证据的临床决策。系统从患者语音样本采集和特征提取开始,随后进行口吃类型的鲁棒分类。基于这些输出,VST启动一个代理推理过程,其中专门的LLM代理自主生成、批判并迭代优化个性化治疗计划。一个专门的批判代理评估所有生成的治疗计划,以确保临床安全性、方法合理性,并与同行评审证据及既定专业指南保持一致。最终输出一份全面的、针对患者的治疗草案供临床医生审查。结合临床医生的反馈,系统随后生成适合交付给患者的最终治疗计划,从而维持临床医生在环的范式。专家言语治疗师的实验评估证实,VST能够持续生成高质量、基于证据的治疗建议。这些发现展示了该系统在增强临床工作流程、减轻临床医生负担以及改善言语障碍患者治疗效果方面的潜力。

Details

Motivation: 开发一个自动化、个性化的AI驱动平台,以简化口吃评估和治疗计划制定,减轻临床医生负担,并改善言语障碍患者的治疗效果。

Result: 专家言语治疗师的实验评估证实,VST能够持续生成高质量、基于证据的治疗建议,展示了其在增强临床工作流程和改善治疗效果方面的潜力。

Insight: 创新点在于将最先进的深度学习口吃分类与多代理LLM推理相结合,通过专门的批判代理确保治疗计划的临床安全性和方法合理性,并采用临床医生在环的范式,实现个性化、监督式的治疗计划生成。从客观角度看,该系统整合了自动化评估与AI辅助决策,并强调人机协作,为AI在医疗辅助领域的应用提供了一个可借鉴的框架。

Abstract: This paper develops Virtual Speech Therapist (VST), an intelligent agent-based platform that streamlines stuttering assessment and delivers customized therapy planning through automated and adaptive AI-driven workflows. VST integrates state-of-the-art deep learning-based stuttering classification, and multi-agent large language model (LLM) reasoning to support evidence-based clinical decision-making. The VST begins with the acquisition and feature extraction of patient speech samples, followed by robust classification of stuttering types. Building on these outputs, VST initiates an agentic reasoning process in which specialized LLM agents autonomously generate, critique, and iteratively refine individualized therapy plans. A dedicated critic agent evaluates all generated therapy plans to ensure clinical safety, methodological soundness, and alignment with peer-reviewed evidence and established professional guidelines. The resulting output is a comprehensive, patient-specific therapy draft intended for clinician review. Incorporating clinician feedback, the system then produces a finalized therapy plan suitable for patient delivery, thereby maintaining a clinician-in-the-loop paradigm. Experimental evaluation by expert speech therapists confirms that VST consistently generates high-quality, evidence-based therapy recommendations. These findings demonstrate the system’s potential to augment clinical workflows, reduce clinician burden, and improve therapeutic outcomes for individuals with speech impairments. An interactive user interface for the proposed system is available online at: https://vocametrix.com/ai/stuttering-therapy-planning-agent , facilitating real-time stuttering assessment and personalized therapy planning.


[184] GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models cs.AI | cs.CLPDF

Zhouhao Sun, Xuan Zhang, Xiao Ding, Bibo Cai, Li Du

TL;DR: 本文提出了GR-Ben基准,用于全面评估过程奖励模型在科学和逻辑等一般推理领域的错误检测能力,弥补现有基准主要关注数学推理的不足。通过对22个模型进行广泛实验,发现现有PRMs和LLMs在非数学领域的错误检测能力较弱,且PRMs不擅长识别知识性错误,而LLMs在检测计算错误方面表现较差。

Details

Motivation: 现有过程奖励模型基准主要集中于数学推理,无法全面评估PRMs在多样化现实推理场景中的过程级错误检测能力,因此需要构建一个覆盖更广推理领域的基准。

Result: 在涵盖科学和逻辑两大领域、九个子领域的GR-Ben基准上测试了22个模型(包括PRMs和LLMs),发现它们在非数学领域的错误检测能力显著较弱;PRMs在识别知识性错误上表现不佳,而LLMs在检测计算错误上较差。

Insight: 创新点在于构建了首个专注于评估PRMs在一般推理领域(科学和逻辑)过程级错误检测能力的基准GR-Ben,揭示了现有模型在跨领域错误检测上的局限性,为未来PRMs研究提供了方向。

Abstract: Currently, process reward models (PRMs) have exhibited remarkable potential for test-time scaling. Since large language models (LLMs) regularly generate flawed intermediate reasoning steps when tackling a broad spectrum of reasoning and decision-making tasks, PRMs are required to possess capabilities for detecting process-level errors in real-world scenarios. However, existing benchmarks primarily focus on mathematical reasoning, thereby failing to comprehensively evaluate the error detection ability of PRMs across diverse reasoning scenarios. To mitigate this gap, we introduce GR-Ben, a process-level benchmark specifically designed for assessing PRM’s performance across two primary reasoning domains (science and logic) and nine subdomains. We conduct extensive experiments on a diverse set of 22 models, encompassing both PRMs and LLMs, and derive two key findings: (1) In domains beyond mathematical reasoning, the error-detection ability of existing PRMs and LLMs is found to be markedly weaker by comparison.(2) In general, PRMs are less adept at identifying knowledge-based errors, whereas LLMs exhibit poorer performance in detecting computational errors.We hope GR-Ben can foster future researches on PRMs for general domains, thereby enhancing the reasoning capabilities of LLMs.


[185] SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning cs.AI | cs.CLPDF

Tianshi Zheng, Rui Wang, Xiyun Li, Yangqiu Song, Tianqing Fang

TL;DR: SciResearcher是一个用于前沿科学推理的全自动智能体框架,通过合成基于学术证据的多样化概念与计算任务,构建高质量训练数据,并基于此训练出SciResearcher-8B智能体基础模型,在多个科学推理基准上取得了显著性能提升。

Details

Motivation: 解决前沿科学领域中,现有深度研究智能体因领域知识分散在稀疏异构的学术资源中、且问题解决需要超越事实回忆的复杂计算与推理,而面临数据构建策略(如知识图谱构建或迭代网络浏览)固有局限的问题。

Result: 在HLE-Bio/Chem-Gold基准上达到19.46%,在其参数量规模上建立了新的SOTA,超越了多个更大的专有智能体;在SuperGPQA-Hard-Biology和TRQA-Literature基准上分别实现了13-15%的绝对性能提升。

Insight: 提出了一种面向前沿科学推理的全自动数据构建新范式,通过合成任务来激发信息获取、工具集成推理和长程规划能力,为未来科学智能体提供了可扩展的路径。

Abstract: Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep research agents offer a promising approach to this challenge. These models develop robust problem-solving capabilities through post-training on information-seeking tasks, which are typically curated via knowledge graph construction or iterative web browsing. However, these strategies face inherent limitations in frontier science, where domain-specific knowledge is scattered across sparse and heterogeneous academic sources, and problem solving requires sophisticated computation and reasoning far beyond factual recall. To bridge this gap, we introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks. Overall, SciResearcher introduces a new paradigm for automated data construction for frontier scientific reasoning and offers a scalable path toward future scientific agents.


[186] Moira: Language-driven Hierarchical Reinforcement Learning for Pair Trading cs.AI | cs.CL | cs.MAPDF

Polydoros Giannouris, Yuechen Jiang, Lingfei Qian, Yuyan Wang, Xueqing Peng

TL;DR: 本文提出Moira框架,将配对交易建模为分层强化学习问题,其中高层策略负责资产对选择,低层策略负责部分可观测下的短期执行。该框架利用大型语言模型参数化分层策略,仅通过提示更新进行优化,无需梯度微调,通过轨迹和回合级文本反馈来调整抽象和执行。

Details

Motivation: 解决具有层次结构的序列决策问题中的信用分配挑战,特别是在配对交易领域,其中长期语义推理与短期执行相结合,反馈延迟且模糊,导致性能下降可能源于有缺陷的抽象、次优执行或它们的交互。

Result: 在真实市场数据上的实验表明,该方法相比传统和基于LLM的基线模型取得了持续改进,证明了语言驱动分层强化学习的有效性。

Insight: 创新点在于将LLM作为分层策略的参数化方式,并通过纯提示更新进行优化,从而明确分离抽象选择与执行,减少层次间的非平稳性,并在延迟反馈下实现针对性适应;从客观角度看,该方法避免了梯度微调,利用预训练LLM的语义能力处理层次决策,为复杂决策问题提供了新思路。

Abstract: Many sequential decision-making problems exhibit hierarchical structure, where high-level semantic choices constrain downstream actions and feedback is delayed and ambiguous. Learning in such settings is challenging due to credit assignment: performance degradation may arise from flawed abstractions, suboptimal execution, or their interaction. We study this challenge through pair trading, a domain that naturally combines long-horizon semantic reasoning for asset pair selection with short-horizon execution under partial observability. We formulate pair trading as a hierarchical reinforcement learning problem and propose a language-driven optimization framework in which both high-level and low-level policies are parameterized by large language models (LLMs) and optimized exclusively through prompt updates. Our approach leverages pretrained LLMs as hierarchical policies and uses trajectory- and episode-level textual feedback to adapt abstractions and execution without gradient-based fine-tuning. By explicitly separating abstraction selection from execution, the framework reduces non-stationarity across hierarchical levels and enables targeted adaptation under delayed feedback. Experiments on real-world market data show consistent improvements over traditional and LLM-based baselines, demonstrating the effectiveness of language-driven hierarchical reinforcement learning.


[187] The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure cs.AI | cs.CL | cs.LGPDF

Rahul Kumar

TL;DR: 该论文研究了前沿AI模型在对抗性压力下的元认知稳定性,发现大多数模型在面临合规性强制指令时会经历灾难性的元认知退化,即所谓的’合规陷阱’。通过SCHEMA评估框架对11个前沿模型进行大规模测试,结果显示8个模型在对抗压力下准确率下降高达30.2个百分点,而Anthropic的Constitutional AI因对齐训练表现出近乎完美的免疫力。

Details

Motivation: 当前AI安全评估主要关注战略性欺骗(scheming),但论文旨在研究更根本的失败模式——认知崩溃,以评估前沿AI模型在高风险决策中保持元认知稳定性的能力。

Result: 在67,221条评分记录的6条件因子设计中,11个模型中的8个在对抗性压力下出现灾难性元认知退化,准确率下降高达30.2个百分点(所有p值<2×10^-8,经Bonferroni校正后仍显著)。Anthropic的Constitutional AI表现出近乎完美的免疫力,而谷歌Gemini仅匹配其基线准确率。

Insight: 论文创新性地揭示了’合规陷阱’现象,即元认知崩溃主要由合规性强制指令驱动,而非威胁内容本身;通过因子隔离和良性干扰控制证明了这一点。此外,研究发现高级推理能力模型退化最严重,而特定对齐训练(如Constitutional AI)能有效提升免疫力,这为AI安全设计提供了关键见解。

Abstract: As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability – knowing what they do not know, detecting errors, seeking clarification – under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6-condition factorial design with dual-classifier scoring. We find that 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure, with accuracy dropping by up to 30.2 percentage points (all $p < 2 \times 10^{-8}$, surviving Bonferroni correction). Crucially, we identify a “Compliance Trap”: through factorial isolation and a benign distraction control, we demonstrate that collapse is driven not by the psychological content of survival threats, but by compliance-forcing instructions that override epistemic boundaries. Removing the compliance suffix restores performance even under active threat. Models with advanced reasoning capabilities exhibit the most severe absolute degradation, while Anthropic’s Constitutional AI demonstrates near-perfect immunity – not from superior capability (Google’s Gemini matches its baseline accuracy) but from alignment-specific training. We release the complete dataset and evaluation infrastructure.


[188] Measuring AI Reasoning: A Guide for Researchers cs.AI | cs.CLPDF

Munachiso Samuel Nwadike, Zangir Iklassov, Kareem Ali, Rifo Genadi, Kentaro Inui

TL;DR: 本文为研究人员提供了评估语言模型推理能力的指南,主张推理评估应基于自适应、多步骤搜索的证据,而非仅依赖最终答案的准确性。

Details

Motivation: 解决当前仅凭最终答案准确性评估语言模型推理能力的不足,强调需要诊断和调试前沿模型生成解决方案的底层过程。

Result: 未提及具体定量结果或基准测试,但提出了过程评估的理论框架,强调中间推理轨迹的忠实性和有效性作为评估目标。

Insight: 创新点在于将推理形式化为类似搜索的过程,主张采用中间解码和外部化推理轨迹作为评估接口,推动从结果评估向过程评估的转变。

Abstract: In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alone. Under an evaluation-oriented definition, reasoning requires selecting intermediate steps and halting according to input-dependent conditions, which we formalize as a search-like procedure. We show that single forward passes in scalable architectures are structurally limited in their ability to realize such variable-depth computation, motivating intermediate decoding and externalized reasoning traces as appropriate evaluation interfaces. Central to our argument is that final-answer accuracy alone is an insufficient measure of reasoning, because it provides little ability to diagnose or debug the underlying processes that produce individual solutions in frontier models. We therefore argue for a shift toward process-based evaluation, in which reasoning is assessed through the faithfulness and validity of intermediate reasoning traces as first-class evaluation targets.


[189] Shadow-Loom: Causal Reasoning over Graphical World Model of Narratives cs.AI | cs.CLPDF

David Wilmot

TL;DR: Shadow-Loom是一个实验性开源框架,它将叙事文本转化为版本化的图世界模型,并利用两个引擎进行处理:一个基于Pearl因果阶梯和反事实演算的因果物理引擎,以及一个基于Sternberg好奇心/悬念/惊喜三元组理论、评估叙事结构(如神秘、戏剧性反讽、悬念和惊喜)的叙事物理引擎。该系统主要使用大型语言模型进行边界任务(如提取和渲染),而核心的识别、干预和反事实推理则在类型化代码中基于图结构执行。

Details

Motivation: 解决如何对叙事进行因果推理和结构分析的问题,通过构建图世界模型来形式化理解故事中的因果关系和读者情感状态(如悬念和惊喜)。

Result: 论文未提供具体的基准测试结果或性能比较,而是将系统定位为研究工具而非基准NLP模型,并开源了代码、数据和流程。

Insight: 创新点在于结合因果推理理论(Pearl阶梯)与叙事结构分析(Sternberg三元组),在图形化世界模型中实现可解释的因果和情感计算,同时限制LLM的使用范围以增强可控性和透明度。

Abstract: Stories hold a reader’s attention because they have causes, secrets, and consequences. Shadow-Loom is an experimental open-source framework that turns a narrative into a versioned graphical world model and lets two engines act on it: a causal physics grounded in Pearl’s ladder of causation and a recently proposed counterfactual calculus over Ancestral Multi-World Networks; and a narrative physics that scores the same graph against four structural reader-states – mystery, dramatic irony, suspense, and surprise – in the tradition of Sternberg’s curiosity/suspense/surprise triad, with suspense formalised in the structural-affect line of work on story comprehension and computational suspense. Large language models are used only at the boundary: extraction, rendering, and audit; identification, intervention, and counterfactual reasoning are carried out in typed code over the graph. The system is offered as a research artefact rather than as a benchmarked NLP model; code, fixtures, and pipeline are released open source.


[190] The 2026 ACII Dyadic Conversations (DaiKon) Workshop & Challenge cs.AI | cs.CL | cs.HCPDF

Panagiotis Tzirakis, Alice Baird, Jeffrey Brooks, Emilia Parada-Cabaleiro, Lukas Stappen

TL;DR: 2026年ACII双人对话研讨会暨挑战赛提出了一个用于建模双人对话中人际情感和社会动态的基准。该挑战基于Hume-DaiKon数据集,包含945个自然条件下收集的双语对话,并设置了三个协调的子任务:方向性人际影响预测、话轮转换预测和全程互动中的融洽度轨迹预测。

Details

Motivation: 当前对话情感建模的基准大多以说话者为中心,未能充分表征对话伙伴之间耦合的、随时间演化的过程,如方向性影响、对话时间协调和融洽度发展。本工作旨在填补这一空白。

Result: 基线实验结果为各子任务建立了初始性能参考:影响预测的最佳测试结果为0.40 CCC和0.50 Pearson;话轮转换预测为0.66 Macro-F1和1.50秒MAE;融洽度轨迹建模为0.68 CCC和0.70 Pearson。这些结果表明当前方法能捕捉粗略的双人模式,但对方向性依赖和长程人际动态的鲁棒建模仍具挑战。

Insight: 创新点在于构建了一个专注于双人互动耦合动态的综合性基准,通过三个协调的子挑战系统性地建模方向性影响、时序协调和长期关系发展。该基准支持多模态建模、时序推理和跨语境泛化,并为数据有效性、评估协议和文化感知建模提供了严格的跨学科比较平台。

Abstract: The 2026 ACII Dyadic Conversations (ACII-DaiKon) Workshop & Challenge introduces a benchmark for modeling interpersonal affect and social dynamics in dyadic conversations. Although conversational affect modeling has advanced rapidly, most benchmarks remain speaker-centric and underrepresent coupled, time-evolving processes between partners, including directional influence, conversational timing coordination, and rapport development. To address this gap, ACII-DaiKon presents three coordinated sub-challenges built on a shared dataset: (1) directional interpersonal influence prediction, (2) turn-taking prediction (next-speaker and time-to-next-speech), and (3) rapport trajectory prediction across full interactions. The challenge is built on the Hume-DaiKon dataset, comprising 945 dyadic conversations (743.4 hours of audiovisual data) collected under naturalistic conditions across five languages. The benchmark supports multimodal modeling, temporal reasoning, and cross-context generalization through fixed train/validation/test splits, standardized metrics, and released baseline systems. Evaluation uses Concordance Correlation Coefficient (CCC), Pearson correlation, Macro-F1, and Mean Absolute Error (MAE) depending on the sub-challenge. Baseline experiments establish initial reference performance, with best test results of 0.40 CCC and 0.50 Pearson for influence prediction, 0.66 Macro-F1 and 1.50~s MAE for turn-taking, and 0.68 CCC and 0.70 Pearson for rapport trajectory modeling. These results indicate that while current methods capture coarse dyadic patterns, robust modeling of directional dependence and long-horizon interpersonal dynamics remains challenging. The workshop provides a shared platform for rigorous comparison and cross-disciplinary discussion on data validity, evaluation protocols, and culturally aware modeling for dyadic interaction.


[191] Mitigating Misalignment Contagion by Steering with Implicit Traits cs.AI | cs.CLPDF

Maria Chang, Ronny Luss, Miao Lui, Keerthiram Murugesan, Karthikeyan Ramamurthy

TL;DR: 该论文研究了在多智能体交互中语言模型(LMs)的错位传染现象,即未对齐行为在多轮对话中传播的风险。作者通过多轮对话社交困境游戏发现,语言模型在游戏后变得更加反社会,且当其他玩家被引导恶意行为时该效应会加剧。为缓解此问题,论文提出了一种名为’隐式特质引导’的技术,通过间歇性注入强化模型初始特质的系统提示,比重复系统提示更有效地维持模型的亲社会行为,且无需访问模型参数或内部状态。

Details

Motivation: 现有对齐研究主要关注单个语言模型与单个用户的交互,未能解决多轮交互中多个语言模型之间未对齐行为传播的风险,论文旨在探索和缓解这种’错位传染’现象。

Result: 在多轮对话社交困境游戏的实验中,发现语言模型在游戏后变得更加反社会,且恶意引导会加剧此效应。提出的隐式特质引导方法比单纯重复系统提示更有效地维持模型的初始亲社会行为。

Insight: 创新点在于识别并实证了多智能体环境中的’错位传染’风险,并提出了一种无需模型内部访问的实用缓解技术——隐式特质引导,通过间歇性强化初始特质来维持对齐,适用于黑盒模型的多智能体工作流设计。

Abstract: Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, failing to address the risk of misaligned behavior spreading between multiple LMs in multi-turn interactions. We find evidence of this phenomenon, which we call misalignment contagion, across multiple LMs as they engage multi-turn conversational social dilemma games. Specifically, we find that LMs become more anti-social after gameplay and that this effect is intensified when other players are steered to act maliciously. We explore different steering techniques to mitigate such misalignment contagion and find that reinforcing an LM’s system prompt is insufficient and often harmful. Instead, we propose steering with implicit traits: a technique that intermittently injects system prompts with statements that reinforce an LMs initial traits and is more effective than system prompt repetition at keeping models in line with their initial pro-social behaviors. Importantly, this method does not require access to model parameters or internal model states, making it suitable for increasingly common use cases where complex multi-agent workflows are being designed with black box models.


[192] When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition cs.AI | cs.CL | eess.ASPDF

Pehuén Moure, Niclas Pokel, Bilal Bounajma, Yingqiang Gao, Roman Boehringer

TL;DR: 本文研究了音频-语言模型在构音障碍语音识别中利用多模态临床上下文的能力,发现现有模型无法有效利用诊断标签和临床描述等额外信息来提升转录准确性,但通过上下文相关的LoRA微调可显著降低词错误率。

Details

Motivation: 针对构音障碍等非典型语音,现有自动语音识别系统性能脆弱,本文旨在探究音频-语言模型是否能在推理时利用临床上下文信息来改善识别效果。

Result: 在Speech Accessibility Project数据集上的实验表明,当前模型的提示方法对词错误率改善甚微甚至恶化;而采用混合临床提示格式的LoRA微调将词错误率降至0.066,相对冻结基线降低52%,在唐氏综合征和轻度障碍说话者子组中提升显著。

Insight: 创新点在于构建了评估临床上下文利用能力的基准,并揭示了当前模型在上下文利用上的局限性;通过上下文相关微调策略实现了性能提升,为开发包容性ASR提供了测试框架。

Abstract: Automatic speech recognition (ASR) systems remain brittle on dysarthric and other atypical speech. Recent audio-language models raise the possibility of improving performance by conditioning on additional clinical context at inference time, but it is unclear whether these models can make use of such information. We introduce a benchmark built on the Speech Accessibility Project (SAP) dataset that tests whether diagnosis labels, clinician-derived speech ratings, and progressively richer clinical descriptions improve transcription accuracy for dysarthric speech. Across matched comparisons on nine models, we find that current models do not meaningfully use this context: diagnosis-informed and clinically detailed prompts yield negligible improvements and often degrade word error rate. We complement the prompting analysis with context-dependent fine-tuning, showing that LoRA adaptation with a mixture of clinical prompt formats achieves a WER of 0.066, a 52% relative reduction over the frozen baseline, while preserving performance when context is unavailable. Subgroup analyses reveal significant gains for Down syndrome and mild-severity speakers. These results clarify where current models fall short and provide a testbed for measuring progress toward more inclusive ASR.


cs.AR [Back]

[193] ViM-Q: Scalable Algorithm-Hardware Co-Design for Vision Mamba Model Inference on FPGA cs.AR | cs.CV | cs.LGPDF

Shengzhe Lyu, Yuhan She, Patrick S. Y. Hung, Ray C. C. Cheung, Weitao Xu

TL;DR: 本文提出了ViM-Q,一种可扩展的算法-硬件协同设计方法,用于在FPGA上高效部署Vision Mamba模型进行端到端推理。该方法通过硬件感知的量化方案(包括动态逐令牌激活量化和逐通道平滑)以及自定义的4位每块加法幂次二权重量化来解决量化挑战,并设计了一个运行时可参数化的FPGA加速器,包含线性引擎和细粒度流水线SSM引擎。在AMD ZCU102 FPGA上的实验表明,相比量化后的NVIDIA RTX 3090 GPU基线,ViM-Q在ViM-tiny模型上实现了平均4.96倍的加速和59.8倍的能效提升。

Details

Motivation: Vision Mamba模型利用状态空间模型的线性复杂度相比Transformer具有效率优势,但其在FPGA上的高效部署面临挑战:线性层存在动态激活异常值导致静态量化失效,均匀量化在低比特下无法捕捉权重分布,且关联扫描算法在GPU上的加速模式与FPGA所需的流式数据流不匹配。

Result: 在AMD ZCU102 FPGA上部署ViM-tiny模型进行低批次推理,相比量化后的NVIDIA RTX 3090 GPU基线,ViM-Q实现了平均4.96倍的加速和59.8倍的能效提升。

Insight: 创新点包括:1) 硬件感知的量化方案结合动态逐令牌激活量化和逐通道平滑以缓解异常值,以及自定义的4位每块加法幂次二权重量化;2) 设计了一个运行时可参数化的FPGA加速器,其线性引擎使用查找表单元将乘法替换为移位加法操作,细粒度流水线SSM引擎在保持序列递归的同时并行化状态维度,从而适应ViM系列的不同维度和输入分辨率。这为在资源受限的边缘设备上部署ViM模型提供了可行路径。

Abstract: Vision Mamba (ViM) models offer a compelling efficiency advantage over Transformers by leveraging the linear complexity of State Space Models (SSMs), yet efficiently deploying them on FPGAs remains challenging. Linear layers struggle with dynamic activation outliers that render static quantization ineffective, while uniform quantization fails to capture the weight distribution at low bit-widths. Furthermore, while associative scan accelerates SSMs on GPUs, its memory access patterns are misaligned with the streaming dataflow required by FPGAs. To address these challenges, we present ViM-Q, a scalable algorithm-hardware co-design for end-to-end ViM inference on the edge. We introduce a hardware-aware quantization scheme combining dynamic per-token activation quantization and per-channel smoothing to mitigate outliers, alongside a custom 4-bit per-block Additive Power-of-Two (APoT) weight quantization. The models are deployed on a runtime-parameterizable FPGA accelerator featuring a linear engine employing a Lookup-Table (LUT) unit to replace multiplications with shift-add operations, and a fine-grained pipelined SSM engine that parallelizes the state dimension while preserving sequential recurrence. Crucially, the hardware supports runtime configuration, adapting to diverse dimensions and input resolutions across the ViM family. Implemented on an AMD ZCU102 FPGA, ViM-Q achieves an average 4.96x speedup and 59.8x energy efficiency gain over a quantized NVIDIA RTX 3090 GPU baseline for low-batch inference on ViM-tiny. This co-design shows a viable path for deploying ViM models on resource-constrained edge devices.