cs.CL [Total: 15]
cs.CV [Total: 55]
cs.DB [Total: 1]
eess.IV [Total: 1]
cs.IR [Total: 1]
cs.SE [Total: 1]
cs.LG [Total: 4]
cs.AI [Total: 1]
cs.CR [Total: 1]
cs.RO [Total: 1]
cs.HC [Total: 1]

cs.CL

[1] Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing cs.CL

Ruchira Dhar, Anders Søgaard

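TL;DR: Through a scoping review, this paper surveys the long-running discussion of evaluation methodology in natural language processing (NLP) and builds a taxonomy of evaluation concerns, aiming to give the current debate over large language model (LLM) evaluation historical context and a structured reference.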

Details

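Motivation: Recent work has questioned how LLMs are evaluated, but the paper points out that these critiques have long been discussed in NLP. By systematically reviewing the historical literature, it aims to give contemporary evaluation practice a firmer theoretical footing and a structured perspective.

Result: The main outcome is a taxonomy of evaluation concerns in NLP, together with a structured checklist built on it to support more deliberate evaluation design and interpretation of results.

Insight: The novelty lies in placing the current LLM evaluation debate within NLP's long history of methodological reflection. The systematic review and taxonomy yield an integrated, structured reference framework that supports more comprehensive and level-headed reasoning about evaluation practice.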

Abstract: Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices. However, many such critiques have already been extensively debated in natural language processing (NLP): a field with a long history of methodological reflection on evaluation. We conduct a scoping review of research on evaluation concerns in NLP and develop a taxonomy, synthesizing recurring positions and trade-offs within each area. We also discuss practical implications of the taxonomy, including a structured checklist to support more deliberate evaluation design and interpretation. By situating contemporary debates within their historical context, this work provides a consolidated reference for reasoning about evaluation practices.

[2] MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese cs.CL | cs.IR

Tiago Teixeira, Ana Carolina Erthal, Juan Belieni, Beatriz Canaverde, Diego Mesquita

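TL;DR: This paper introduces MATH-PT, a math reasoning benchmark for European and Brazilian Portuguese containing 1,729 native problems drawn from Portuguese and Brazilian mathematics competitions and exams. It evaluates current state-of-the-art LLMs on the dataset and finds that frontier reasoning models do well on multiple-choice questions but degrade on questions with figures and on open-ended questions.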

Details

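Motivation: Existing math reasoning evaluations show a marked linguistic bias: most benchmark datasets are English-only or translated from English, and high-quality non-English math reasoning datasets are lacking.

Result: On MATH-PT, frontier reasoning models perform strongly on multiple-choice questions but degrade on questions with figures and on open-ended questions; open-weight models perform worse overall.

Insight: The novelty lies in building the first high-quality, natively sourced Portuguese math reasoning benchmark. It exposes performance differences on non-English math problems, in particular weak handling of figures and open-ended questions, and provides a valuable resource for multilingual math reasoning research.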

Abstract: The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a significant linguistic bias, with the vast majority of benchmark datasets being exclusively in English or (at best) translated from English. We address this limitation by introducing MATH-PT, a novel dataset comprising 1,729 mathematical problems written in European and Brazilian Portuguese. MATH-PT is curated from a variety of high-quality native sources, including mathematical Olympiads, competitions, and exams from Portugal and Brazil. We present a comprehensive benchmark of current state-of-the-art LLMs on MATH-PT, revealing that frontier reasoning models achieve strong performance in multiple choice questions compared to open weight models, but that their performance decreases for questions with figures or open-ended questions. To facilitate future research, we release the benchmark dataset and model outputs.

[3] CogRAG+: Cognitive-Level Guided Diagnosis and Remediation of Memory and Reasoning Deficiencies in Professional Exam QA cs.CL

Xudong Wang, Zilong Wang, Zhaoyan Ming

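TL;DR: This paper proposes CogRAG+, a training-free framework for professional-domain QA. It targets the opaque reasoning of LLMs and the entanglement of retrieval with reasoning, which lead to knowledge gaps and logical inconsistencies. Through Reinforced Retrieval and cognition-stratified Constrained Reasoning, it decouples the retrieval-augmented generation pipeline and aligns it with human cognitive hierarchies, improving performance on professional exams such as the Registered Dietitian qualification exam.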

Details

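Motivation: In professional tasks, retrieval and reasoning in existing LLMs are tightly coupled and opaque, which tends to produce missing knowledge and inconsistent reasoning. The paper addresses this with a framework aligned with human cognitive hierarchies, in order to improve professional-domain QA.

Result: On the Registered Dietitian qualification exam, CogRAG+ outperforms general-purpose models and standard RAG methods with both Qwen3-8B and Llama3.1-8B. In single-question mode it raises overall accuracy to 85.8% for Qwen3-8B and 60.3% for Llama3.1-8B, with clear gains over vanilla baselines. Its Constrained Reasoning module also cuts the unanswered rate from 7.6% to 1.4%.

Insight: The claimed novelty is a training-free, model-agnostic framework that decouples the RAG pipeline and aligns it with human cognitive hierarchies through Reinforced Retrieval (a dual-path strategy with fact-centric and option-centric paths) and cognition-stratified Constrained Reasoning (structured templates in place of unconstrained chain-of-thought generation). Viewed objectively, structuring and stratifying retrieval to mirror human cognition, and tightly constraining the reasoning path to reduce redundancy and errors, offers a new angle on hallucination and logical inconsistency in professional domains.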

Abstract: Professional domain knowledge underpins human civilization, serving as both the basis for industry entry and the core of complex decision-making and problem-solving. However, existing large language models often suffer from opaque inference processes in which retrieval and reasoning are tightly entangled, causing knowledge gaps and reasoning inconsistencies in professional tasks. To address this, we propose CogRAG+, a training-free framework that decouples and aligns the retrieval-augmented generation pipeline with human cognitive hierarchies. First, we introduce Reinforced Retrieval, a judge-driven dual-path strategy with fact-centric and option-centric paths that strengthens retrieval and mitigates cascading failures caused by missing foundational knowledge. We then develop cognition-stratified Constrained Reasoning, which replaces unconstrained chain-of-thought generation with structured templates to reduce logical inconsistency and generative redundancy. Experiments on two representative models, Qwen3-8B and Llama3.1-8B, show that CogRAG+ consistently outperforms general-purpose models and standard RAG methods on the Registered Dietitian qualification exam. In single-question mode, it raises overall accuracy to 85.8% for Qwen3-8B and 60.3% for Llama3.1-8B, with clear gains over vanilla baselines. Constrained Reasoning also reduces the unanswered rate from 7.6% to 1.4%. CogRAG+ offers a robust, model-agnostic path toward training-free expert-level performance in specialized domains.

[4] LLMs Generate Kitsch cs.CL

Xenia Klinge, Stefan Ortlieb, Alexander Koller

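TL;DR: This paper examines the tendency of large language models (LLMs) to systematically produce kitsch when generating pictures, texts, music, videos, and other works that traditionally required human creativity. The authors argue that this follows from how LLMs are trained, and show empirically that readers do perceive LLM-generated stories as kitschier once their definition of kitsch is controlled for. The paper discusses implications for the design of future studies and for creative tasks such as research and coding.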

Details

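Motivation: To resolve the tension between LLM-generated works being rated highly in studies and yet coming across as generic and hollow, and to identify the root cause of their systematic production of kitsch.

Result: The empirical study shows that, controlling for the definition of kitsch, readers perceive LLM-generated stories as kitschier than human-written ones, which supports the paper's central claim.

Insight: The novelty lies in framing the perceived shortcomings of LLM-generated content as kitsch and attributing them to the training mechanism. This suggests that future studies need finer-grained assessments of generated content quality, and it may influence how LLMs are applied in creative tasks.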

Abstract: Large Language Models (LLMs) are increasingly used to generate pictures, texts, music, videos, and other works that have traditionally required human creativity. LLM-generated artifacts are often rated better than human-generated works in controlled studies. At the same time, they can come across as generic and hollow. We propose to resolve this tension by arguing that LLMs systematically generate kitsch, and that this is a consequence of the way in which they are trained. We also show empirically that readers perceive LLM-generated stories as kitschier, if we control for their definition of “kitsch”. We discuss implications for the design of future studies and for creative tasks such as research and coding.

[5] Anchored Confabulation: Partial Evidence Non-Monotonically Amplifies Confident Hallucination in LLMs cs.CL

Ashish Balkishan Lathkar

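TL;DR: This paper identifies a new calibration property of large language models: supplying one confirmed intermediate fact in a multi-step reasoning chain raises the model's confident-wrong-answer rate before full evidence eliminates the error, a phenomenon the author calls anchored confabulation. It is formalized as Parametric Hallucination Confidence (PHC) and supported by a causal injection experiment and a capability scaling analysis. The paper also proposes the Anchoring Threshold Law to predict how PHC varies with the number of reasoning hops, and builds a LearnedRouter that exploits PHC to substantially improve RAG routing without fine-tuning.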

Details

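Motivation: To expose a previously unknown calibration property of LLMs, namely that partial evidence non-monotonically amplifies confident hallucination, and to address the way an anchored intermediate fact leads the model to complete the reasoning chain confidently but wrongly.

Result: In the causal injection experiment, PHC moves from 0.613 to 0.656, 0.595, and 0.536 (N=160). Capability scaling across five model families gives a Spearman correlation of rho=0.900 (p=0.037). On 1,800 queries across four benchmarks, LearnedRouter reaches a macro F1 of 0.426 (p<1e-6) and closes 81.1% of the oracle performance gap with no model fine-tuning and 50x fewer labels than prior RL-based work.

Insight: Contributions include: 1) the anchored confabulation phenomenon and the PHC metric; 2) the Anchoring Threshold Law k*(n)=floor(n/3), which predicts PHC amplification by hop depth; 3) a LearnedRouter that exploits PHC for efficient RAG routing; 4) evidence that an epistemic humility prompt lowers the PHC spike by 0.118 and that explicit self-rating (PHC=0.684) outperforms lexical confidence as a routing signal.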

Abstract: We identify a previously unknown calibration property of large language models: providing one confirmed intermediate fact toward a multi-step reasoning chain increases the model’s confident-wrong-answer rate before full evidence eliminates it. We call this anchored confabulation: a partial anchor commits the model to confident parametric completion of remaining reasoning steps. We formalize it as Parametric Hallucination Confidence (PHC) and establish it across six lines of evidence including a causal injection experiment (PHC 0.613 to 0.656 to 0.595 to 0.536, N=160) and capability scaling across five model families (Spearman rho=0.900, p=0.037). The Anchoring Threshold Law k*(n)=floor(n/3) predicts PHC amplification by hop depth with four confirmed predictions. Applied to RAG routing, a LearnedRouter exploiting PHC closes 81.1% of the oracle performance gap (macro F1=0.426, p<1e-6) on 1,800 queries across four benchmarks with no model fine-tuning and 50x fewer labels than prior RL-based work. An epistemic humility prompt reduces the PHC spike by -0.118; explicit self-rating (PHC=0.684, p<0.001) outperforms lexical confidence as a routing signal.
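The Anchoring Threshold Law is stated in the abstract only as the formula k*(n)=floor(n/3). A minimal sketch of that arithmetic follows; the function name `anchoring_threshold` is hypothetical, and how the author uses k* beyond "predicting PHC amplification by hop depth" is not specified here, so the code evaluates the formula and nothing more.

```python
def anchoring_threshold(n_hops: int) -> int:
    """Evaluate k*(n) = floor(n / 3) for a reasoning chain with n hops.

    Only the formula comes from the abstract; the function name and the
    input validation are illustrative assumptions.
    """
    if n_hops < 0:
        raise ValueError("n_hops must be non-negative")
    # Integer floor division implements floor(n / 3) exactly for n >= 0.
    return n_hops // 3


if __name__ == "__main__":
    # Tabulate the threshold for a few hop depths.
    for n in range(1, 10):
        print(f"n={n} hops -> k*={anchoring_threshold(n)}")
```

Under this formula the threshold stays at 0 for chains of one or two hops and increases by one for every three additional hops.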

[6] A Systematic Comparison of Prompting and Multi-Agent Methods for LLM-based Stance Detection cs.CL

Genan Dai, Zini Chen, Yi Yang, Bowen Zhang

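TL;DR: This paper systematically compares LLM-based stance detection methods. It evaluates five methods in two categories, prompt-based inference (Direct Prompting, Auto-CoT, StSQA) and agent-based debate (COLA, MPRF), on four datasets with 14 subtasks, using 15 LLMs with parameter sizes from 7B to 72B+.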

Details

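Motivation: Existing studies differ in data splits, base models, and evaluation protocols, which makes fair comparison of LLM-based stance detection methods difficult. The paper addresses this through systematic experiments.

Result: The experiments find that: 1) on all models with complete results, the best prompt-based method outperforms the best agent-based method, while agent methods need 7 to 12 times more API calls per sample; 2) model scale affects performance more than method choice, with gains plateauing around 32B parameters; 3) reasoning-enhanced models (e.g., DeepSeek-R1) do not consistently outperform general models of the same size on this task.

Insight: The contribution is a large-scale, standardized, fair comparison of prompting and multi-agent methods for LLM stance detection, presented as the first of its kind. It highlights the key role of model scale, the efficiency advantage of prompting, and the limits of current reasoning models on this task, giving empirical grounds for method selection.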

Abstract: Stance detection identifies the attitude of a text author toward a given target. Recent studies have explored various LLM-based strategies for this task, from zero-shot prompting to multi-agent debate. However, existing works differ in data splits, base models, and evaluation protocols, making fair comparison difficult. We conduct a systematic comparison that evaluates five methods across two categories – prompt-based inference (Direct Prompting, Auto-CoT, StSQA) and agent-based debate (COLA, MPRF) – on four datasets with 14 subtasks, using 15 LLMs from six model families with parameter sizes from 7B to 72B+. Our experiments yield several findings. First, on all models with complete results, the best prompt-based method outperforms the best agent-based method, while agent methods require 7 to 12 times more API calls per sample. Second, model scale has a larger impact on performance than method choice, with gains plateauing around 32B. Third, reasoning-enhanced models (DeepSeek-R1) do not consistently outperform general models of the same size on this task.

[7] Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens cs.CL

Zhenyu Zhao, Sander Land, Dan Bikel, Waseem Alshikh

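TL;DR: This paper proposes compressing LLM reasoning with entropy-guided supertokens. It divides reasoning tokens into low-entropy structural tokens and higher-entropy organic tokens, derives supertokens from the model's own reasoning traces using cross-word BPE merges, and then teaches the model to use them through supervised fine-tuning. Across three model families and five mathematical reasoning benchmarks, the method shortens reasoning traces by 8.1% on average with no statistically significant accuracy loss. Supertokens also serve as interpretable reasoning-move annotations for analyzing model strategy and diagnosing faulty traces.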

Details

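Motivation: LLM reasoning incurs substantial inference-time compute, yet the token-level information structure of reasoning traces remains underexplored. The authors observe a functional asymmetry among reasoning tokens, which motivates compressing the reasoning process.

Result: Across three model families and five mathematical reasoning benchmarks, the method shortens reasoning traces by 8.1% on average, with no statistically significant accuracy loss on any model-benchmark pair.

Insight: The novelty lies in the entropy-based functional classification of tokens (structural vs. organic) and the model-agnostic compression pipeline it motivates. Besides compressing traces without statistically significant accuracy loss, the resulting supertokens provide interpretable reasoning-move annotations that can be used to analyze high-level strategy (backtracking, verification, strategy shifts) and to diagnose error patterns (such as confusion cycles). These offer potential diagnostic signals for reward shaping and early stopping in RL-based reasoning training.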

Abstract: Reasoning in Large Language Models incurs significant inference-time compute, yet the token-level information structure of reasoning traces remains underexplored. We observe that reasoning tokens split into two functional types: low-entropy structural tokens (recurring phrases that scaffold the reasoning process) and higher-entropy organic tokens (problem-specific content that drives toward a solution). This asymmetry motivates a simple, model-agnostic compression pipeline: apply cross-word BPE merges on a model’s own reasoning traces to derive supertokens that capture frequent structural patterns, then teach the model to adopt them via supervised fine-tuning. Across three model families and five mathematical reasoning benchmarks, our approach shortens reasoning traces by 8.1% on average with no statistically significant accuracy loss on any model–benchmark pair. Beyond compression, supertokens act as interpretable reasoning-move annotations (backtracking, verification, strategy shifts), exposing the model’s high-level strategy at a glance. Analyzing transitions between structural categories reveals systematic differences between correct and incorrect traces: correct traces show productive recovery (backtracking followed by strategy shifts and verification), while incorrect traces are dominated by confusion cycles (repeated hedging and unresolved contradictions). These diagnostic signals suggest applications in reward shaping and early stopping for RL-based reasoning training.

[8] Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI cs.CL | cs.AI | cs.IR

Saurabh K. Singh, Sachin Raj

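TL;DR: This paper presents EnterpriseDocBench, a unified framework for evaluating enterprise document AI pipelines. On a corpus of documents from six enterprise domains, it jointly evaluates parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness. The study tests three retrieval pipelines (BM25, dense embedding, and a hybrid), finds hybrid retrieval slightly ahead, and reports that hallucination rate varies non-monotonically with document length and that quality is only weakly correlated across stages.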

Details

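Motivation: Enterprise document AI is usually a pipeline of parsing, indexing, retrieval, and generation. Each stage has been studied in depth, but there is no unified method for evaluating the system end to end. The paper asks how to systematically evaluate the overall performance of complex multimodal document processing pipelines.

Result: On EnterpriseDocBench, hybrid retrieval narrowly beats BM25 on nDCG@5 (0.92 vs. 0.91), and both beat dense embedding (0.83). Factual accuracy in generation is 85.5%, but answer completeness averages only 0.40. Hallucination rates are higher for short documents (28.1%) and very long documents (23.8%) than for medium-length ones (9.2%). Cross-stage quality correlations are very weak (parsing->retrieval r=0.14, parsing->generation r=0.17, retrieval->generation r=0.02).

Insight: The main contribution is an end-to-end unified evaluation framework that assesses several key stages of a document AI pipeline together. One notable finding is that quality does not cascade between stages as strongly as is usually assumed, which challenges conventional wisdom. In addition, the large gap between factual accuracy (high) and answer completeness (low) gives real deployments an evaluation dimension that matters more than headline accuracy alone.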

Abstract: Most enterprise document AI today is a pipeline. Parse, index, retrieve, generate. Each of those stages has been studied to death on its own – what’s still hard is evaluating the system as a whole. We built EnterpriseDocBench to take a swing at it: parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness, all on the same corpus. The corpus is built from public, permissively licensed documents across six enterprise domains (five represented in the current pilot). We ran three pipelines through it – BM25, dense embedding, and a hybrid – all with the same GPT-5 generator. The headline numbers: hybrid retrieval narrowly beats BM25 (nDCG@5 of 0.92 vs. 0.91), and both beat dense embedding (0.83). Hallucination doesn’t grow monotonically with document length – short documents and very long ones both hallucinate more than medium ones (28.1% and 23.8% vs. 9.2%). Cross-stage correlations are very weak: parsing->retrieval r=0.14, parsing->generation r=0.17, retrieval->generation 0.02. If quality were cascading the way most of us assume, those numbers would be much higher; they aren’t. Design caveats are real (parsing fixed, generator shared, automated proxy metrics) and we don’t oversell the result. One result that genuinely surprised us: factual accuracy on stated claims is 85.5%, but answer completeness averages 0.40. The system is right when it answers – it just leaves things out. That gap matters more for real deployments than the headline accuracy number does. We also describe three reference architectures (ColPali, ColQwen2, agentic complexity-based routing) which are not yet integrated end-to-end. Framework, metrics, baselines, and collection scripts will be released open-source on acceptance.

[9] Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation cs.CL | cs.AI

Akhil Rajeev P, Annarao Kulkarni

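TL;DR: This paper introduces Naamah, a large-scale, high-quality silver-standard Sanskrit named entity recognition (NER) dataset of 102,942 sentences, built by combining entity extraction from DBpedia with generation by a 24B-parameter hybrid reasoning model. The dataset is used to benchmark two transformer architectures, XLM RoBERTa and IndicBERTv2.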

Details

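Motivation: Digitisation of classical Sanskrit literature is held back by a scarcity of annotated resources, particularly for NER. Existing data augmentation with generic LLMs is error-prone and lacks the reasoning depth that classical grammar requires.

Result: The paper uses the generated Naamah dataset to benchmark the massively multilingual XLM RoBERTa and the parameter-efficient IndicBERTv2, but the abstract gives no specific quantitative results or comparison with the state of the art.

Insight: The novelty is a methodology that combines entity extraction from a knowledge base (DBpedia) with generation by a dedicated hybrid reasoning model to create grammatically natural and synthetically diverse training data. This offers a new route to high-quality synthetic data for low-resource classical languages.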

Abstract: The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition. While recent methodologies utilise generic Large Language Models (LLMs) for data augmentation, these approaches remain prone to error and often lack the reasoning depth required for classical grammar. In this work, we introduce Naamah, a high quality silver standard Sanskrit NER dataset comprising 102,942 sentences. We propose a methodology that combines entity extraction from DBpedia with the generative capabilities of a 24B parameter hybrid reasoning model to create grammatically natural and synthetically diverse training data. We utilize this dataset to benchmark two transformer architectures: the massive multilingual XLM RoBERTa and the parameter efficient IndicBERTv2.

[10] Multimodal LLMs are not all you need for Pediatric Speech Language Pathology cs.CL

Darren Fürst, Sebastian Steindl, Ulrich Schäfer

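TL;DR: This paper proposes a hierarchical approach to classifying Speech Sound Disorders (SSD) in children. On the SLPHelmUltraSuitePlus benchmark, fine-tuned Speech Representation Models (SRM) combined with targeted data augmentation substantially outperform existing LLM-based methods and mitigate biases identified in prior work.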

Details

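Motivation: SSD affects children, yet speech-language pathologists are in severely short supply and carry excessive caseloads. The paper aims to support clinical diagnosis with automated methods.

Result: On every clinical task in the SLPHelmUltraSuitePlus benchmark, the SRM models outperform the LLM-based state of the art by a large margin.

Insight: The novelty lies in a cascading hierarchical approach, from binary classification to type and symptom classification, combined with targeted data augmentation to mitigate bias. Viewed objectively, the work underlines the effectiveness of fine-tuning specialized SRMs for a specific domain such as speech pathology, relative to general-purpose multimodal LLMs.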

Abstract: Speech Sound Disorders (SSD) affect roughly five percent of children, yet speech-language pathologists face severe staffing shortages and unmanageable caseloads. We test a hierarchical approach to SSD classification on the granular multi-task SLPHelmUltraSuitePlus benchmark. We propose a cascading approach from binary classification to type, and symptom classification. By fine-tuning Speech Representation Models (SRM), and using targeted data augmentation we mitigate biases found by previous works, and improve upon all clinical tasks in the benchmark. We also treat Automatic Speech Recognition (ASR) with our data augmentation approach. Our results demonstrate that SRM consistently outperform the LLM-based state-of-the-art across all evaluated tasks by a large margin. We publish our models and code to foster future research.

[11] SAGE: A Strategy-Aware Graph-Enhanced Generation Framework For Online Counseling cs.CL

Eliya Naomi Aharon, Meytal Grimland, Avi Segal, Loona Ben Dayan, Inbar Shenfeld

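TL;DR: SAGE is a strategy-aware, graph-enhanced generation framework for online counseling. It builds a heterogeneous graph that unifies conversational dynamics with a psychologically grounded theory layer, and uses a Graph-Aware Attention mechanism to project structural signals into soft prompts that guide an LLM toward responses with clinical depth.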

Details

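Motivation: General-purpose LLMs lack clinical reasoning: they cannot integrate psychological frameworks, real-time distress signals, and strategic intervention planning, whereas effective counseling depends on exactly this complex, theory-driven process.

Result: In both automated metrics and expert human evaluation, SAGE outperforms baselines in strategy prediction and recommended response quality.

Insight: The novelty lies in combining structured clinical knowledge (such as a theory-driven psychological lexicon) with generative AI, explicitly anchoring interactions in a heterogeneous graph and using Graph-Aware Attention to strengthen the LLM's clinical reasoning. The system can serve as a decision-support tool for high-stakes crisis counseling.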

Abstract: Effective mental health counseling is a complex, theory-driven process requiring the simultaneous integration of psychological frameworks, real-time distress signals, and strategic intervention planning. This level of clinical reasoning is critical for safety and therapeutic effectiveness but is often missing in general-purpose Large Language Models (LLMs). We introduce SAGE (Strategy-Aware Graph-Enhanced), a novel framework designed to bridge the gap between structured clinical knowledge and generative AI. SAGE constructs a heterogeneous graph that unifies conversational dynamics with a psychologically grounded layer, explicitly anchoring interactions in a theory-driven lexicon. Our architecture first employs a Next Strategy Classifier to identify the optimal therapeutic intervention. Subsequently, a Graph-Aware Attention mechanism projects graph-derived structural signals into soft prompts, conditioning the LLM to generate responses that maintain clinical depth. Validated through both automated metrics and expert human evaluation, SAGE outperforms baselines in strategy prediction and recommended response quality. By providing actionable intervention recommendations, SAGE serves as a cutting-edge decision-support tool designed to augment human expertise in high-stakes crisis counseling.

[12] From Black-Box Confidence to Measurable Trust in Clinical AI: A Framework for Evidence, Supervision, and Staged Autonomy cs.CL | cs.AI | cs.CY

Serhii Zabolotnii, Viktoriia Holinko, Olha Antonenko

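TL;DR: This paper proposes a practical framework for trustworthy clinical AI built around three principles: evidence, supervision, and staged autonomy. By combining a deterministic core, a patient-specific AI assistant, a multi-tier model escalation mechanism, and a human supervision layer, it designs trust as a measurable system property rather than relying only on model accuracy or user impressions.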

Details

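Motivation: Trust in clinical AI cannot be reduced to model accuracy or user impressions. It needs to be engineered as a measurable system property grounded in evidence, supervision, and operational boundaries.

Result: The abstract reports no specific quantitative experiments or benchmarks. It does propose a set of trust metrics built on metrological principles (measurement uncertainty, calibration, traceability) to enable quantitative assessment of each architectural layer.

Insight: The novelty lies in treating trust as an outcome of system architecture rather than a property of a single model. Modular prompting, staged autonomy, and embedded supervision allow clinical depth to be scaled incrementally without sacrificing performance, giving trustworthy AI actionable design principles and metrics.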

Abstract: Trust in clinical artificial intelligence (AI) cannot be reduced to model accuracy, fluency of generation, or overall positive user impression. In medicine, trust must be engineered as a measurable system property grounded in evidence, supervision, and operational boundaries of AI autonomy. This article proposes a practical framework for trustworthy clinical AI built around three principles: evidence, supervision, and staged autonomy. Rather than replacing deterministic clinical logic wholesale with end-to-end black-box models, the proposed approach combines a deterministic core, a patient-specific AI assistant for contextual validation, a multi-tier model escalation mechanism, and a human supervision layer for verification, escalation, and risk control. We demonstrate that trust also depends on selective verification of clinically critical findings, bounded clinical context, disciplined prompt architecture, and careful evaluation on realistic cases. Classifier-driven modular prompting is examined as an incremental path to scaling clinical depth without sacrificing prompt performance and without waiting for complete rule-based coverage. To operationalize trust, a set of trust metrics is proposed, built on metrological principles – measurement uncertainty, calibration, traceability – enabling quantitative rather than subjective assessment of each architectural layer. In this perspective, trustworthy clinical AI emerges not as a property of an individual model, but as an architectural outcome of a system into which evidence trails, human oversight, tiered escalation, and graduated action rights are embedded from the outset.

[13] HalluCiteChecker: A Lightweight Toolkit for Hallucinated Citation Detection and Verification in the Era of AI Scientists cs.CL | cs.AI | cs.DL

Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

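TL;DR: This paper introduces HalluCiteChecker, a lightweight toolkit for detecting and verifying hallucinated citations in scientific papers. It formalizes hallucinated citation detection as an NLP task and aims to reduce the burden on reviewers and authors of manually verifying that citations are genuine.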

Details

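Motivation: AI assistant technologies such as citation recommendation are changing academic writing, but they have also produced hallucinated citations (citations to works that do not exist), which undermine a paper's credibility and add to the reviewing burden.

Result: The toolkit is lightweight and efficient: it completes verification offline in seconds on a standard laptop using only CPUs.

Insight: The novelty lies in formalizing hallucinated citation detection as a concrete NLP task and providing a ready-to-use, fully offline, lightweight toolkit, which gives systematic pre-review and publication checks a practical foundation.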

Abstract: We introduce HalluCiteChecker, a toolkit for detecting and verifying hallucinated citations in scientific papers. While AI assistant technologies have transformed the academic writing process, including citation recommendation, they have also led to the emergence of hallucinated citations that do not correspond to any existing work. Such citations not only undermine the credibility of scientific papers but also impose an additional burden on reviewers and authors, who must manually verify their validity during the review process. In this study, we formalize hallucinated citation detection as an NLP task and provide a corresponding toolkit as a practical foundation for addressing this problem. Our package is lightweight and can perform verification in seconds on a standard laptop. It can also be executed entirely offline and runs efficiently using only CPUs. We hope that HalluCiteChecker will help reduce reviewer workload and support organizers by enabling systematic pre-review and publication checks. Our code is released under the Apache 2.0 license on GitHub and is distributed as an installable package via PyPI. A demonstration video is available on YouTube.

[14] ClawGym: A Scalable Framework for Building Effective Claw Agents cs.CL | cs.AI | cs.LG

Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang

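TL;DR: ClawGym is a scalable framework that supports the full development lifecycle of Claw-style personal agents, covering synthesis of verifiable training data and its integration with agent training and diagnostic evaluation. Concretely, the authors build ClawGym-SynData, a diverse dataset of 13.5K filtered tasks, train the ClawGym-Agents model family, and establish ClawGym-Bench, a benchmark of 200 instances.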

Details

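Motivation: Development of multi-step workflows in Claw-style environments lacks a systematic framework, in particular one that synthesizes verifiable training data and integrates it with agent training and diagnostic evaluation.

Result: The authors build the ClawGym-SynData dataset (13.5K tasks) and the ClawGym-Bench benchmark (200 instances), and train the ClawGym-Agents model family through supervised fine-tuning and reinforcement learning. The abstract gives no specific quantitative results or comparison with the state of the art.

Insight: Contributions include: 1) a systematic framework covering the full lifecycle of Claw agents; 2) diverse training data synthesized from persona-driven intents and skill-grounded operations; 3) benchmark calibration that combines automated filtering with human-LLM review; 4) a lightweight reinforcement learning pipeline that parallelizes rollouts across sandboxes.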

Abstract: Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources will soon be released at https://github.com/ClawGym.

[15] Select to Think: Unlocking SLM Potential with Local Sufficiency cs.CL

Wenxuan Ye, Yangyang Zhang, Xueli An, Georg Carle, Yunpu Ma

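TL;DR: This paper proposes SELECT TO THINK (S2T). It builds on "local sufficiency", the observation that the token a large language model (LLM) prefers usually lies within a small language model's (SLM's) top-K predictions, and reframes the LLM's role from open-ended generation to selection among the SLM's candidate tokens. On this basis it introduces S2T-LOCAL, which distills the selection logic into the SLM so that it can re-rank autonomously, substantially improving SLM reasoning without calling an LLM at inference time.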

Details

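Motivation: SLMs have limited reasoning ability because of their capacity, while existing remedies fall short: invoking an LLM adds latency and cost, and standard distillation is often ineffective.

Result: Experiments show that the top-8 candidates of a 1.5B SLM capture the choice of a 32B LLM with a 95% hit rate. S2T-LOCAL improves greedy decoding by 24.1% on average across benchmarks, matching 8-path self-consistency while keeping single-trajectory efficiency.

Insight: The core novelty is the discovery of local sufficiency, which lets the LLM's supervision signal be simplified from a complex generative distribution to discrete rankings over the SLM's candidates. By distilling this selection logic, S2T-LOCAL gives the SLM autonomous re-ranking ability, an efficient and low-cost way to narrow the reasoning gap between SLMs and LLMs.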

Abstract: Small language models (SLMs) offer computational efficiency for scalable deployment, yet they often fall short of the reasoning power exhibited by their larger counterparts (LLMs). To mitigate this gap, current approaches invoke an LLM to generate tokens at points of reasoning divergence, but these external calls introduce substantial latency and costs. Alternatively, standard distillation is often hindered by the capacity limitation, as SLMs struggle to accurately mimic the LLM’s complex generative distribution. We address this dilemma by identifying local sufficiency: at divergence points, the LLM’s preferred token consistently resides within the SLM’s top-K next-token predictions, even when failing to emerge as the SLM top-1 choice. We therefore propose SELECT TO THINK (S2T), which reframes the LLM’s role from open-ended generation to selection among the SLM’s proposals, simplifying the supervision signal to discrete candidate rankings. Leveraging this, we introduce S2T-LOCAL, which distills the selection logic into the SLM, empowering it to perform autonomous re-ranking without inference-time LLM dependency. Empirically, we demonstrate that a 1.5B SLM’s top-8 candidates capture the 32B LLM’s choice with 95% hit rate. Translating this potential into performance, S2T-LOCAL improves greedy decoding by 24.1% on average across benchmarks, effectively matching the efficacy of 8-path self-consistency while operating with single-trajectory efficiency.
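The "local sufficiency" measurement and the selection step can be sketched in a few lines. This is an illustrative sketch only: logits are plain Python lists indexed by token id, the function names `top_k_ids`, `top_k_hit_rate`, and `select_among_candidates` are hypothetical, and the paper's actual handling of divergence points and its distillation procedure are not reproduced.

```python
def top_k_ids(logits, k):
    """Token ids of the k highest-scoring entries, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def top_k_hit_rate(slm_logits_per_step, llm_choices, k=8):
    """Fraction of steps where the LLM's chosen token is in the SLM's top-k."""
    if not llm_choices:
        raise ValueError("need at least one step")
    hits = sum(
        choice in top_k_ids(logits, k)
        for logits, choice in zip(slm_logits_per_step, llm_choices)
    )
    return hits / len(llm_choices)


def select_among_candidates(slm_logits, llm_scores, k=8):
    """Selection instead of generation: the LLM only ranks the SLM's top-k."""
    candidates = top_k_ids(slm_logits, k)
    return max(candidates, key=lambda i: llm_scores[i])


if __name__ == "__main__":
    # Two decoding steps over a toy 3-token vocabulary.
    slm_steps = [[0.1, 0.9, 0.5], [0.7, 0.2, 0.1]]
    llm_picks = [2, 1]
    print(top_k_hit_rate(slm_steps, llm_picks, k=2))  # 1.0
    print(top_k_hit_rate(slm_steps, llm_picks, k=1))  # 0.0
    # The LLM's global favorite (id 0) is outside the SLM's top-2,
    # so selection falls back to its best-scored candidate (id 2).
    print(select_among_candidates([0.1, 0.9, 0.5], [0.9, 0.1, 0.8], k=2))  # 2
```

The toy run shows why K matters: with k=1 the SLM's greedy choice never matches the LLM, while k=2 already contains the LLM's pick at every step, which is the pattern the abstract reports at scale (top-8, 95% hit rate).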

cs.CV

[16] Report of the 5th PVUW Challenge: Towards More Diverse Modalities in Pixel-Level Understanding cs.CV

Chang Liu, Henghui Ding, Nikhila Ravi, Yunchao Wei, Shuting He

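TL;DR: This is the technical report of the 5th PVUW Challenge, held at CVPR 2026, summarizing its objectives, datasets, and top-performing methods. The challenge evaluates state-of-the-art models under highly unconstrained conditions in three specialized tracks: the MOSE track for tracking objects in densely cluttered and severely occluded scenes, the MeViS-Text track for localizing targets via motion-focused linguistic expressions, and the newly inaugurated MeViS-Audio track, which pioneers audio-driven object segmentation. By analyzing the multimodal solutions submitted by participants, the report presents the field's latest technical progress and points to future directions.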

Details

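Motivation: The challenge aims to advance pixel-level video understanding under highly unconstrained real-world conditions, and to encourage more complete multimodal scene understanding by introducing new, harder tasks, in particular the audio modality.

Result: The report summarizes the top methods in each track, which represent the state of the art on the MOSE, MeViS-Text, and MeViS-Audio benchmarks, but it gives no specific quantitative results.

Insight: The main novelty is extending the evaluation of pixel-level video understanding to the audio modality (the MeViS-Audio track), an important step toward richer multimodal perception. By releasing new, challenging data, the challenge also continues to push model robustness in complex, dynamic real-world scenes.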

Abstract: This report summarizes the objectives, datasets, and top-performing methodologies of the 2026 Pixel-level Video Understanding in the Wild (PVUW) Challenge, hosted at CVPR 2026, which evaluates state-of-the-art models under highly unconstrained conditions. To provide a comprehensive assessment, the 2026 edition features three specialized tracks: the MOSE track for tracking objects within densely cluttered and severely occluded scenarios; the MeViS-Text track for localizing targets via motion-focused linguistic expressions; and the newly inaugurated MeViS-Audio track, which pioneers acoustic-driven object segmentation. By introducing previously unreleased challenging data and analyzing the cutting-edge, multimodal solutions submitted by participants, this report highlights the community’s latest technical advancements and charts promising future directions for robust video scene comprehension.

Zaid Nasser, Mikhail Iumanov, Tianhao Li, Maxim Popov, Jaafar Mahmoud

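TL;DR: This paper presents RADIO-ViPE, an online semantic SLAM system that enables geometry-aware open-vocabulary grounding in dynamic environments, associating arbitrary natural language queries with localized 3D regions and objects. It operates directly on raw monocular RGB video streams, with no prior camera intrinsics, depth sensors, or pose initialization, and builds a consistent map by tightly coupling multi-modal embeddings from agglomerative foundation models (e.g., RADIO) with geometric scene information.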

Details

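Motivation: Existing methods require calibrated, posed RGB-D input and usually assume static scenes. The paper aims for a more flexible open-vocabulary semantic SLAM that works in real-world dynamic environments.

Result: RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark while remaining competitive with offline open-vocabulary methods that rely on calibrated data and static scene assumptions.

Insight: The novelty lies in tightly coupling multi-modal embeddings from vision and language foundation models with geometric scene information during initialization, optimization, and factor graph connections, and in using adaptive robust kernels to handle moving objects and displaced scene elements. Together these enable robust open-vocabulary semantic mapping from monocular RGB video alone.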

Abstract: We present RADIO-ViPE (Reduce All Domains Into One – Video Pose Engine), an online semantic SLAM system that enables geometry-aware open-vocabulary grounding, associating arbitrary natural language queries with localized 3D regions and objects in dynamic environments. Unlike existing approaches that require calibrated, posed RGB-D input, RADIO-ViPE operates directly on raw monocular RGB video streams, requiring no prior camera intrinsics, depth sensors, or pose initialization. The system tightly couples multi-modal embeddings – spanning vision and language – derived from agglomerative foundation models (e.g., RADIO) with geometric scene information. This coupling takes place in initialization, optimization and factor graph connections to improve the consistency of the map from multiple modalities. The optimization is wrapped within adaptive robust kernels, designed to handle both actively moving objects and agent-displaced scene elements (e.g., furniture rearranged during ego-centric session). Experiments demonstrate that RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark while maintaining competitive performance against offline open-vocabulary methods that rely on calibrated data and static scene assumptions. RADIO-ViPE bridges a critical gap in real-world deployment, enabling robust open-vocabulary semantic grounding for autonomous robotics and unconstrained in-the-wild video streams. Project page: https://be2rlab.github.io/radio_vipe

[18] FruitProM-V2: Robust Probabilistic Maturity Estimation and Detection of Fruits and Vegetables cs.CV | cs.AI | cs.RO

Rahul Harsha Cheppally, Sidharth Rai, Sudan Baral, Benjamin Vail, Ajay Sharda

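TL;DR: This paper proposes FruitProM-V2, a robust probabilistic model for maturity estimation and detection of fruits and vegetables. It models maturity as a latent continuous variable, predicts it probabilistically with a distributional detection head, and converts the distribution into class probabilities through the cumulative distribution function (CDF), addressing the sharp boundaries that conventional multi-class approaches impose between visually similar stages.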

Details

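Motivation: Vision-based maturity estimation is usually formulated as multi-class classification, which imposes sharp boundaries between visually similar stages even though ripening is a continuous biological process, so annotations are inherently uncertain. In an annotation reliability study on a tomato dataset, the authors observe that disagreement concentrates between adjacent maturity stages, which motivates the probabilistic formulation.

Result: Under clean labels, the proposed model performs comparably to a standard detector while representing uncertainty better. When controlled label noise is introduced during training, the probabilistic model is more robust than the baseline, indicating that explicitly modeling maturity uncertainty makes visual maturity estimation more reliable.

Insight: The novelty lies in treating maturity as a continuous latent variable and modeling it probabilistically, with a distributional detection head and a CDF conversion that capture annotation uncertainty. This offers a new way to handle the inherent ambiguity of visual classification, especially between adjacent classes, and improves robustness to noisy labels.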

Abstract: Accurate fruit maturity identification is essential for determining harvest timing, as incorrect assessment directly affects yield and post-harvest quality. Although ripening is a continuous biological process, vision-based maturity estimation is typically formulated as a multi-class classification task, which imposes sharp boundaries between visually similar stages. To examine this limitation, we perform an annotation reliability study with two independent annotators on a held-out tomato dataset and observe disagreement concentrated near adjacent maturity stages. Motivated by this observation, we model maturity as a latent continuous variable and predict it probabilistically using a distributional detection head, converting the distribution into class probabilities through the cumulative distribution function (CDF). The proposed formulation maintains comparable performance to a standard detector under clean labels while better representing uncertainty. Furthermore, when controlled label noise is introduced during training, the probabilistic model demonstrates improved robustness relative to the baseline, indicating that explicitly modeling maturity uncertainty leads to more reliable visual maturity estimation.
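
The CDF conversion described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the latent maturity distribution is Gaussian and that stages are separated by fixed cut points, neither of which the abstract specifies. The function names and threshold values are hypothetical.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def class_probs_from_latent(mu, sigma, thresholds):
    """Convert a latent Gaussian maturity estimate into ordered class probabilities.

    thresholds: increasing cut points (hypothetical values) that split the
    latent axis into len(thresholds) + 1 maturity stages.
    """
    # CDF at each boundary, padded with 0 and 1 for the two open-ended stages.
    cdf_vals = [0.0] + [normal_cdf(t, mu, sigma) for t in thresholds] + [1.0]
    # Probability of a stage = CDF mass between its lower and upper boundary.
    return [hi - lo for lo, hi in zip(cdf_vals[:-1], cdf_vals[1:])]

# A prediction centred on a stage boundary splits its mass between both neighbours.
probs = class_probs_from_latent(mu=0.5, sigma=0.1, thresholds=[0.25, 0.5, 0.75])
```

Under these assumptions, a prediction that sits exactly on a boundary assigns equal probability to the two adjacent stages instead of forcing a hard choice, which is the behaviour the annotation study motivates.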

[19] Sample Selection Using Multi-Task Autoencoders in Federated Learning with Non-IID Data cs.CV | cs.LGPDF

Emre Ardıç, Yakup Genç

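TL;DR: This paper proposes sample selection methods for federated learning with non-IID data. A multi-task autoencoder estimates each sample's contribution through loss and feature analysis, and unsupervised outlier detection filters noisy samples to improve model performance.

Details

Motivation: Federated learning protects data privacy, but redundant, malicious, or abnormal samples often degrade the model and reduce efficiency. The paper aims to address these problems.

Result: The methods are validated on CIFAR10 and MNIST across varying numbers of clients, non-IID distributions, and noise levels up to 40%. Loss-based sample selection yields accuracy gains of up to 7.02% on CIFAR10 with OCSVM and up to 1.83% on MNIST with AT. The federated SVDD loss further improves feature-based sample selection, with gains of up to 0.99% on CIFAR10 with OCSVM.

Insight: The contributions are a multi-task autoencoder for estimating sample contributions, unsupervised outlier detection (OCSVM, IF, and AT) for noise filtering, and a multi-class deep SVDD loss controlled by the central server that strengthens feature-based sample selection. Together they improve the robustness of federated learning under non-IID and noisy conditions.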

Abstract: Federated learning is a machine learning paradigm in which multiple devices collaboratively train a model under the supervision of a central server while ensuring data privacy. However, its performance is often hindered by redundant, malicious, or abnormal samples, leading to model degradation and inefficiency. To overcome these issues, we propose novel sample selection methods for image classification, employing a multitask autoencoder to estimate sample contributions through loss and feature analysis. Our approach incorporates unsupervised outlier detection, using one-class support vector machine (OCSVM), isolation forest (IF), and adaptive loss threshold (AT) methods managed by a central server to filter noisy samples on clients. We also propose a multi-class deep support vector data description (SVDD) loss controlled by a central server to enhance feature-based sample selection. We validate our methods on CIFAR10 and MNIST datasets across varying numbers of clients, non-IID distributions, and noise levels up to 40%. The results show significant accuracy improvements with loss-based sample selection, achieving gains of up to 7.02% on CIFAR10 with OCSVM and 1.83% on MNIST with AT. Additionally, our federated SVDD loss further improves feature-based sample selection, yielding accuracy gains of up to 0.99% on CIFAR10 with OCSVM. These results show the effectiveness of our methods in improving model accuracy across various client counts and noise conditions.

[20] Privacy-Preserving Clothing Classification using Vision Transformer for Thermal Comfort Estimation cs.CV | cs.CRPDF

Tatsuya Chuman, Yousuke Udagawa, Hitoshi Kiya

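TL;DR: This paper proposes a privacy-preserving clothing classification scheme based on the Vision Transformer (ViT) for thermal comfort estimation, intended to support occupant-centric control systems. The method classifies encrypted images with high accuracy and avoids the severe accuracy drop caused by conventional pixel-level privacy-preserving methods.

Details

Motivation: Prior work that uses camera images for HVAC control to optimize thermal comfort has not considered the privacy of occupant images, and conventional privacy-preserving methods cause a significant accuracy drop when applied to image classification.

Result: In experiments on the DeepFashion dataset categorized by clothing insulation, the conventional pixel-based method suffers a severe accuracy drop, whereas the proposed scheme maintains high accuracy on encrypted images, with no degradation from plain images across all categories.

Insight: The novelty lies in applying ViT to privacy-preserving clothing insulation estimation and achieving classification accuracy in the encrypted domain that matches plain images, which offers a practical solution for privacy-sensitive smart environment control.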

Abstract: A privacy-preserving clothing classification scheme is presented to enable secure occupant-centric control (OCC) systems. Although the utilization of camera images for HVAC control has been widely studied to optimize thermal comfort, privacy protection of occupant images has not been considered in prior works. While various privacy-preserving methods have been proposed for image classification, applying conventional schemes results in severe accuracy degradation. In this paper, we introduce a privacy-preserving classification method using Vision Transformer (ViT) applied to clothing insulation estimation. In an experiment using the DeepFashion dataset categorized by clothing insulation, while the conventional pixel-based method suffers a severe accuracy drop, our scheme maintains a high accuracy on encrypted images, showing no degradation from plain images across all categories.

[21] FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing cs.CV | cs.HC | cs.IR | cs.MMPDF

Morayo Danielle Adeyemi, Ryan A. Rossi, Franck Dernoncourt

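TL;DR: FASH-iCNN is a multimodal CNN system that analyzes 87,547 Vogue runway images to make editorial fashion identity (house, era, and color tradition) inspectable and interpretable. Given a photograph of a garment, it identifies the fashion house, the era, and the color tradition, and it shows that texture and luminance are the main visual channels carrying editorial identity.

Details

Motivation: Fashion AI systems routinely encode the aesthetic logic of specific houses, editors, and historical moments without disclosing it. The paper aims to make this cultural logic inspectable.

Result: On a dataset covering 15 fashion houses and 34 years, a clothing-only model reaches 78.2% top-1 accuracy for house identification, 88.6% top-1 for the decade, and 58.3% top-1 for the specific year with a mean error of just 2.2 years. Removing color costs only 10.6 percentage points of house identification accuracy, while removing texture costs 37.6 percentage points.

Insight: The paper treats editorial culture as the signal rather than background noise and makes fashion identity interpretable through multimodal CNN probing. Its novelty is showing that texture and luminance are the primary carriers of editorial identity, which gives a new perspective on interpretability for fashion AI.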

Abstract: Fashion AI systems routinely encode the aesthetic logic of specific houses, editors, and historical moments without disclosing it. We present FASH-iCNN, a multimodal system trained on 87,547 Vogue runway images across 15 fashion houses spanning 1991-2024 that makes this cultural logic inspectable. Given a photograph of a garment, the system recovers which house produced it, which era it belongs to, and which color tradition it reflects. A clothing-only model identifies the fashion house at 78.2% top-1 across 14 houses, the decade at 88.6% top-1, and the specific year at 58.3% top-1 across 34 years with a mean error of just 2.2 years. Probing which visual channels carry this signal reveals a sharp dissociation: removing color costs only 10.6pp of house identity accuracy, while removing texture costs 37.6pp, establishing texture and luminance as the primary carriers of editorial identity. FASH-iCNN treats editorial culture as the signal rather than background noise, identifying which houses, eras, and color traditions shaped each output so that users can see not just what the system predicts but which houses, editors, and historical moments are encoded in that prediction.

[22] ViBE: Visual-to-M/EEG Brain Encoding via Spatio-Temporal VAE and Distribution-Aligned Projection cs.CVPDF

Ganxi Xu, Zhao-Rong Lai, Yuting Tang, Yonghao Song, Shuyan Zhou

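TL;DR: This paper proposes ViBE, a brain encoding framework that generates magnetoencephalography (MEG) and electroencephalography (EEG) signals from visual stimuli. It uses a spatio-temporal convolutional variational autoencoder (TSC-VAE) to reconstruct neural responses, a Q-Former to map CLIP image embeddings into the TSC-VAE latent space as neural proxy embeddings, and a combination of mean squared error loss and sliced Wasserstein distance for cross-modal alignment.

Details

Motivation: Brain encoding models aim to decipher how visual stimuli are transformed into neural responses and are a key step toward visual prostheses for patients with severe vision disorders. The core challenges are faithful reconstruction of neural responses and cross-modal alignment between visual stimuli and neural responses.

Result: Extensive experiments on the THINGS-EEG2 and THINGS-MEG datasets show that the method generates high-quality M/EEG signals from visual stimuli.

Insight: The novelty lies in combining a spatio-temporal VAE for neural signal reconstruction with a Q-Former and distribution alignment (SWD) to bridge the gap between the visual and neural modalities, giving a brain encoding approach that covers both point-wise feature matching and distribution alignment.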

Abstract: Brain encoding models not only serve to decipher how visual stimuli are transformed into neural responses, but also represent a critical step toward visual prostheses that restore vision for patients with severe vision disorders. Brain encoding involves two fundamental steps: achieving faithful reconstruction of neural responses and establishing cross-modal alignment between visual stimuli and neural responses. To this end, we propose ViBE, a novel brain encoding framework for generating magnetoencephalography (MEG) and electroencephalography (EEG) signals from visual stimuli. Specifically, we first design a spatio-temporal convolutional variational autoencoder (TSC-VAE) that captures the spatio-temporal characteristics of M/EEG signals for effective neural response reconstruction. To bridge the modality gap between visual features and neural representations, we employ Q-Former to map CLIP image embeddings to the TSC-VAE latent space, producing neural proxy embeddings. For comprehensive cross-modal alignment, we combine mean squared error (MSE) loss for point-wise feature matching with sliced Wasserstein distance (SWD) for probability distribution alignment between the neural proxy embeddings and TSC-VAE latent embeddings. We conduct extensive experiments on the THINGS-EEG2 and THINGS-MEG datasets, demonstrating the effectiveness of our approach in generating high-quality M/EEG signals from visual stimuli.
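
The combination of point-wise MSE with sliced Wasserstein distance (SWD) can be illustrated with a small, dependency-free sketch. It is not the authors' code: it assumes equal-size batches of embeddings, a Monte Carlo estimate over random unit directions, and a hypothetical weighting parameter `swd_weight`.

```python
import random

def mse(a, b):
    """Mean squared error between two equally sized batches of vectors."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v)) / n

def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def sliced_wasserstein(a, b, num_projections=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance
    between two point sets of equal size."""
    rng = random.Random(seed)
    dim = len(a[0])
    total = 0.0
    for _ in range(num_projections):
        d = random_unit_vector(dim, rng)
        # Project both sets onto the direction; in 1D, optimal transport
        # simply matches sorted values.
        pa = sorted(sum(x * w for x, w in zip(u, d)) for u in a)
        pb = sorted(sum(x * w for x, w in zip(v, d)) for v in b)
        total += sum((x - y) ** 2 for x, y in zip(pa, pb)) / len(pa)
    return total / num_projections

def alignment_loss(proxy, latent, swd_weight=1.0):
    """Point-wise matching (MSE) plus distribution matching (SWD)."""
    return mse(proxy, latent) + swd_weight * sliced_wasserstein(proxy, latent)
```

The two terms measure different things: if the proxy embeddings are a permutation of the latent embeddings, the SWD term is zero because the two distributions coincide, while the MSE term still penalizes the per-sample mismatch.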

[23] HOI-aware Adaptive Network for Weakly-supervised Action Segmentation cs.CVPDF

Runzhong Zhang, Suchen Wang, Yueqi Duan, Yansong Tang, Yue Zhang

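TL;DR: This paper proposes AdaAct, an HOI-aware adaptive network for weakly-supervised action segmentation. It uses video-level human-object interaction (HOI) priors to adjust network parameters dynamically so that similar actions, such as pouring juice and pouring coffee, can be told apart.

Details

Motivation: Existing methods usually learn a fixed network that predicts each frame's action from neighboring frames, which is ambiguous for similar actions. The paper exploits temporally global but spatially local HOI as video-level prior knowledge to resolve this ambiguity.

Result: Extensive experiments on two widely used datasets, Breakfast and 50Salads, show that the method is effective under different evaluation metrics.

Insight: The contributions are a video HOI encoder that extracts, selects, and integrates the most representative HOI across the video, and a two-branch HyperNetwork that learns an adaptive temporal encoder whose parameters adjust to each video's HOI information. The long-term HOI sequence supplies the context needed to distinguish ambiguous actions.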

Abstract: In this paper, we propose an HOI-aware adaptive network named AdaAct for weakly-supervised action segmentation. Most existing methods learn a fixed network to predict the action of each frame with the neighboring frames. However, this would result in ambiguity when estimating similar actions, such as pouring juice and pouring coffee. To address this, we aim to exploit temporally global but spatially local human-object interactions (HOI) as video-level prior knowledge for action segmentation. The long-term HOI sequence provides crucial contextual information to distinguish ambiguous actions, where our network dynamically adapts to the given HOI sequence at test time. More specifically, we first design a video HOI encoder that extracts, selects, and integrates the most representative HOI throughout the video. Then, we propose a two-branch HyperNetwork to learn an adaptive temporal encoder, which automatically adjusts the parameters based on the HOI information of various videos on the fly. Extensive experiments on two widely-used datasets including Breakfast and 50Salads demonstrate the effectiveness of our method under different evaluation metrics.

[24] DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation cs.CV | cs.AIPDF

Junhu Fu, Ke Chen, Weidong Guo, Shuyu Liang, Jie Xu

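TL;DR: DepthPilot is an interpretable framework for colonoscopy video generation. Depth constraints and an adaptive spline denoising module keep the generated videos geometrically and anatomically faithful while capturing complex spatio-temporal dynamics, moving controllable generation toward interpretable generation.

Details

Motivation: Controllable medical video generation still lacks interpretability, meaning that generated content is not sufficiently aligned with physical priors and faithful clinical manifestations. The paper aims to move from mere controllability to interpretability.

Result: Extensive evaluation on three public datasets and in-house clinical data shows that DepthPilot produces physically consistent videos, achieves FID scores below 15 on all benchmarks, and ranks first in clinician assessments, bridging the gap between "visually realistic" and "clinically interpretable".

Insight: The contributions are a prior distribution alignment strategy that injects depth constraints into the diffusion backbone through parameter-efficient fine-tuning to ensure anatomical fidelity, and an adaptive spline denoising module that replaces fixed linear weights with learnable spline functions to strengthen nonlinear modeling under geometric constraints. The generated videos are expected to support reliable 3D reconstruction, facilitate surgical navigation and blind region identification, and lay a foundation for a colorectal world model.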

Abstract: Controllable medical video generation has achieved remarkable progress, but it still lacks interpretability, which requires the alignment of generated contents with physical priors and faithful clinical manifestations. To push the boundaries from mere controllability to interpretability, we propose DepthPilot, the first interpretable framework for colonoscopy video generation. This work takes a step toward trustworthy generation through two synergistic paradigms. To achieve explicit geometric grounding, DepthPilot devises a prior distribution alignment strategy, injecting depth constraints into the diffusion backbone via parameter-efficient fine-tuning to ensure anatomical fidelity. To enhance intrinsic nonlinear modeling under these geometric constraints, DepthPilot employs an adaptive spline denoising module, replacing fixed linear weights with learnable spline functions to capture complex spatio-temporal dynamics. Extensive evaluations across three public datasets and in-house clinical data confirm DepthPilot’s robust ability to produce physically consistent videos. It achieves FID scores below 15 across all benchmarks and ranks first in clinician assessments, bridging the gap between “visually realistic” and “clinically interpretable”. Moreover, DepthPilot-generated videos are expected to enable reliable 3D reconstruction, facilitating surgical navigation and blind region identification, and serve as a foundation toward the colorectal world model.

[25] EnerGS: Energy-Based Gaussian Splatting with Partial Geometric Priors cs.CVPDF

Rui Song, Tianhui Cai, Markus Gross, Yun Zhang, Walter Zimmer

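TL;DR: This paper proposes EnerGS, which models partially observable geometry as a continuous energy field that provides soft geometric guidance for optimizing 3D Gaussian Splatting (3DGS). It targets large-scale outdoor scenes where geometric priors are incomplete and unevenly distributed, and improves both photometric quality and geometric stability.

Details

Motivation: In large-scale outdoor scenes, geometric priors such as LiDAR measurements are often spatially incomplete and uneven, and using them as hard constraints can harm the final reconstruction. The paper seeks a more robust way to use partial geometric priors.

Result: Under both sparse multi-view and monocular settings on large-scale outdoor scenes, EnerGS consistently improves photometric reconstruction quality and geometric stability and effectively mitigates overfitting during 3DGS training.

Insight: The novelty lies in modeling partial geometric observations as a continuous energy field that guides the optimization of Gaussian primitives softly rather than as a hard constraint, so geometric information steers the optimization without directly restricting the solution space. This offers a new way to handle incomplete priors.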

Abstract: 3D Gaussian Splatting (3DGS) has been widely adopted for scene reconstruction, where training inherently constitutes a highly coupled and non-convex optimization problem. Recent works commonly incorporate geometric priors, such as LiDAR measurements, either for initialization or as training constraints, with the goal of improving photometric reconstruction quality. However, in large-scale outdoor scenarios, such geometric supervision is often spatially incomplete and uneven, which limits its effectiveness as a reliable prior and can even be detrimental to the final reconstruction. To address this challenge, we model partially observable geometry as a continuous energy field induced by geometric evidence and propose EnerGS. Rather than enforcing geometry as a hard constraint, EnerGS provides a soft geometric guidance for the optimization of Gaussian primitives, allowing geometric information to steer the optimization process without directly restricting the solution space. Extensive experiments on large-scale outdoor scenes demonstrate that, under both sparse multi-view and monocular settings, EnerGS consistently improves photometric quality and geometric stability, while effectively mitigating overfitting during 3DGS training.

[26] Camera-RFID Fusion for Robust Asset Tracking in Forested Environments cs.CVPDF

John Hateley, Sriram Narasimhan, Omid Abari

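TL;DR: This paper proposes a camera-RFID fusion framework for robust asset tracking in forested environments. By integrating depth and object information with advanced trajectory-matching algorithms, it combines the meter-level accuracy of RFID with the centimeter-level accuracy of cameras to overcome signal attenuation, multipath effects, spatial association ambiguity, and partial occlusion, and it localizes tags reliably even when assets temporarily leave the camera's field of view.

Details

Motivation: Passive RFID tags are low-cost and scalable, but in forested environments signal attenuation and multipath effects limit their spatial accuracy to the meter level. Stereo-vision cameras reach centimeter-level precision, but vision alone cannot resolve spatial association ambiguity and partial occlusion in dense settings. Fusing the two modalities combines the high accuracy of vision with the non-line-of-sight identification of RFID.

Result: By fusing camera and RFID data, the proposed method bridges the meter-to-centimeter accuracy gap and achieves reliable tag localization. To the best of the authors' knowledge, this is the first application of camera-RFID fusion to asset tracking in natural forested environments.

Insight: The core contribution is a fusion framework that integrates depth and object information with advanced trajectory-matching algorithms to solve the key challenge of accurately associating the disparate trajectories produced by the two sensors, enabling robust multi-modal asset tracking in complex natural environments.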

Abstract: Passive RFID tags offer a cost-effective and scalable solution for tracking numerous deployed assets. However, in forested environments, signal attenuation and multipath effects generally limit RFID spatial accuracy to the meter level. Conversely, while cameras employing stereo vision can achieve centimeter-level precision, relying solely on computer vision fails to resolve issues arising from spatial association ambiguity and partial occlusions in dense settings. Fusing these modalities allows systems to harness the high-accuracy benefits of vision while retaining the robust, non-line-of-sight identification advantages of RFID. Yet, a primary challenge in achieving this, which is the central focus of this paper, lies in accurately associating the disparate trajectories generated by these two sensors. To overcome this limitation, we introduce a novel camera–RFID fusion framework that integrates depth and object information with advanced trajectory-matching algorithms. By successfully bridging the meter-to-centimeter accuracy gap, the proposed approach helps achieve reliable tag localization even when assets temporarily leave the camera’s field of view. To the best of our knowledge, this represents the first application of camera–RFID fusion for asset tracking in natural forested environments.

[27] MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution cs.CV | cs.AIPDF

Jiaqi Guo, Mingzhen Li, Haohong Wang, Aggelos K. Katsaggelos

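TL;DR: This paper proposes MetaSR, a generative super-resolution framework for real-world scenarios where content and degradation types vary. Built on a Diffusion Transformer (DiT), it adaptively selects and injects task-relevant metadata to guide super-resolution under resource constraints, and an efficient distillation strategy enables one-step diffusion inference. Across diverse content categories and degradations, it outperforms reference solutions by up to 1.0 dB PSNR and saves up to 50% of transmission bitrate at matched quality.

Details

Motivation: Real-world super-resolution faces content and degradations (for example text overlays, fast motion, smooth cartoons, and low-light faces) that vary across domains, genres, and segments, and the useful side information (metadata) depends on the content. Existing metadata-guided methods typically use a fixed conditioning design, which is suboptimal when useful cues are content dependent and transmission budgets are limited.

Result: Experiments across diverse content buckets and degradation regimes show that MetaSR outperforms reference solutions by up to 1.0 dB PSNR and saves up to 50% of transmission bitrate at matched quality. These gains are assessed under a rate-distortion optimization (RDO) framework that jointly accounts for sender-side bitrate and receiver/display quality metrics such as PSNR and SSIM.

Insight: The novelty is a content-adaptive metadata orchestration mechanism that uses the DiT's own VAE and transformer backbone to fuse heterogeneous metadata, together with an efficient distillation strategy for one-step inference. In essence, metadata selection and injection become dynamic and content-adaptive and are optimized jointly with transmission efficiency (bitrate), which gives a flexible framework for generative super-resolution under resource constraints.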

Abstract: We study generative super-resolution (SR) in real-world scenarios where content and degradations vary across domains, genres, and segments. For example, images and videos may alternate between text overlays, fast motion, smooth cartoons, and low-light faces, each benefiting from different forms of side information. Existing metadata-guided SR methods typically use a fixed conditioning design, which is suboptimal when useful cues are content dependent and transmission budgets are limited. We propose MetaSR, a Diffusion Transformer (DiT)-based framework that selects and injects task-relevant metadata to guide SR under resource constraints. Specifically, we use the DiT’s own VAE and transformer backbone to fuse heterogeneous metadata, and adopt an efficient distillation strategy that enables one-step diffusion inference. Experiments across diverse content buckets and degradation regimes show that MetaSR outperforms reference solutions by up to 1.0 dB PSNR while achieving up to 50% transmission bitrate saving at matched quality. We assess these gains under a rate–distortion optimization (RDO) framework that jointly accounts for sender-side bitrate and receiver/display quality metrics (e.g., PSNR and SSIM).
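
The rate-distortion trade-off mentioned in this abstract is commonly expressed as a Lagrangian cost J = D + λR. The sketch below shows that generic form applied to choosing one metadata option per segment. It is an illustration only: the abstract does not describe MetaSR's actual selection rule, and the option names, rates, and distortion values are hypothetical.

```python
def select_metadata(options, lam):
    """Pick the option that minimizes the Lagrangian cost J = D + lam * R.

    options: list of (name, rate_bits, distortion) tuples (hypothetical values).
    lam: Lagrange multiplier; larger values penalize bitrate more strongly.
    """
    return min(options, key=lambda o: o[2] + lam * o[1])

# Hypothetical side-information choices for one video segment.
options = [
    ("none", 0, 10.0),        # send nothing: zero rate, highest distortion
    ("edges", 100, 6.0),      # cheap cue with a moderate quality gain
    ("text_mask", 400, 5.0),  # expensive cue with the best quality
]

unconstrained = select_metadata(options, lam=0.0)[0]
balanced = select_metadata(options, lam=0.02)[0]
tight_budget = select_metadata(options, lam=0.1)[0]
```

With these numbers, the selection moves from the most expensive cue to the cheap cue and finally to sending nothing as λ grows, which is the content- and budget-dependent behaviour a fixed conditioning design cannot express.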

[28] Beyond Shortcuts: Mitigating Visual Illusions in Frozen VLMs via Qualitative Reasoning cs.CVPDF

Hao Guo, Fei Wang, Junjie Chen, Yiqi Nie, Jiaqi Zhao

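TL;DR: This paper proposes Structured Qualitative Inference (SQI), a training-free, data-centric framework that strengthens visual grounding in frozen vision-language models (VLMs) to address their perceptual brittleness on optical illusions. Three modules (Axiomatic Constraint Injection, Hierarchical Scene Decomposition, and Counterfactual Self-Verification) orchestrate qualitative constraints at inference time to align high-level linguistic reasoning with low-level visual perception.

Details

Motivation: Although VLMs achieve state-of-the-art performance on general visual tasks, their perceptual robustness remains brittle on optical illusions. These failures are often attributed to shortcut heuristics, where models prioritize linguistic priors and memorized prototypes over direct visual evidence.

Result: SQI ranked 2nd overall in the DataCV 2026 Challenge (Task I: Classic Illusion Understanding). Experiments show that it significantly improves accuracy across diverse illusion categories and provides superior diagnostic interpretability without any model fine-tuning.

Insight: The novelty is a training-free structured qualitative reasoning framework whose three modules systematically mitigate visual illusions in VLMs. The work highlights structured qualitative grounding as a robust paradigm for building next-generation, illusion-resistant vision-language systems.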

Abstract: While Vision-Language Models (VLMs) have achieved state-of-the-art performance in general visual tasks, their perceptual robustness remains remarkably brittle when confronted with optical illusions. These failures are often attributed to shortcut heuristics, where models prioritize linguistic priors and memorized prototypes over direct visual evidence. In this work, we propose Structured Qualitative Inference (SQI), a training-free, data-centric framework designed to fortify visual grounding in frozen VLMs. SQI addresses perceptual anomalies through three systematic modules: (1) Axiomatic Constraint Injection, which suppresses erroneous metric estimations and quantitative hallucinations; (2) Hierarchical Scene Decomposition, which decouples target visual manifolds from complex background distractors; and (3) Counterfactual Self-Verification, an adversarial reasoning step that mitigates confirmation bias. By orchestrating these qualitative constraints at inference time, SQI effectively aligns high-level linguistic reasoning with low-level visual perception. Our framework was evaluated on the DataCV 2026 Challenge (Task I: Classic Illusion Understanding), where it ranked 2nd place overall. Experimental results demonstrate that SQI not only significantly enhances accuracy across diverse illusion categories but also provides superior diagnostic interpretability without any model fine-tuning. Our success underscores the potential of structured qualitative grounding as a robust paradigm for developing next-generation, illusion-resistant vision-language systems.

Liliang Ye, Guiyi Zeng, Yunyao Zhang, Yi-Ping Phoebe Chen, Junqing Yu

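TL;DR: This paper proposes OmniTrend, a framework for social media popularity prediction that models content attractiveness and contextual exposure separately and combines them for the final prediction, improving interpretability and cross-platform generalization.

Details

Motivation: Existing methods focus on content signals but do not separate them from exposure-related patterns, so the learned representations absorb platform-specific visibility effects, which weakens interpretability and cross-platform transfer.

Result: OmniTrend transfers robustly across image and video platforms and provides explicit, separate predictions of content attractiveness and contextual exposure.

Insight: The novelty lies in decomposing popularity prediction into two separately modeled factors that are combined in the final estimate: content attractiveness (from visual, audio, and textual cues) and contextual exposure (from posting time, author activity, topical trends, and retrieval-based neighborhood statistics). This improves interpretability and cross-platform generalization.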

Abstract: Predicting social media popularity requires understanding both the intrinsic appeal of content and the external context that determines how it is exposed to users. Existing methods focus on content signals but do not separate them from exposure-related patterns, which causes the learned representations to absorb platform-specific visibility effects and weakens both interpretability and cross-platform transfer. This paper introduces OmniTrend, a unified framework that models popularity as the joint outcome of content attractiveness and contextual exposure. The content module learns cross-modal representations from visual, audio, and textual cues to quantify intrinsic appeal, while the context module estimates exposure from exogenous signals such as posting time, author activity, topical trends, and retrieval-based neighborhood statistics. OmniTrend learns separate predictors for content attractiveness and contextual exposure and integrates them in the final popularity estimate, which makes the role of each factor explicit and supports robust transfer across image and video platforms.

[30] GaitKD: A Universal Decoupled Distillation Framework for Efficient Gait Recognition cs.CVPDF

Yuqi Li, Qian Zhou, Huiran Duan, Jingjie Wang, Shunli Zhang

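TL;DR: This paper proposes GaitKD, a universal decoupled distillation framework that transfers knowledge from high-performing but computationally expensive gait recognition teachers to lightweight students. It decouples knowledge transfer into two complementary components, decision-level distillation and boundary-level distillation, which handle the classification logits and the embedding-space structure respectively. It supports heterogeneous models and adds no inference cost.

Details

Motivation: High-performing gait recognition models often rely on deep, computationally expensive architectures that are hard to deploy. Standard knowledge distillation is less effective for part-structured gait models, which depend on both part-wise classification logits and part-wise retrieval embeddings, so a more effective distillation framework is needed.

Result: Across multiple gait recognition benchmarks and teacher-student configurations, GaitKD consistently improves over strong baselines.

Insight: The novelty lies in decoupling gait knowledge transfer into two complementary parts: transferring decision relations through part-calibrated logit distillation, and preserving embedding-space structure through an activation-boundary objective instead of direct feature regression. This decoupled design transfers knowledge from part-structured models more stably, boundary-preserving distillation is more robust than direct feature regression, and the framework supports heterogeneous models with no extra inference overhead.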

Abstract: Gait recognition is an attractive biometric modality for long-range and contact-free identification, but high-performing gait models often rely on deep and computationally expensive architectures that are difficult to deploy in practice. Knowledge distillation (KD) offers a natural way to transfer knowledge from a powerful teacher to an efficient student; however, standard KD is often less effective for part-structured gait models, where supervision is formed from both part-wise classification logits and part-wise retrieval embeddings. In this paper, we propose GaitKD, a distillation framework that decouples gait knowledge transfer into two complementary components: decision-level distillation and boundary-level distillation. Specifically, GaitKD aligns the teacher and student through part-calibrated logit distillation to transfer inter-class decision relations, while preserving the teacher-induced partitioning of the embedding space through an activation-boundary objective instead of direct feature regression. With a simple aligned part-wise design, GaitKD supports heterogeneous teacher-student gait models without introducing additional inference cost. Experimental results across multiple gait recognition benchmarks and teacher-student configurations show consistent improvements over strong gait baselines. Our study demonstrates that the two transfer components are complementary, and boundary-preserving distillation provides more stable performance than direct feature regression. Source code is available at https://github.com/liyiersan/GaitKD/
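
To make the decision-level component concrete, the sketch below applies standard temperature-scaled logit distillation independently to each body part and averages the result. It is an assumption-laden illustration: the abstract does not define how logits are "part-calibrated", so this uses the classic KL-based distillation loss as a stand-in, and the function names and temperature value are hypothetical.

```python
import math

def softmax(logits, temperature):
    """Temperature-softened softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    as in classic logit distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

def partwise_kd_loss(teacher_parts, student_parts, temperature=4.0):
    """Average the distillation loss over parts; each part has its own logit vector."""
    losses = [kd_loss(t, s, temperature) for t, s in zip(teacher_parts, student_parts)]
    return sum(losses) / len(losses)
```

Because the loss only needs aligned part-wise logits, the teacher and student backbones can differ, and nothing in it is used at inference time.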

[31] Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding cs.CVPDF

Yufei Yin, Jie Zheng, Qianke Meng, Zhou Yu, Minghao Chen

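TL;DR: This paper proposes MCM-VG, a framework for zero-shot 3D visual grounding (3DVG) that addresses two bottlenecks: the poor quality of open-vocabulary 3D proposals (inaccurate categories and imprecise geometry) and the spatial redundancy of exhaustive multi-view reasoning. It explicitly establishes multiple consistent 2D-3D mappings through three modules (Semantic Alignment, Instance Rectification, and Viewpoint Distillation) and formulates the final target disambiguation as a multiple-choice reasoning task for a vision-language model.

Details

Motivation: Existing zero-shot 3DVG methods are limited by low-quality open-vocabulary 3D proposals (wrong categories, imprecise geometry) and by the spatial redundancy of multi-view reasoning. The paper addresses both by establishing 2D-3D consistency across several dimensions, aiming for precise target localization and reliable reasoning.

Result: Extensive evaluation on the ScanRefer and Nr3D benchmarks shows that MCM-VG sets a new state of the art for zero-shot 3D visual grounding. On ScanRefer it reaches 62.0% Acc@0.25 and 53.6% Acc@0.5, outperforming previous baselines by 6.4% and 4.0% respectively.

Insight: The novelty lies in actively establishing 2D-3D consistency along three dimensions (semantics, instances, and viewpoints) instead of passively relying on noisy 3D segments. Semantic errors are corrected with LLM-driven query parsing and coarse-to-fine 2D-3D matching; missing targets are reconstructed from VLM-guided 2D segmentations and back-projected to obtain accurate 3D geometry; and camera directions are clustered to extract optimal frames, which are paired with Bird's Eye View maps into concise visual prompt sets so that the final disambiguation becomes a multiple-choice reasoning task for the VLM. This reduces redundancy and improves robustness.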

Abstract: Zero-shot 3D Visual Grounding (3DVG) is a critical capability for open-world embodied AI. However, existing methods are fundamentally bottlenecked by the poor quality of open-vocabulary 3D proposals, suffering from inaccurate categories and imprecise geometries, as well as the spatial redundancy of exhaustive multi-view reasoning. To address these challenges, we propose MCM-VG, a novel framework that achieves robust zero-shot 3DVG by explicitly establishing Multiple Consistent 2D-3D Mappings. Instead of passively relying on noisy 3D segments, MCM-VG enforces 2D-3D consistency across three fundamental dimensions to achieve precise target localization and reliable reasoning. First, a Semantic Alignment module corrects category mismatches via LLM-driven query parsing and coarse-to-fine 2D-3D matching. Second, an Instance Rectification module leverages VLM-guided 2D segmentations to reconstruct missing targets, back-projecting these reliable visual priors to establish accurate 3D geometries. Finally, to eliminate spatial redundancy, a Viewpoint Distillation module clusters 3D camera directions to extract optimal frames. By pairing these optimal RGB frames with Bird’s Eye View maps into concise visual prompt sets, we formulate the final target disambiguation as a multiple-choice reasoning task for Vision-Language Models. Extensive evaluations on ScanRefer and Nr3D benchmarks demonstrate that MCM-VG sets a new state-of-the-art for zero-shot 3D visual grounding. Remarkably, it achieves 62.0% and 53.6% in Acc@0.25 and Acc@0.5 on ScanRefer, outperforming previous baselines by substantial margins of 6.4% and 4.0%.
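
The Acc@0.25 and Acc@0.5 figures report the fraction of queries whose predicted 3D box overlaps the ground-truth box with an IoU at or above the threshold. The sketch below computes that metric under two stated assumptions: boxes are axis-aligned and given as (min_corner, max_corner), and a prediction at exactly the threshold counts as a hit (benchmark toolkits may use a strict comparison instead). The example boxes are hypothetical.

```python
def box_volume(box):
    """Volume of an axis-aligned box given as (min_corner, max_corner)."""
    (x0, y0, z0), (x1, y1, z1) = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0) * max(0.0, z1 - z0)

def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    lo = tuple(max(p, q) for p, q in zip(a[0], b[0]))
    hi = tuple(min(p, q) for p, q in zip(a[1], b[1]))
    inter = box_volume((lo, hi))  # zero when the boxes do not overlap
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(preds, gts, threshold):
    """Fraction of predictions whose IoU with the ground truth reaches the threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if iou_3d(p, g) >= threshold)
    return hits / len(gts)

gt = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
preds = [
    ((0.0, 0.0, 0.0), (2.0, 2.0, 1.5)),  # IoU 0.75: hit at both thresholds
    ((1.0, 0.0, 0.0), (3.0, 2.0, 2.0)),  # IoU 1/3: hit only at 0.25
    ((5.0, 5.0, 5.0), (6.0, 6.0, 6.0)),  # no overlap: miss
]
acc_025 = accuracy_at_iou(preds, [gt] * 3, 0.25)
acc_05 = accuracy_at_iou(preds, [gt] * 3, 0.5)
```

The stricter threshold always yields an accuracy no higher than the looser one, which matches the ordering of the two numbers reported for ScanRefer.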

[32] Semantic Foam: Unifying Spatial and Semantic Scene Decomposition cs.CVPDF

Amr Sharafeldin, Shrisudhan Govindarajan, Thomas Walker, Aryan Mikaeili, Daniel Rebain

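TL;DR: This paper proposes Semantic Foam, which extends the volumetric Voronoi mesh representation of Radiant Foam with an explicit semantic feature field defined at the cell level. It unifies spatial and semantic scene decomposition and aims to improve the segmentation quality and cross-view consistency achieved on modern scene reconstruction methods such as 3D Gaussian Splatting.

Details

Motivation: Modern scene reconstruction methods such as 3D Gaussian Splatting are harder to interact with in interactive graphics applications than traditional 3D assets, and semantic decomposition of these representations still struggles with segmentation quality and cross-view consistency.

Result: Experiments show that the method outperforms state-of-the-art approaches such as Gaussian Grouping and SAGA in object-level segmentation.

Insight: The novelty lies in using the spatial structure of the volumetric Voronoi mesh for direct spatial regularization, combined with a cell-level semantic feature field. This mitigates the artifacts caused by occlusion and inconsistent supervision that are common in point-based representations and makes semantic decomposition more robust.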

Abstract: Modern scene reconstruction methods, such as 3D Gaussian Splatting, enable photo-realistic novel view synthesis at real-time speeds. However, their adoption in interactive graphics applications remains limited due to the difficulty of interacting with these representations compared to traditional, human-authored 3D assets. While prior work has attempted to impose semantic decomposition on these models, significant challenges remain in segmentation quality and cross-view consistency. To address these limitations, we introduce Semantic Foam, which extends the recently proposed Radiant Foam representation to semantic decomposition tasks. Our approach leverages the inherent spatial structure of Radiant Foam’s volumetric Voronoi mesh and augments it with an explicit semantic feature field defined at the cell level. This design enables direct spatial regularization, improving consistency across views and mitigating artifacts caused by occlusion and inconsistent supervision, which are common issues in point-based representations. Experimental results demonstrate that our method achieves superior object-level segmentation performance compared to state-of-the-art approaches such as Gaussian Grouping and SAGA. Project page: http://semanticfoam.github.io/

[33] MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution cs.CV | cs.AIPDF

Chunzheng Zhu, Jiaqi Zeng, Junyu Jiang, Jianxin Lin, Yijun Wang

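TL;DR: This paper proposes MedSynapse-V, a framework that simulates how clinicians dynamically invoke implicit diagnostic memory while interpreting images. It addresses the cognitive misalignment in medical vision-language models caused by discrete tokenization, which leads to quantization loss, long-range information dissipation, and missing case-adaptive expertise. Its three core components (Meta Query for Prior Memorization, Causal Counterfactual Refinement, and Intrinsic Memory Transition) internalize external expertise into model parameters and significantly improve diagnostic accuracy.

Details

Motivation: The discrete tokenization used by current medical vision-language models is fundamentally misaligned with the continuous, dynamic way clinicians invoke implicit diagnostic memory. The paper aims to bridge this gap between visual perception and clinical intuition.

Result: Comprehensive evaluation across multiple datasets shows that the method significantly outperforms existing state-of-the-art methods, particularly chain-of-thought paradigms, in diagnostic accuracy.

Insight: The novelty is a latent diagnostic memory evolution framework that simulates clinical cognition. Through a meta-query mechanism, causal counterfactual reward refinement based on region-level feature masking, and a privileged-autonomous dual-branch Intrinsic Memory Transition, it systematically internalizes external expertise into the model's endogenous parameters and aligns the model more closely with diagnostic logic.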

Abstract: High-precision medical diagnosis relies not only on static imaging features but also on the implicit diagnostic memory experts instantly invoke during image interpretation. We pinpoint a fundamental cognitive misalignment in medical VLMs caused by discrete tokenization, leading to quantization loss, long-range information dissipation, and missing case-adaptive expertise. To bridge this gap, we propose MedSynapse-V, a framework for latent diagnostic memory evolution that simulates the experiential invocation of clinicians by dynamically synthesizing implicit diagnostic memories within the model’s hidden stream. Specifically, it begins with a Meta Query for Prior Memorization mechanism, where learnable probes retrieve structured priors from an anatomical prior encoder to generate condensed implicit memories. To ensure clinical fidelity, we introduce Causal Counterfactual Refinement (CCR), which leverages reinforcement learning and counterfactual rewards derived from region-level feature masking to quantify the causal contribution of each memory, thereby pruning redundancies and aligning latent representations with diagnostic logic. This evolutionary process culminates in Intrinsic Memory Transition (IMT), a privileged-autonomous dual-branch paradigm that internalizes teacher-branch diagnostic patterns into the student-branch via full-vocabulary divergence alignment. Comprehensive empirical evaluations across multiple datasets demonstrate that MedSynapse-V, by transferring external expertise into endogenous parameters, significantly outperforms existing state-of-the-art methods, particularly chain-of-thought paradigms, in diagnostic accuracy.

[34] CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation cs.CV | cs.AIPDF

Sonali Sharma, Jin Long, George Shih, Sarah Eid, Christian Bluethgen

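TL;DR: This paper introduces CheXthought, a global multimodal dataset containing 103,592 chain-of-thought reasoning traces and 6,609,082 synchronized visual attention annotations on 50,312 chest X-rays, collected from 501 radiologists in 71 countries. The dataset captures the cognitive processes and visual attention behind clinical reasoning, not only image-report pairs. The study shows its clinical utility for improving factual accuracy and spatial grounding, reducing hallucinations, and strengthening pathology classification, visual faithfulness, temporal reasoning, and uncertainty communication.

Details

Motivation: Current vision-language models for chest X-ray interpretation are trained mainly on image-report pairs and do not model the cognitive processes and visual attention behind clinical reasoning, which limits their transparency and interpretability.

Result: CheXthought reasoning significantly outperforms state-of-the-art vision-language model chain-of-thought in factual accuracy and spatial grounding. Using visual attention as an inference-time hint recovers missed findings and significantly reduces hallucinations. Models trained on CheXthought achieve significantly stronger pathology classification, visual faithfulness, temporal reasoning, and uncertainty communication. Its multi-reader annotations make it possible to predict human-human and human-AI disagreement directly from an image.

Insight: The novelty is a large-scale, global dataset of clinical chain-of-thought reasoning and visual attention that turns cognitive processes into data and provides a key resource for more transparent, interpretable multimodal clinical reasoning models. By annotating reasoning and attention in sync, it gives models richer supervision, which helps address hallucination and interpretability problems in medical imaging AI.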

Abstract: Chest X-ray interpretation is one of the most frequently performed diagnostic tasks in medicine and a primary target for AI development, yet current vision–language models are primarily trained on datasets of paired images and reports, not the cognitive processes and visual attention that underlie clinical reasoning. Here, we present CheXthought, a global, multimodal resource containing 103,592 chain-of-thought reasoning traces and 6,609,082 synchronized visual attention annotations across 50,312 multi-read chest X-rays from 501 radiologists in 71 countries. Our analysis reveals clinical reasoning patterns in how experts deploy distinct visual search strategies, integrate clinical context, and communicate uncertainty. We demonstrate the clinical utility of CheXthought across four dimensions. First, CheXthought reasoning significantly outperforms state-of-the-art vision–language model chain-of-thought in factual accuracy and spatial grounding. Second, visual attention data used as an inference-time hint recovers missed findings and significantly reduces hallucinations. Third, models trained on CheXthought data achieve significantly stronger pathology classification, visual faithfulness, temporal reasoning and uncertainty communication. Fourth, leveraging CheXthought’s multi-reader annotations, we predict both human–human and human–AI disagreement directly from an image, enabling transparent communication of case difficulty, uncertainty and model reliability. These findings establish CheXthought as a resource for advancing multimodal clinical reasoning and the development of more transparent, interpretable vision–language models.

[35] Motion-Driven Multi-Object Tracking of Model Organisms in Space Science Experiments cs.CVPDF

Jianing You, Han Wang, Kang Liu, Jiale Ding, Fengjie Chu

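TL;DR: This paper addresses multi-object tracking of model organisms in space science experiments. It constructs the SpaceAnimal-MOT dataset to characterize motion complexity and long-term identity preservation challenges in biological videos acquired under microgravity, and proposes ART-Track, a motion-driven tracking framework. The framework uses multi-model motion estimation to handle abrupt maneuvers and nonlinear motion, motion-state-driven association to reduce identity switches under dense interactions and temporary mismatch, and uncertainty-adaptive fusion to dynamically balance spatial and motion cues.

Details

Motivation: To address the challenges of multi-animal tracking in space science experimental videos, including weak appearance cues, low imaging quality, complex maneuvering behaviors, and frequent interactions, so as to obtain long-term, interpretable individual trajectories that underpin automated animal behavior analysis.

Result: On zebrafish and fruit fly sequences, ART-Track significantly reduces identity switches and maintains more stable association under occlusion, deformation, and high-density interactions, providing a more reliable tracking foundation for downstream quantitative behavior analysis.

Insight: The contributions include a dedicated dataset that characterizes tracking challenges in microgravity and a framework combining multi-model motion estimation, motion-state-driven association, and uncertainty-adaptive fusion. By centering on motion cues, it copes with weak appearance and complex interactions and improves tracking robustness and identity consistency.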

Abstract: Automated animal behavior analysis relies on long-term, interpretable individual trajectories; however, multi-animal tracking in space science experimental videos remains highly challenging due to weak appearance cues, low-quality imaging, complex maneuvering behaviors, and frequent interactions. To address this problem, we first construct the SpaceAnimal-MOT dataset to characterize the motion complexity and long-term identity preservation challenges in biological videos acquired under microgravity conditions. We then propose ART-Track (Adaptive Robust Tracking), a motion-driven tracking framework tailored to this setting. Specifically, multi-model motion estimation is introduced to handle abrupt maneuvers and nonlinear motion, motion-state-driven association is designed to reduce identity switches under dense interactions and temporary mismatch, and uncertainty-adaptive fusion is used to dynamically balance spatial and motion cues when prediction reliability varies. Experimental results show that ART-Track significantly reduces identity switches on zebrafish and fruitfly sequences, while maintaining more stable association under occlusion, deformation, and high-density interactions, thereby providing a more reliable tracking foundation for downstream quantitative behavior analysis. The code is publicly available at https://github.com/yyy7777777/ART_TRACK/tree/main.
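
As a rough illustration of the uncertainty-adaptive fusion described in the abstract, the sketch below blends a spatial cost with a motion cost and trusts the motion term less as its uncertainty grows. The function name `fused_cost`, the `scale` parameter, and the weighting rule itself are assumptions made for illustration; the abstract does not state ART-Track's actual formula.

```python
def fused_cost(spatial_cost, motion_cost, motion_uncertainty, scale=1.0):
    """Blend a spatial cost (e.g. 1 - IoU) with a motion cost.

    Assumption: the motion weight is 1 / (1 + uncertainty / scale), so a
    perfectly certain motion prediction gets full weight and an unreliable
    one defers to the spatial cue. This rule is illustrative only.
    """
    if motion_uncertainty < 0:
        raise ValueError("motion_uncertainty must be non-negative")
    motion_weight = 1.0 / (1.0 + motion_uncertainty / scale)
    return (1.0 - motion_weight) * spatial_cost + motion_weight * motion_cost
```

With zero uncertainty the result equals the motion cost; as uncertainty grows, it approaches the spatial cost.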

[36] SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness cs.CVPDF

Haiyi Qiu, Kaihang Pan, Jiacheng Li, Juncheng Li, Siliang Tang

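TL;DR: This paper proposes SpatialFusion, a framework that gives unified image generation models intrinsic 3D geometric awareness. It uses a Mixture-of-Transformers architecture to strengthen the 3D geometric modeling ability of a multimodal large language model (MLLM), generates metric depth maps as geometric scaffolds, and injects them into the diffusion backbone through a depth adapter, providing precise constraints for spatially-aware tasks.

Details

Motivation: Existing unified image generation models use MLLMs for semantic understanding and diffusion models for image generation, but they lack intrinsic spatial understanding and explicit geometric guidance, which fundamentally limits them on spatially-aware tasks.

Result: Performance on spatially-aware benchmarks improves significantly, clearly outperforming leading models such as GPT-4o; generalized gains are achieved in both text-to-image generation and image editing, with negligible inference overhead.

Insight: The novelty lies in learning to derive metric depth maps from rich semantic context through a Mixture-of-Transformers architecture with shared self-attention, then injecting them into the diffusion model as explicit geometric scaffolds. This spatially constrains generation and improves spatial coherence and geometric awareness.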

Abstract: Recent unified image generation models have achieved remarkable success by employing MLLMs for semantic understanding and diffusion backbones for image generation. However, these models remain fundamentally limited in spatially-aware tasks due to a lack of intrinsic spatial understanding and the absence of explicit geometric guidance during generation. In this paper, we propose SpatialFusion, a novel framework that internalizes 3D geometric awareness into unified image generation models. Specifically, we first employ a Mixture-of-Transformers (MoT) architecture to augment the MLLM with a parallel spatial transformer to enhance 3D geometric modeling capability. By sharing self-attention with the MLLM, the spatial transformer learns to derive metric-depth maps of target images from rich semantic contexts. These explicit geometric scaffolds are then injected into the diffusion backbone through a specialized depth adapter, providing precise spatial constraints for spatially-coherent image generation. Through a progressive two-stage training strategy, SpatialFusion significantly enhances performance on spatially-aware benchmarks, notably outperforming leading models such as GPT-4o. Additionally, it achieves generalized performance gains across both text-to-image generation and image editing scenarios, all while maintaining negligible inference overhead.

[37] Which Face and Whose Identity? Solving the Dual Challenge of Deepfake Proactive Forensics in Multi-Face Scenarios cs.CVPDF

Lei Zhang, Zhiqing Guo, Dan Ma, Gaobo Yang

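TL;DR: This paper proposes the Deep Attributable Watermarking Framework (DAWF) for proactive deepfake forensics in complex multi-person scenes. The framework adopts a multi-face encoder-decoder architecture and, through a selective regional supervision loss and embedded identity payloads, localizes forged facial regions and traces the associated identity.

Details

Motivation: Existing proactive forensics schemes rely mainly on a single-face setting and struggle with deepfake localization and source tracing in complex multi-person interaction scenes such as group photos and multi-person meetings.

Result: Extensive experiments on challenging multi-face datasets show that DAWF achieves excellent deepfake localization and traceability in complex multi-person scenes.

Insight: The novelty lies in the multi-face encoder-decoder architecture and the selective regional supervision loss. These avoid the cumbersome offline pre-processing of traditional forensics and enable in-network parallel watermark embedding and cross-face collaborative processing, addressing the dual challenge of "which face was forged" and "who was forged".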

Abstract: Unlike single-face forgeries, deepfakes in complex multi-person interaction scenarios (such as group photos and multi-person meetings) more closely reflect real-world threats. Although existing proactive forensics solutions demonstrate good performance, they heavily rely on a “single-face” setting, making it difficult to effectively address the problems of deepfake localization and source tracing in complex multi-person environments. To address this challenge, we propose the Deep Attributable Watermarking Framework (DAWF). This framework adopts a novel multi-face encoder-decoder architecture that bypasses the cumbersome offline pre-processing steps of traditional forensics, facilitating efficient in-network parallel watermark embedding and cross-face collaborative processing. Crucially, we propose a selective regional supervision loss. This innovative mechanism guides the decoder to focus exclusively on the facial regions tampered with by deepfakes. Leveraging this mechanism alongside the embedded identity payloads, DAWF realizes the “which + who” goal, answering the dual questions of which facial region was forged and who was forged. Extensive experiments on challenging multi-face datasets show that DAWF achieves excellent deepfake localization and traceability in complex multi-person scenes.
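
One simple way to picture a loss that supervises only tampered regions is a masked error term. The sketch below computes a mean squared error restricted to pixels flagged by a binary mask. The name `selective_region_loss`, the squared-error form, and the nested-list image layout are assumptions for illustration; the abstract does not give DAWF's actual loss.

```python
def selective_region_loss(pred, target, mask):
    """Mean squared error over pixels where mask is 1 (assumed tampered).

    pred, target, and mask are 2D lists of equal shape. Pixels with mask 0
    contribute nothing, so the decoder is only penalized inside the
    flagged regions. Illustrative form, not the paper's definition.
    """
    total, count = 0.0, 0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0
```

Errors outside the mask are ignored entirely, which is the behavior the phrase "focus exclusively on the facial regions tampered with" suggests.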

[38] ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance cs.CV | cs.AIPDF

Yang Yang, Feifan Meng, Han Fang, Weiming Zhang

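TL;DR: This paper proposes ACPO, an anchor-constrained perceptual optimization framework that addresses the lack of no-reference perceptual quality guidance in diffusion model training. By introducing anchor-based regularization, the method uses a no-reference image quality assessment model to improve perceptual quality while keeping noise predictions consistent with the base diffusion model, balancing perceptual quality and generative fidelity.

Details

Motivation: Diffusion model training relies mainly on full-reference objectives that emphasize pixel-wise similarity, which may fall short in subjective perceptual quality and text-image semantic consistency. The paper aims to incorporate no-reference perceptual quality into diffusion training and to resolve the training instability and distributional drift caused by directly optimizing perceptual signals.

Result: Extensive experiments show that the method consistently improves perceptual quality while preserving generation diversity and training stability.

Insight: The novelty is an anchor-constrained optimization framework that regularizes noise predictions toward the base model, balancing perceptual quality gains against generative fidelity and offering a stable, controllable way to adapt diffusion models perceptually.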

Abstract: Diffusion models have achieved remarkable success in image generation, yet their training is predominantly driven by full-reference objectives that enforce pixel-wise similarity to ground-truth images. Such supervision, while effective for fidelity, may be insufficient in terms of subjective visual perception quality and text-image semantic consistency. In this work, we investigate the problem of incorporating no-reference perceptual quality into diffusion training. A key challenge is that directly optimizing perceptual signals, such as those provided by no-reference image quality assessment (NR-IQA) models, introduces a mismatch with the original diffusion objective, leading to training instability and distributional drift during fine-tuning. To address this issue, we propose an anchor-constrained optimization framework that enables stable perceptual adaptation. Specifically, we leverage a learned NR-IQA model as a perceptual guidance signal, while introducing an anchor-based regularization that enforces consistency with the base diffusion model in terms of noise prediction. This design effectively balances perceptual quality improvement and generative fidelity, allowing controlled adaptation toward perceptually favorable outputs without compromising the original generative behavior. Extensive experiments demonstrate that our method consistently enhances perceptual quality while preserving generation diversity and training stability, highlighting the effectiveness of anchor-constrained perceptual optimization for diffusion models.
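
The trade-off the abstract describes can be sketched as a two-term objective: reward a higher NR-IQA score, and penalize drift of the fine-tuned noise prediction away from the frozen base model. The function name `anchor_constrained_loss`, the mean-squared anchor term, and the `anchor_weight` parameter are assumptions for illustration; the abstract does not give the exact formulation.

```python
def anchor_constrained_loss(quality_score, eps_tuned, eps_base, anchor_weight=1.0):
    """Combine perceptual guidance with an anchor penalty.

    quality_score: scalar from a no-reference quality model (higher is better).
    eps_tuned / eps_base: flat lists of noise predictions from the fine-tuned
    model and the frozen base model for the same input.
    Assumption: the anchor term is a mean squared difference.
    """
    if len(eps_tuned) != len(eps_base) or not eps_tuned:
        raise ValueError("noise predictions must be non-empty and equal length")
    anchor = sum((a - b) ** 2 for a, b in zip(eps_tuned, eps_base)) / len(eps_tuned)
    # Minimizing this raises the quality score while limiting drift.
    return -quality_score + anchor_weight * anchor
```

A larger `anchor_weight` keeps the fine-tuned model closer to the base model at the cost of smaller perceptual gains.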

[39] CO-EVO: Co-evolving Semantic Anchoring and Style Diversification for Federated DG-ReID cs.CV | cs.LGPDF

Fengchun Zhang, Qiang Ma, Liuyu Xiang, Jinshan Lai, Tingxuan Huang

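TL;DR: This paper proposes CO-EVO, a new framework for federated domain generalization person re-identification (FedDG-ReID) that resolves the conflict between semantics and style through a co-evolutionary mechanism. It has two core components: Camera-Invariant Semantic Anchoring (CSA), which learns identity prompts that are consistent across cameras to establish purified, domain-agnostic anchors, and Global Style Diversification (GSD), which uses a Global Camera-Style Bank (GCSB) to synthesize realistic perturbations that expand the visual boundaries of the training data. The two interact in a co-evolutionary loop that guides the image encoder to learn robust identity features under diverse style variations.

Details

Motivation: FedDG-ReID is challenged by inherent style gaps across decentralized clients. Without global supervision, models easily fall into shortcut learning, where representations overfit to domain-specific camera biases rather than universal identity features. The paper aims to resolve this semantic-style conflict to improve generalization to unseen target environments while protecting raw data privacy.

Result: Extensive experiments show that CO-EVO achieves state-of-the-art performance on multiple benchmarks, indicating that the synergy between semantic purification and style expansion is essential for robust cross-domain generalization.

Insight: The novelty is a co-evolutionary mechanism that combines semantic anchoring (establishing domain-agnostic identity anchors that filter out local imaging noise) with style diversification (synthesizing realistic perturbations to expand visual boundaries), guiding the model through their iterative interaction toward more robust, general identity representations. Viewed objectively, jointly optimizing semantic purification and data augmentation within a federated learning framework offers a new perspective on domain shift and overfitting in federated domain generalization.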

Abstract: Federated domain generalization for person re-identification (FedDG-ReID) aims to collaboratively train a pedestrian retrieval model across multiple decentralized source domains such that it can generalize to unseen target environments without compromising raw data privacy. However, this task is significantly challenged by the inherent stylistic gaps across decentralized clients. Without global supervision, models easily succumb to shortcut learning where representations overfit to domain specific camera biases rather than universal identity features. We propose CO-EVO, a novel federated framework that resolves this semantic-style conflict through a co-evolutionary mechanism. On the semantic side, Camera-Invariant Semantic Anchoring (CSA) learns identity prompts with cross-camera consistency to establish purified and domain-agnostic anchors that filter out local imaging noise. On the visual side, Global Style Diversification (GSD), powered by a Global Camera-Style Bank (GCSB), synthesizes realistic perturbations to expand the visual boundaries of training data. The core of CO-EVO is its co-evolutionary loop where purified anchors act as gravitational centers to guide the image encoder toward robust anatomical attributes amidst diverse style variations. Extensive experiments demonstrate that CO-EVO achieves state-of-the-art (SOTA) performance, proving that the synergy between semantic purification and style expansion is essential for robust cross-domain generalization. Our code is available at: https://github.com/NanYiyuzurn/ACL-LGPS-2026.

[40] Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning cs.CV | cs.LG | math.ATPDF

Junwon You, Mihyun Jang, Sangwoo Mo, Jae-Hun Jung

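TL;DR: This paper proposes ToMA (Topology-Aware Multimodal Representation Alignment), a framework for semi-supervised vision-language learning. It uses persistent homology to identify topologically salient edges in multimodal representation manifolds and aligns them through available cross-modal correspondences to better model global structure, improving generalization in specialized domains such as remote sensing image retrieval and fashion retrieval.

Details

Motivation: Existing vision-language models generalize poorly to specialized domains. Semi-supervised methods mitigate this, but most are pairwise and do not model the global topological structure of multimodal representation manifolds. Existing topology-based alignment methods, such as persistence diagram matching, neither guarantee geometric alignment nor make full use of the image-text pairing information central to vision-language learning.

Result: Experiments show that ToMA yields clear improvements on remote sensing image retrieval and stable but modest improvements on fashion retrieval. Analysis shows that ToMA is more stable than other topology-based objectives and that lightweight H_1-birth edges provide useful higher-order structural signals.

Insight: The novelty lies in bringing persistent homology, specifically H_0-death edges and lightweight H_1-birth edges, into semi-supervised vision-language learning to align the topological structure of multimodal representations, capturing connectivity and cycle structure without constructing 2-simplices. This offers a new way to use topological information to strengthen cross-modal alignment.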

Abstract: Vision-language models have shown strong performance, but they often generalize poorly to specialized domains. While semi-supervised vision-language learning mitigates this limitation by leveraging a small set of labeled image-text pairs together with abundant unlabeled images, existing methods remain fundamentally pairwise and fail to model the global structure of multimodal representation manifolds. Existing topology-based alignment methods rely on persistence diagram matching, which neither guarantees geometric alignment nor utilizes the image-text pairing information central to vision-language learning. We propose Topology-Aware Multimodal Representation Alignment (ToMA), a framework that uses persistent homology to identify topologically salient edges and aligns them across modalities through available cross-modal correspondences. ToMA leverages both H_0-death edges and lightweight H_1-birth edges, allowing it to capture both connectivity and cycle structure without constructing 2-simplices. Experiments show that ToMA yields stable gains, with clear improvements on remote sensing and modest but consistent benefits on fashion retrieval. Additional analysis shows that ToMA is more stable than alternative topology-based objectives and that lightweight H_1-birth edges provide useful higher-order structural signals.
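
For readers unfamiliar with H_0-death edges: in a Vietoris-Rips filtration, the edges at which two connected components merge are exactly the edges of a minimum spanning tree over the pairwise distances, so they can be found with Kruskal's algorithm. The sketch below shows only that step, on Euclidean points; the function name `h0_death_edges` is hypothetical, and how ToMA aligns these edges across modalities (or computes H_1-birth edges) is not shown here.

```python
import math
from itertools import combinations


def h0_death_edges(points):
    """Return (i, j, length) for each edge that merges two components.

    Edges are processed in order of increasing length (Kruskal's
    algorithm with union-find); each merge corresponds to one H_0 death
    in the Vietoris-Rips filtration of the point cloud.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append((i, j, length))
    return deaths
```

For n points in general position this yields n - 1 edges, which keeps the number of aligned edges linear in the batch size.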

[41] A Multimodal Pre-trained Network for Integrated EEG-Video Seizure Detection cs.CVPDF

Tong Lu, Ke Xu, Zimo Zhang, Zitong Zhao, Danwei Weng

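TL;DR: This paper proposes EEGVFusion, a multimodal pre-trained network that integrates electroencephalography (EEG) and video data to detect seizures in mice. The method combines self-supervised EEG representation learning, spatio-temporal video encoding, optimal-transport alignment, and bidirectional cross-attention to fuse neural and behavioral evidence. The authors also curate an expert-annotated dataset of 93 sessions from 15 mice for training and evaluation. Experiments show that EEGVFusion achieves high balanced accuracy and a low false-alarm rate in both the random-session split and the held-out-subject test, clearly outperforming single-modality methods.

Details

Motivation: Reliable seizure detection in mice is essential for preclinical epilepsy research, but manually reviewing synchronized EEG-video recordings is labor-intensive, and single-modality systems fail for complementary reasons: video-based methods are easily confounded by benign behaviors, whereas EEG-based methods are vulnerable to ictal motion artifacts. A multimodal framework is therefore needed to integrate neural and behavioral evidence and make detection more robust and accurate.

Result: In the random-session split, EEGVFusion reaches a balanced accuracy of 0.9957 with perfect event sensitivity and an event false-alarm rate (FAR) of 0.6250 FP/h. In the held-out-subject test (Subject 110), balanced accuracy is 0.9718 and event FAR drops from 2.7250 FP/h for the EEG-only method to 0.4833 FP/h while perfect event sensitivity is preserved. These results indicate that EEGVFusion reaches a state-of-the-art level on seizure detection and substantially lowers the false-alarm burden.

Insight: The contributions are: 1) a multimodal fusion framework combining self-supervised EEG representation learning, video encoding, optimal-transport alignment, and bidirectional cross-attention to integrate complementary information; 2) an expert-annotated synchronized EEG-video dataset that serves as a benchmark for multimodal seizure detection research; 3) ablations showing that EEG pre-training and optimal-transport alignment help reduce false alarms while preserving detection sensitivity, a practical strategy for modality alignment in multimodal learning.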

Abstract: Reliable seizure detection in mouse models is essential for preclinical epilepsy research, yet manual review of synchronized video-EEG recordings is labor-intensive and single-modality systems fail for complementary reasons: video-based methods are easily confounded by benign behaviors, whereas EEG-based methods are vulnerable to ictal motion artifacts. We present EEGVFusion, a multimodal framework that combines self-supervised EEG representation learning, spatio-temporal video encoding, optimal-transport alignment, and bidirectional cross-attention to integrate neural and behavioral evidence. We also curate an expert-annotated dataset of synchronized EEG and video recordings comprising 93 sessions from 15 mice for training and evaluation. In the random-session split, EEGVFusion achieved a Balanced Accuracy of 0.9957 with perfect event sensitivity and an Event FAR of 0.6250 FP/h, indicating strong seizure detection performance with a low false-alarm burden. In a single held-out-subject evaluation with Subject 110 reserved for testing, EEGVFusion achieved a Balanced Accuracy of 0.9718 and reduced Event FAR from 2.7250 FP/h for the EEG-only counterpart to 0.4833 FP/h while preserving perfect event sensitivity. Targeted ablations further showed that EEG pre-training and OT alignment help reduce false alarms while preserving event sensitivity.
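
The two headline metrics can be computed as below, using their standard definitions: balanced accuracy is the mean of sensitivity and specificity, and an event false-alarm rate in FP/h is the number of false-positive events divided by the recording time in hours. The function names and the example counts in the usage note are made up for illustration; how the paper segments windows or defines an event is not stated in the abstract.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (TP rate) and specificity (TN rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def event_far_per_hour(false_positive_events, recording_hours):
    """False-positive events per hour of recording (FP/h)."""
    if recording_hours <= 0:
        raise ValueError("recording_hours must be positive")
    return false_positive_events / recording_hours
```

For example, 3 false-alarm events over 6 hours of recording give 0.5 FP/h. Balanced accuracy is useful here because seizure windows are rare, so plain accuracy would be dominated by the non-seizure class.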

[42] Decoupled Prototype Matching with Vision Foundation Models for Few-Shot Industrial Object Detection cs.CVPDF

Hari Prasanth S. M., Nilusha Jayawickrama, Risto Ojala

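TL;DR: This paper proposes a decoupled prototype matching method that uses vision foundation models for few-shot object detection in industrial scenes. It builds class prototypes from a few reference samples; at inference, a segmentation model generates object regions, feature embeddings are extracted, and detection is performed by similarity matching.

Details

Motivation: Industrial object detection systems usually depend on large annotated datasets, which are costly and hard to maintain where objects change frequently. The paper addresses few-shot detection of new objects for which only a few labeled samples are available.

Result: On three industrial datasets from the Benchmark for 6D Object Pose Estimation, following the official 2D object detection evaluation protocol, the method achieves competitive detection performance, improving AP by 6.9% over state-of-the-art training-free detection methods.

Insight: The novelty is training-free few-shot detection with vision foundation models: through a decoupled prototype matching mechanism, new objects can be introduced with only a few reference images, without CAD models or large annotated datasets, which suits real industrial applications.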

Abstract: Industrial object detection systems typically rely on large annotated datasets, which are expensive to collect and challenging to maintain in industrial scenarios where the inventory of objects changes frequently. This work addresses the challenge of few-shot object detection in such industrial scenarios, where only a limited number of labeled samples are available for newly introduced objects. We present a detection framework that leverages vision foundation models to recognize objects with minimal supervision. The method constructs class prototypes from a small set of reference samples by extracting feature representations. For a given query scene during inference, object regions are generated using a segmentation model, and feature embeddings are extracted and matched with class prototypes using similarity matching. We evaluate the detection method on three established industrial datasets from the Benchmark for 6D Object Pose Estimation, following the official 2D object detection evaluation protocol. We demonstrate competitive detection performance, improving AP by 6.9% compared to the state-of-the-art training-free detection methods. Furthermore, the presented method is able to onboard new objects using only a few reference images, without requiring any CAD models or large annotated datasets. These properties make the approach well-suited for real-world industrial applications.
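
The matching step can be sketched in a few lines, assuming embeddings have already been extracted by some feature model. Here a class prototype is taken to be the mean of its reference embeddings and a region is assigned to the most cosine-similar prototype. The mean pooling, the cosine measure, the `min_similarity` threshold, and all function names are assumptions for illustration; the abstract only says prototypes are built from reference features and matched by similarity.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_prototypes(references):
    """references maps a class label to a list of reference embeddings."""
    return {label: mean_vector(vectors) for label, vectors in references.items()}


def match_region(embedding, prototypes, min_similarity=0.5):
    """Return (label, score) for the best prototype, or (None, score)
    when the best score is below the assumed threshold."""
    label, score = max(
        ((name, cosine(embedding, proto)) for name, proto in prototypes.items()),
        key=lambda item: item[1],
    )
    return (label, score) if score >= min_similarity else (None, score)
```

Adding a new object class then only requires one more entry in `references`, with no retraining, which mirrors the onboarding property the abstract emphasizes.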

[43] Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection cs.CVPDF

Ahyoung Oh, Wonseok Shin, Songkuk Kim

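TL;DR: This paper is the first to apply sparse autoencoders to the [CLS] token of Vision Transformers for out-of-distribution detection. It finds that in-distribution data shows stable, class-specific activation patterns in the disentangled latent space, whereas out-of-distribution data systematically disrupts this structure. Building on this, the authors propose a scoring function based on the divergence of core energy profiles, achieving low FPR95 and competitive AUROC on multiple benchmarks.

Details

Motivation: Existing OOD detection methods rely on entangled feature representations, which limits them. Sparse autoencoders (SAEs) have succeeded in interpreting LLMs, but their potential for analyzing ViTs remains under-explored. The paper uses SAEs to disentangle the dense [CLS] features of ViTs and uncover new insights for OOD detection.

Result: Across multiple benchmarks, the method achieves strong results on FPR95, a key metric for safety-sensitive applications, along with competitive AUROC.

Insight: The core novelty is the first application of SAEs to the ViT [CLS] token for OOD analysis and the discovery of Class Activation Profiles as a structural invariant. The proposed scoring mechanism, based on disentangled sparse features and the divergence of core energy profiles, provides a powerful, interpretable tool for robust OOD detection in vision models.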

Abstract: Sparse Autoencoders (SAEs) have demonstrated significant success in interpreting Large Language Models (LLMs) by decomposing dense representations into sparse, semantic components. However, their potential for analyzing Vision Transformers (ViTs) remains largely under-explored. In this work, we present the first application of SAEs to the ViT [CLS] token for out-of-distribution (OOD) detection, addressing the limitation of existing methods that rely on entangled feature representations. We propose a novel framework utilizing a Top-k SAE to disentangle the dense [CLS] features into a structured latent space. Through this analysis, we reveal that in-distribution (ID) data exhibits consistent, class-specific activation patterns, which we formalize as Class Activation Profiles (CAPs). Our study uncovers a key structural invariant: while ID samples preserve a stable pattern within CAPs, OOD samples systematically disrupt this structure. Leveraging this insight, we introduce a scoring function based on the divergence of core energy profiles to quantify the deviation from ideal activation profiles. Across multiple benchmarks, our method achieves strong results on the FPR95 metric, which is critical for safety-sensitive applications, while also achieving competitive AUROC. Overall, our findings demonstrate that the sparse, disentangled features revealed by SAEs can serve as a powerful, interpretable tool for robust OOD detection in vision models.
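
The "Top-k" in a Top-k SAE refers to its activation: after the encoder projection, only the k largest latent pre-activations are kept and the rest are set to zero, which enforces sparsity directly. The sketch below shows just that activation on a plain list; the encoder and decoder weights, the Class Activation Profiles, and the paper's scoring function are not modeled, and the name `top_k_activation` is hypothetical.

```python
def top_k_activation(pre_activations, k):
    """Keep the k largest values and zero out all others.

    This is the sparsifying step of a Top-k sparse autoencoder, applied
    here to a plain list of latent pre-activations.
    """
    if k <= 0:
        return [0.0] * len(pre_activations)
    ranked = sorted(
        range(len(pre_activations)),
        key=lambda index: pre_activations[index],
        reverse=True,
    )
    keep = set(ranked[:k])
    return [value if index in keep else 0.0
            for index, value in enumerate(pre_activations)]
```

Because exactly k latents survive per input, the set of active indices for each sample is small enough to compare against per-class patterns.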

[44] Delineating Knowledge Boundaries for Honest Large Vision-Language Models cs.CV | cs.AIPDF

Junru Song, Yimeng Hu, Yijing Chen, Huining Li, Qian Li

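TL;DR: This paper proposes a systematic framework to strengthen the ability of large vision-language models (VLMs) to refuse questions beyond their knowledge. It first builds a model-specific "Visual-Idk" dataset via multi-sample consistency probing to separate known from unknown facts, then aligns the model with supervised fine-tuning and preference-aware optimization (e.g., DPO, ORPO) to delineate its knowledge boundaries.

Details

Motivation: Current large VLMs perform well on multimodal tasks but are prone to factual hallucinations, especially in long-tail or specialized domains, and have little ability to refuse queries that exceed their parametric knowledge.

Result: On the Visual-Idk dataset, the method raises the Truthful Rate from 57.9% to 67.3%. Internal probing further indicates that the model genuinely recognizes its knowledge boundaries rather than merely memorizing refusal patterns. The framework also generalizes to out-of-distribution medical and perceptual domains.

Insight: The novelty is to delineate knowledge boundaries systematically through model-specific dataset construction and preference-aware optimization, improving the model's honesty and trustworthiness and offering a robust path toward more reliable visual assistants.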

Abstract: Large Vision-Language Models (VLMs) have achieved remarkable multimodal performance yet remain prone to factual hallucinations, particularly in long-tail or specialized domains. Moreover, current models exhibit a weak capacity to refuse queries that exceed their parametric knowledge. In this paper, we propose a systematic framework to enhance the refusal capability of VLMs when facing such unknown questions. We first curate a model-specific “Visual-Idk” (Visual-I don’t know) dataset, leveraging multi-sample consistency probing to distinguish between known and unknown facts. We then align the model using supervised fine-tuning followed by preference-aware optimization (e.g., DPO, ORPO) to effectively delineate its knowledge boundaries. Results on the Visual-Idk dataset show our method improves the Truthful Rate from 57.9% to 67.3%. Additionally, internal probing also demonstrates that the model genuinely recognizes its boundaries instead of just memorizing refusal patterns. Our framework further generalizes to out-of-distribution medical and perceptual domains, providing a robust path toward more trustworthy and prudent visual assistants.
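
Multi-sample consistency probing can be pictured as follows: sample several answers from the model for the same question, measure how often they agree with the reference, and label the question "known" only if that rate is high. The sketch below assumes exact-match scoring after lowercasing and an 0.8 threshold; both choices and the name `label_knowledge` are illustrative assumptions, since the abstract does not specify them.

```python
def label_knowledge(sampled_answers, reference, known_threshold=0.8):
    """Label a question 'known' or 'unknown' from repeated model samples.

    sampled_answers: answers sampled from the model for one question.
    reference: the ground-truth answer.
    Assumptions: exact match after strip/lowercase, fixed threshold.
    """
    if not sampled_answers:
        raise ValueError("at least one sampled answer is required")
    target = reference.strip().lower()
    correct = sum(answer.strip().lower() == target for answer in sampled_answers)
    rate = correct / len(sampled_answers)
    return ("known" if rate >= known_threshold else "unknown"), rate
```

Questions labeled "unknown" would then be paired with a refusal response when building the fine-tuning data, which is what makes the resulting dataset specific to the probed model.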

[45] Are Data Augmentation and Segmentation Always Necessary? Insights from COVID-19 X-Rays and a Methodology Thereof cs.CVPDF

Aman Swaraj, Arnav Agarwal, Hitendra Singh Bhadouria, Sandeep Kumar, Karan Verma

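TL;DR: This paper critically examines two common questions in COVID-19 chest X-ray classification: whether lung segmentation is required and whether data augmentation is required. It visualizes the regions driving CNN predictions with class activation mapping, validating the need for lung segmentation, and shows through comparative experiments that excessive data augmentation leads to overfitting. Finally, it proposes a method called SDL-COVID that achieves high precision and a low false negative rate on COVID-19 detection.

Details

Motivation: To address two key problems in existing AI-based COVID-19 chest X-ray studies: most do not perform lung segmentation, which may undermine model reliability, and some use impractical, excessive data augmentation, which leads to poor generalization and overfitting.

Result: On COVID-19 chest X-ray detection, the proposed SDL-COVID method reaches a precision of 95.21% with a lower false negative rate. The experiments indicate that lung segmentation is necessary for accurate prediction and that test accuracy drops significantly once data augmentation exceeds a certain threshold, indicating overfitting.

Insight: The core contribution is a systematic, empirical study of two common but insufficiently validated preprocessing steps (lung segmentation and data augmentation), and a more reliable methodology (SDL-COVID) built on it. Viewed objectively, its research approach of validating basic assumptions through visualization and comparative experiments is broadly transferable, especially in medical image analysis, where it stresses the interpretability of model decisions and the soundness of preprocessing over blindly applying complex techniques.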

Abstract: Purpose: Rapid and reliable diagnostic tools are crucial for managing respiratory diseases like COVID-19, where chest X-ray analysis coupled with artificial intelligence techniques has proven invaluable. However, most existing works on X-ray images have not considered lung segmentation, raising concerns about their reliability. Additionally, some have employed disproportionate and impractical augmentation techniques, making models less generalized and prone to overfitting. This study presents a critical analysis of both issues and proposes a methodology (SDL-COVID) for more reliable classification of chest X-rays for COVID-19 detection. Methods: We use class activation mapping to obtain a visual understanding of the predictions made by Convolutional Neural Networks (CNNs), validating the necessity of lung segmentation. To analyze the effect of data augmentation, deep learning models are implemented on two levels: one for an augmented dataset and another for a non-augmented dataset. Results: Careful analysis of X-ray images and their corresponding heat maps under expert medical supervision reveals that lung segmentation is necessary for accurate COVID-19 prediction. Regarding data augmentation, test accuracy significantly drops beyond a certain threshold with additional augmented images, indicating model overfitting. Conclusion: Our proposed methodology, SDL-COVID, achieves a precision of 95.21% and a lower false negative rate, ensuring its reliability for COVID-19 detection using chest X-rays.

Wasim Ahmad, Wei Zhang, Xuerui Mao

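TL;DR: This paper proposes the Attribution-Guided Multimodal Deepfake Detection (AMDD) framework, which jointly learns detection and attribution so that the model captures generator-specific forgery traces rather than dataset-specific artifacts. It uses a Cross-Modal Forensic Fingerprint Consistency loss to enforce alignment between generator-induced artifacts in the visual and audio streams, improving generalization.

Details

Motivation: Most current multimodal deepfake detectors are binary classifiers that tend to learn dataset-specific artifacts instead of genuine generative traces, so they generalize poorly. The paper proposes attribution-guided learning, which imposes a stronger geometric constraint on the shared embedding space and forces the model to encode generator-specific forensic content.

Result: On FakeAVCeleb, AMDD achieves 99.7% balanced accuracy and 99.8% AUC, with 95.9% attribution accuracy. In cross-dataset evaluation on DeepfakeTIMIT, DFDM, and LAV-DF, real-video detection generalizes robustly, but fake detection on unseen generators remains an open challenge.

Insight: The novelty lies in treating generator attribution as a structured regularizer that constrains representation geometry toward forensically meaningful features; introducing a Cross-Modal Forensic Fingerprint Consistency loss that exploits the physical coupling between speech and facial articulation to align forgery traces across modalities; and balancing visual and audio encoder capacity (pairing ResNet50 with ResNet18) to close the encoder capacity gap of earlier models.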

Abstract: Audio-visual deepfakes have reached a level of realism that makes perceptual detection unreliable, threatening media integrity and biometric security. While multimodal detection has shown promise, most approaches are binary classification tasks that often latch onto dataset-specific artifacts rather than genuine generative traces. We argue that a detector incapable of identifying how a video was forged is likely learning the wrong signal. Unlike binary detection, attribution-guided learning imposes a stronger geometric constraint on the shared embedding space, forcing the model to encode generator-specific forensic content rather than shortcuts. We propose the Attribution-Guided Multimodal Deepfake Detection (AMDD) framework, which jointly learns to detect and attribute manipulation. AMDD treats generator attribution as a structured regularization that constrains representation geometry toward forensically meaningful features. We introduce a Cross-Modal Forensic Fingerprint Consistency (CMFFC) loss to enforce alignment between generator-induced artifacts in visual and audio streams. This exploits the fact that coherent manipulation leaves correlated traces across modalities, grounded in the physical coupling between speech and facial articulation that synthetic pipelines routinely disrupt. Architecturally, we pair a ResNet50 with temporal attention for visual encoding against a pretrained ResNet18 for mel spectrograms, closing the encoder capacity gap found in prior models. On FakeAVCeleb, AMDD achieves 99.7% balanced accuracy and 99.8% AUC with 95.9% attribution accuracy. Cross-dataset evaluation on DeepfakeTIMIT, DFDM, and LAV-DF confirms that real video detection generalizes robustly, while fake detection on unseen generators remains an open challenge that we analyze in depth.

[47] Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation cs.CVPDF

Gongshu Wang, Zhirui Wang, Kan Yang

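TL;DR: This paper proposes a Last-Layer-Centric Feature Recombination (LFR) module to improve monocular depth estimation built on the DINOv3 vision foundation model. The study finds that the 3D geometric information encoded in DINOv3 is distributed unevenly across transformer layers, with deeper features showing stronger depth predictability. LFR treats the last-layer features as a geometric anchor, adaptively selects complementary intermediate-layer features according to a minimal-similarity criterion, and fuses them through compact linear adapters.

Details

Motivation: Existing methods usually sample intermediate transformer layers uniformly to build multi-scale features, implicitly assuming that geometric information is evenly distributed across layers, which may underuse the structural 3D cues encoded in vision foundation models. The paper analyzes how geometric information is distributed across DINOv3 layers and designs a more effective feature-use strategy.

Result: Extensive experiments show that the LFR module consistently improves monocular depth estimation accuracy and reaches state-of-the-art performance on multiple benchmarks.

Insight: The core novelty is showing that 3D geometric knowledge in vision foundation models such as DINOv3 is distributed non-uniformly across layers, and proposing a lightweight, last-layer-centric module that adaptively selects complementary features for recombination. This offers an efficient strategy for unlocking vision foundation models in dense 3D tasks.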

Abstract: Monocular depth estimation (MDE) is a fundamental yet inherently ill-posed task. Recent vision foundation models (VFMs), particularly DINO-based transformers, have significantly improved accuracy and generalization for dense prediction. Prior works generally follow a unified paradigm: sampling a fixed set of intermediate transformer layers at uniform intervals to build multi-scale features. This common practice implicitly assumes that geometric information is uniformly distributed across layers, which may underutilize the structural 3D cues encoded in VFMs. In this study, we present a systematic layer-wise analysis of DINOv3, revealing that 3D information is distributed non-uniformly: deeper layers exhibit stronger depth predictability and better capture inter-sample geometric variation. Motivated by this, we introduce a Last-Layer-Centric Feature Recombination (LFR) module to enhance geometric expressiveness. LFR treats the final layer as a geometric anchor and adaptively selects complementary intermediate layers according to a minimal-similarity criterion. Selected features are fused with the last-layer representation via compact linear adapters. Extensive experiments show that the LFR module consistently improves MDE accuracy and achieves state-of-the-art performance. Our analysis sheds light on how geometric knowledge is organized within VFMs and offers an efficient strategy for unlocking their potential in dense 3D tasks.
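
A minimal-similarity selection rule can be sketched as: compare each intermediate layer's feature with the last layer's feature and keep the layers that are least similar, on the reasoning that they carry the most complementary information. The sketch below assumes one pooled feature vector per layer and cosine similarity; both assumptions and the name `select_complementary_layers` are illustrative, and the linear-adapter fusion is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_complementary_layers(layer_features, num_select):
    """Pick the intermediate layers least similar to the last layer.

    layer_features: one pooled feature vector per layer, last layer last.
    Returns the selected layer indices in ascending order.
    """
    anchor = layer_features[-1]
    scored = sorted(
        (cosine(feature, anchor), index)
        for index, feature in enumerate(layer_features[:-1])
    )
    return sorted(index for _, index in scored[:num_select])
```

Unlike sampling layers at uniform intervals, this choice depends on the features themselves, so different backbones (or inputs, if applied per sample) can select different layers.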

[48] $\text{PKS}^4$: Parallel Kinematic Selective State Space Scanners for Efficient Video Understanding cs.CVPDF

Lingjie Zeng, Hailun Zhang, Xiwen Wang, Qijun Zhao

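TL;DR: This paper proposes PKS^4 (Parallel Kinematic Selective State Space Scanners), a new method for efficient video understanding. It keeps a standard 2D vision backbone for spatial semantics and inserts a plug-and-play module with linear-complexity temporal scanning, avoiding the quadratic cost of dense spatiotemporal attention and the high activation memory overhead of PEFT methods. A Kinematic Prior Encoder extracts local displacements and motion boundaries, which drive linear-complexity state space models to track underlying kinematic states, and parallel scanners are deployed along the temporal dimension for each spatial location to preserve spatial structure.

Details

Motivation: Temporal modeling is a fundamental challenge in video understanding, especially for long sequences. Traditional video models that rely on dense spatiotemporal attention have quadratic computational cost, while recent approaches that adapt image models via parameter-efficient fine-tuning incur excessive activation memory during back-propagation because modules are inserted deeply. Efficient state space models bring linear complexity but disrupt 2D spatial relationships and depend on extensive masked pre-training to recover spatial awareness.

Result: PKS^4 achieves state-of-the-art performance on spatial-heavy and temporal-heavy action recognition benchmarks. It converges in only 20 epochs, with roughly 10 times lower training compute than pure video state space models.

Insight: The novelty lies in combining a 2D vision backbone with a linear-complexity temporal scanning module, using a Kinematic Prior Encoder to extract motion information that adaptively modulates state updates, and adopting parallel temporal scanners that preserve spatial structure while reducing overhead, which the paper positions as a new paradigm for efficient video understanding.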

Abstract: Temporal modeling remains a fundamental challenge in video understanding, particularly as sequence lengths scale. Traditional video models relying on dense spatiotemporal attention suffer from quadratic computational costs for long videos. To circumvent these costs, recent approaches adapt image models for videos via Parameter-Efficient Fine-Tuning (PEFT) methods such as adapters. However, deeply inserting these modules incurs prohibitive activation memory overhead during back-propagation. While recent efficient State Space Models (SSMs) introduce linear complexity, they disrupt 2D spatial relationships and rely on extensive masked pre-training to recover spatial awareness. To overcome these limitations, we propose Parallel Kinematic Selective State Space Scanners (PKS$^4$). We retain a standard 2D vision backbone for spatial semantics and insert a single plug-and-play PKS$^4$ module with linear-complexity temporal scanning, avoiding temporal attention and multi-layer adapters. We first extract kinematic priors via a Kinematic Prior Encoder, which captures local displacements and motion boundaries through inter-frame correlations and differences. These priors drive linear-complexity SSMs to track underlying kinematic states, adaptively modulating update speeds and read-write strategies at each time step. Instead of global scanning, we deploy parallel scanners along the temporal dimension for each spatial location, preserving spatial structures while reducing overhead. Experiments on spatial-heavy and temporal-heavy action recognition benchmarks show that PKS$^4$ achieves state-of-the-art performance. Remarkably, our method converges in merely $20$ epochs, achieving approximately $10\times$ lower training compute than pure video SSMs, establishing a new paradigm for efficient video understanding.
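
The idea of scanning time independently at each spatial location can be illustrated with a scalar linear recurrence, h_t = a_t * h_{t-1} + b_t * x_t, whose per-step coefficients stand in for the input-dependent update and write strengths. This is a deliberately simplified sketch: real selective state space models use vector states and learned parameterizations, and the names `temporal_scan` and `scan_all_locations` and the `decay`/`gain` inputs are assumptions, not the paper's module.

```python
def temporal_scan(sequence, decay, gain):
    """Run h_t = decay_t * h_{t-1} + gain_t * x_t over one time series.

    Cost is linear in the number of frames. decay and gain are per-step
    coefficients (in the paper's setting they would be driven by
    kinematic priors; here they are simply given).
    """
    state, outputs = 0.0, []
    for x, a, b in zip(sequence, decay, gain):
        state = a * state + b * x
        outputs.append(state)
    return outputs


def scan_all_locations(video, decay, gain):
    """Scan each spatial location's time series independently.

    video[loc], decay[loc], and gain[loc] are sequences over time for one
    spatial location. No state is shared between locations, so the scans
    could run in parallel and the spatial layout is left untouched.
    """
    return [
        temporal_scan(series, decay[loc], gain[loc])
        for loc, series in enumerate(video)
    ]
```

A decay near 1 keeps long memory of earlier frames, while a decay near 0 lets the state follow the current frame, which is one way to read "adaptively modulating update speeds".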

[49] A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows cs.CVPDF

Yuxuan Han, Yuanxing Zhang, Yushuo Wang, Yichao Jin, Kenneth Zhu Ke

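TL;DR: This paper proposes a multistage information extraction framework for long scanned financial documents in industrial KYC workflows. It integrates image preprocessing, multilingual OCR, hybrid page-level retrieval, and structured extraction with a compact vision-language model; by separating page localization from multimodal reasoning, it significantly improves the accuracy of structured extraction from complex multipage documents.

Details

Motivation: To meet the industrial need for reliable structured extraction from scanned financial documents that are long, multilingual, not machine-readable, noisy, and visually heterogeneous, since existing end-to-end vision-language models give unreliable results when applied directly to full financial reports under real-world conditions.

Result: Evaluated on 120 production KYC documents comprising about 3,000 multilingual scanned pages, the pipeline consistently outperforms direct PDF-to-VLM baselines, improving field-level accuracy by up to 31.9 percentage points. The best configuration (PaddleOCR with MiniCPM2.6) reaches 87.27% accuracy. Ablations show that page-level retrieval is the dominant factor in the gains.

Insight: The main novelty is decomposing end-to-end extraction into a multistage pipeline, in particular decoupling page localization (retrieval) from content understanding (extraction), which handles sparse information and noise in long documents effectively. From an industrial standpoint, the modular design makes it easy to integrate different OCR and VLM components and highlights the key role of page-level retrieval in complex, multilingual document processing.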

Abstract: Structured information extraction from long, multilingual scanned financial documents is a core requirement in industrial KYC and compliance workflows. These documents are typically non-machine-readable, noisy, and visually heterogeneous. They usually span dozens of pages while containing only sparse task-relevant information. Although recent vision-language models achieve strong benchmark performance, directly applying them end-to-end to full financial reports often leads to unreliable extraction under real-world conditions. We present a multistage extraction framework that integrates image preprocessing, multilingual OCR, hybrid page-level retrieval, and compact VLM-based structured extraction. The design separates page localization from multimodal reasoning, enabling more accurate extraction from complex multipage documents. We evaluated the framework on 120 production KYC documents comprising about 3000 multilingual scanned pages. Across multiple OCR-VLM combinations, the proposed pipeline consistently outperforms direct PDF-to-VLM baselines, improving field-level accuracy by up to 31.9 percentage points. The best configuration, PaddleOCR with MiniCPM2.6, achieves 87.27 percent accuracy. Ablation studies show that page-level retrieval is the dominant factor in performance improvements, particularly for complex financial statements and non-English documents.

[50] Cross-Domain Transfer of Hyperspectral Foundation Models cs.CVPDF

Nick Theisen, Peer Neubert

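TL;DR: This paper proposes a cross-domain transfer approach that reuses hyperspectral imaging (HSI) foundation models, originally trained in remote sensing, for proximal sensing applications, addressing the limited data available for HSI semantic segmentation. The approach avoids the spectral information loss and architectural complexity of cross-modality methods and performs well on the HS3-Bench benchmark.

Details

Motivation: HSI semantic segmentation typically relies on in-domain training, but limited data in real applications restricts model performance. Existing approaches that leverage foundation models mostly use cross-modality techniques (bridging RGB and HSI), which either discard spectral information or introduce complex architectures.

Result: On HS3-Bench, cross-domain transfer yields large improvements over in-domain, in-modality training, narrows the gap to cross-modality methods, and remains strong in limited-data settings.

Insight: The contribution is the cross-domain transfer strategy itself: directly reusing HSI foundation models (from remote sensing to proximal sensing) sidesteps cross-modality complexity and preserves spectral information. Viewed objectively, it offers a simple and efficient route to model reuse for data-scarce HSI applications.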

Abstract: Hyperspectral imaging (HSI) semantic segmentation typically relies on in-domain training, but limited data availability often restricts model performance in real-world applications. Current approaches to leverage foundation models in proximal sensing use cross-modality techniques, bridging RGB and HSI to exploit vision foundation models. However, these methods either discard spectral information or introduce architectural complexity. We propose cross-domain transfer as an alternative, reusing HSI foundation models - originally trained in remote sensing - for proximal sensing applications. By eliminating the need to bridge modality gaps, our approach preserves spectral information while maintaining a simple architecture. Using the HS3-Bench benchmark, we systematically evaluate and compare conventional in-domain, in-modality training, cross-modality transfer and cross-domain transfer strategies. Our results demonstrate that cross-domain transfer achieves large performance improvements over in-domain, in-modality training, reduces the performance gap to cross-modality approaches and maintains strong performance in limited data settings. Thus, this work advances more effective HSI semantic segmentation in diverse applications.

[51] Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners cs.CV | cs.LGPDF

Nikita Araslanov, Martin Sundermeyer, Hidenobu Matsuki, David Joseph Tan, Federico Tombari

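TL;DR: This paper proposes LILA, a framework that learns pixel-accurate feature descriptors from videos by training linear in-context learners, so that the descriptors embed the spatio-temporal properties of dynamic 3D scenes. Using spatio-temporal cue maps such as depth and motion estimated by off-the-shelf networks, it trains effectively on uncurated video datasets and embeds semantic and geometric properties in a temporally consistent manner.

Details

Motivation: Existing vision foundation models lack representations that effectively embed the spatio-temporal properties of visual scenes at the pixel level: image-based pretext tasks ignore dynamic elements, while training on video sequences targets action-level reasoning and does not scale to dense pixel-level prediction.

Result: The learned representation shows clear empirical benefits on several vision tasks, including video object segmentation, surface normal estimation, and semantic segmentation.

Insight: The novelty lies in using linear in-context learning to learn pixel-level features from noisy spatio-temporal cues, yielding spatio-temporally consistent embeddings of dynamic scenes. The idea could be reused to improve representation learning for other pixel-level vision tasks.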

Abstract: One of the most exciting applications of vision models involves pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction. We present a framework that learns pixel-accurate feature descriptors from videos, LILA. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps – depth and motion – estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner. We demonstrate compelling empirical benefits of the learned representation across a diverse suite of vision tasks: video object segmentation, surface normal estimation and semantic segmentation.

[52] Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models cs.CVPDF

Haosen Li, Wenshuo Chen, Lei Wang, Shaofeng Liang, Bowen Tian

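TL;DR: This paper addresses the "detail-artifact dilemma" of classifier-free guidance (CFG) in diffusion models and proposes Spatial Adaptive Multi Guidance (SAMG), a training-free, virtually zero-cost sampling algorithm. A differential-geometric analysis traces the dilemma to the orthogonal deviation introduced by CFG's uniform linear extrapolation. Building on a theoretically derived upper bound for spatially adaptive guidance, SAMG dynamically computes point-wise conditional guidance energy and applies conservative or aggressive guidance scales in different regions, achieving better semantic alignment while preserving structural integrity and temporal consistency.

Details

Motivation: Standard CFG uses a globally uniform scalar guidance scale, which traps generation in the "detail-artifact dilemma": low scales fail to inject fine semantics, while high scales cause structural degradation, color over-saturation, and temporal inconsistency in videos. The paper aims to expose the geometric root of this flaw and propose a remedy.

Result: Extensive experiments on image (SD 1.5, SDXL, SD3.5 Medium) and video (CogVideoX, ModelScope) architectures show that SAMG effectively resolves the detail-artifact dilemma, delivering superior semantic alignment, structural integrity, and temporal smoothness with no additional computational overhead.

Insight: The novelty is a differential-geometric view (via Tweedie's formula) showing that CFG is in essence a tangential linear extrapolation, which introduces an orthogonal deviation on a highly curved data manifold. From this the authors derive a theoretical guidance upper bound and design a spatially adaptive multi-guidance strategy based on point-wise conditional guidance energy, giving training-free, near-zero-cost guidance for high-quality generation.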

Abstract: Diffusion models have achieved remarkable success in synthesizing complex static and temporal visuals, a breakthrough largely driven by Classifier-Free Guidance (CFG). However, despite its pivotal role in aligning generated content with textual prompts, standard CFG relies on a globally uniform scalar. This homogeneous amplification traps models in a well-documented “detail-artifact dilemma”: low guidance scales fail to inject intricate semantics, while high scales inevitably cause structural degradation, color over-saturation, and temporal inconsistencies in videos. In this paper, we expose the physical root of this flaw through the lens of differential geometry. By analyzing Tweedie’s Formula, we reveal that CFG intrinsically performs a tangential linear extrapolation. Because the natural data manifold is highly curved, this uniform linear step introduces a severe orthogonal deviation. To keep the generation trajectory safely bounded, we formulate a theoretical upper bound for spatial and adaptive guidance. Based on these geometric insights, we propose Spatial Adaptive Multi Guidance (SAMG), a training-free and virtually zero-cost sampling algorithm. SAMG dynamically computes point-wise conditional guidance energy, applying a conservative minimum scale to high-energy boundary regions to preserve delicate micro-textures, while deploying an aggressive maximum scale in low-energy regions to maximize semantic injection. Extensive experiments across diverse image (SD 1.5, SDXL, SD3.5 Medium) and video (CogVideoX, ModelScope) architectures demonstrate that SAMG effectively resolves the detail-artifact dilemma, achieving superior semantic alignment, structural integrity, and temporal smoothness without any computational overhead.
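To make the contrast with a uniform guidance scalar concrete, here is a minimal sketch of a spatially varying guidance scale. It assumes that the point-wise guidance energy is the squared difference between conditional and unconditional predictions, and that the scale is interpolated linearly between `s_min` and `s_max`. Both choices are illustrative assumptions; the abstract does not give the exact SAMG rule or its theoretical bound.

```python
def cfg_uniform(uncond, cond, scale):
    """Standard classifier-free guidance with one global scale."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def spatially_adaptive_guidance(uncond, cond, s_min, s_max):
    """Per-location guidance scale driven by a point-wise energy.

    uncond, cond: flat lists of per-location predictions (floats).
    High-energy locations receive a scale near s_min (conservative);
    low-energy locations receive a scale near s_max (aggressive).
    The energy definition and linear interpolation are assumptions.
    """
    delta = [c - u for u, c in zip(uncond, cond)]
    energy = [d * d for d in delta]
    e_max = max(energy)
    if e_max == 0:
        # Conditional and unconditional predictions agree everywhere.
        return list(uncond)
    guided = []
    for u, d, e in zip(uncond, delta, energy):
        w = e / e_max  # normalised energy in [0, 1]
        scale = s_max - w * (s_max - s_min)
        guided.append(u + scale * d)
    return guided
```

The extra work is one pass over the two predictions that CFG already computes, which is consistent with the "virtually zero-cost" framing above.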

[53] DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation cs.CVPDF

Mingji Ge, Qirui Chen, Zeqian Li, Weidi Xie

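TL;DR: This paper presents an automated, training-free pipeline for extracting high-quality, structured procedural step annotations from in-the-wild instructional videos. The pipeline segments videos into shots, filters poorly aligned content, and uses state-of-the-art multimodal and large language models (Qwen2.5-VL and DeepSeek-R1) to generate temporally grounded steps. The resulting dataset, DenseStep2M, contains about 100K videos and 2M detailed steps and is intended to support comprehensive long-form video understanding.

Details

Motivation: Instructional video corpora such as HowTo100M suffer from noisy ASR transcripts and inconsistent temporal alignment between narration and visual content. The goal is to generate high-quality procedural video annotations automatically in support of long-term video understanding.

Result: Evaluation on DenseCaption100, a curated benchmark of high-quality human-written captions, shows strong alignment between the auto-generated steps and human annotations. The utility of DenseStep2M is validated on three downstream tasks (dense video captioning, procedural step grounding, and cross-modal retrieval): models fine-tuned on it achieve substantial gains in captioning quality and temporal localization and show robust zero-shot generalization across egocentric, exocentric, and mixed-perspective domains.

Insight: The novelty is a fully training-free, scalable annotation pipeline that combines shot segmentation, alignment filtering, and state-of-the-art multimodal and large language models to produce structured, temporally grounded steps from noisy videos. The resulting large-scale dataset, DenseStep2M, supports long-form video understanding and multimodal alignment, with clear downstream gains and good cross-domain generalization.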

Abstract: Long-term video understanding requires interpreting complex temporal events and reasoning over procedural activities. While instructional video corpora, like HowTo100M, offer rich resources for model training, they present significant challenges, including noisy ASR transcripts and inconsistent temporal alignments between narration and visual content. In this work, we introduce an automated, training-free pipeline to extract high-quality procedural annotations from in-the-wild instructional videos. Our approach segments videos into coherent shots, filters poorly aligned content, and leverages state-of-the-art multimodal and large language models (Qwen2.5-VL and DeepSeek-R1) to generate structured, temporally grounded procedural steps. This pipeline yields DenseStep2M, a large-scale dataset comprising approximately 100K videos and 2M detailed instructional steps, designed to support comprehensive long-form video understanding. To rigorously evaluate our pipeline, we curate DenseCaption100, a benchmark of high-quality, human-written captions. Evaluations demonstrate strong alignment between our auto-generated steps and human annotations. Furthermore, we validate the utility of DenseStep2M across three core downstream tasks: dense video captioning, procedural step grounding, and cross-modal retrieval. Models fine-tuned on DenseStep2M achieve substantial gains in captioning quality and temporal localization, while exhibiting robust zero-shot generalization across egocentric, exocentric, and mixed-perspective domains. These results underscore the effectiveness of DenseStep2M in facilitating advanced multimodal alignment and long-term activity reasoning. Our dataset is available at https://huggingface.co/datasets/mingjige/DenseStep2M.

[54] AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision cs.CVPDF

Xiaoya Cheng, Rouwan Wu, Xinyi Liu, Zeyu Cui, Yan Liu

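TL;DR: This paper introduces AirZoo, a unified large-scale dataset and benchmark for grounding aerial geometric 3D vision. Its scalable generation pipeline uses world-scale photogrammetric 3D meshes to render vast outdoor environments with customizable UAV flight trajectories and configurable weather and illumination, addressing the severe shortage of high-quality training data in this area.

Details

Motivation: Data-driven 3D vision is advancing rapidly, but aerial geometric 3D vision remains very challenging because large-scale, high-fidelity training data is lacking. Existing benchmarks are biased toward ground-level or object-centric views and do not account for the complex viewpoint transformations and diverse environmental conditions of UAV-based sensing.

Result: Experiments on three evaluation tracks (aerial image retrieval, cross-view matching, and multi-view 3D reconstruction) show that AirZoo serves as a powerful pre-training engine. On public and newly collected real-world benchmarks, fine-tuning on AirZoo substantially improves SoTA models such as MegaLoc, RoMa, VGGT, and Depth Anything 3, establishing a new performance upper bound for aerial spatial intelligence.

Insight: The contribution is a unified large-scale aerial dataset with a scalable generation pipeline, comprehensive scene diversity, and rich geometric annotations. Generating controllable synthetic data from freely available, world-scale 3D data is a reusable recipe for domains with scarce data, and the systematic scene coverage and precise geometric annotations are essential for geometry-aware learning.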

Abstract: Despite the rapid progress in data-driven 3D vision, aerial geometric 3D vision remains a formidable challenge due to the severe scarcity of large-scale, high-fidelity training data. Existing benchmarks, predominantly biased toward ground-level or object-centric views, do not account for complex viewpoint transformations and diverse environmental conditions in UAV-based sensing. To bridge this critical gap, we propose AirZoo, a unified large-scale dataset and benchmark for grounding aerial geometric 3D vision. AirZoo possesses three appealing properties: 1) Scalable Generation Pipeline: Leveraging freely available, world-scale photogrammetric 3D meshes, it renders vast outdoor environments with customizable UAV flight trajectories and configurable weather/illumination. 2) Comprehensive Scene Diversity: It provides the most extensive coverage of region types to date (spanning 378 regions across 22 countries), systematically encompassing both highly structured urban landscapes and complex unstructured natural environments. 3) Rich Geometric Annotations: Each frame provides synchronized, pixel-level metric depth and precise 6-DoF geo-referenced poses, essential for geometry-aware learning. Through three rigorous evaluation tracks – aerial image retrieval, cross-view matching, and multi-view 3D reconstruction – we demonstrate that AirZoo serves as a powerful pre-training engine. Extensive experiments on both public and newly collected real-world benchmarks reveal that fine-tuning on AirZoo yields substantial performance gains for SoTA models (e.g., MegaLoc, RoMa, VGGT, and Depth Anything 3), establishing a new performance upper bound for aerial spatial intelligence.

May Hammad, Menatallh Hammad

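TL;DR: This paper proposes Star-Fusion, a multi-modal Transformer architecture for spacecraft attitude determination. It reformulates continuous celestial orientation estimation as a discrete topological classification task, using spherical K-Means clustering to partition the celestial sphere into K topologically consistent regions and thereby mitigate coordinate wrapping artifacts. The model fuses three branches: a SwinV2-Tiny Transformer, a convolutional heatmap branch, and a coordinate-based MLP.

Details

Motivation: Traditional "Lost-in-Space" algorithms are computationally expensive and sensitive to sensor noise, while standard deep learning regression models struggle with the non-Euclidean topology of the celestial sphere and the periodic boundary conditions of Right Ascension and Declination.

Result: On a synthetic Hipparcos-derived dataset, Star-Fusion achieves 93.4% Top-1 and 97.8% Top-3 accuracy. Inference latency is 18.4 ms on resource-constrained COTS hardware, indicating potential for real-time deployment.

Insight: The core idea is to turn continuous attitude regression into discrete topological classification, using spherical clustering to avoid coordinate periodicity. The multi-modal fusion strategy (Transformer, CNN, MLP) handles photometric, spatial, and geometric information respectively, balancing accuracy and efficiency.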

Abstract: Reliable celestial attitude determination is a critical requirement for autonomous spacecraft navigation, yet traditional “Lost-in-Space” (LIS) algorithms often suffer from high computational overhead and sensitivity to sensor-induced noise. While deep learning has emerged as a promising alternative, standard regression models are often confounded by the non-Euclidean topology of the celestial sphere and by the periodic boundary conditions of Right Ascension (RA) and Declination (Dec). In this paper, we present Star-Fusion, a multi-modal architecture that reformulates orientation estimation as a discrete topological classification task. Our approach leverages spherical K-Means clustering to partition the celestial sphere into K topologically consistent regions, effectively mitigating coordinate wrapping artifacts. The proposed architecture employs a tripartite fusion strategy: a SwinV2-Tiny transformer backbone for photometric feature extraction, a convolutional heatmap branch for spatial grounding, and a coordinate-based MLP for geometric anchoring. Experimental evaluations on a synthetic Hipparcos-derived dataset demonstrate that Star-Fusion achieves a Top-1 accuracy of 93.4% and a Top-3 accuracy of 97.8%. Furthermore, the model exhibits high computational efficiency, maintaining an inference latency of 18.4 ms on resource-constrained COTS hardware, making it a viable candidate for real-time onboard deployment in next-generation satellite constellations.
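The wrap-around problem and the clustering remedy can be sketched in a few lines. The example below converts RA/Dec to unit vectors, where 359° and 1° are close, and runs a generic spherical K-Means (cosine-similarity assignment, renormalised mean update). It is a textbook version with caller-supplied initial centroids, not the authors' implementation, and the function names are hypothetical.

```python
import math


def radec_to_unit(ra_deg, dec_deg):
    """Map (RA, Dec) in degrees to a 3D unit vector on the celestial sphere."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(points, init_centroids, iters=10):
    """Cluster unit vectors by cosine similarity.

    points: list of unit 3-vectors; init_centroids: list of k unit vectors.
    Returns (centroids, labels). Empty clusters keep their old centroid.
    """
    centroids = list(init_centroids)
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: most similar centroid by dot product.
        labels = [max(range(k), key=lambda j: _dot(p, centroids[j]))
                  for p in points]
        # Update step: mean direction, projected back onto the sphere.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                continue
            mean = tuple(sum(c) for c in zip(*members))
            norm = math.sqrt(_dot(mean, mean))
            if norm > 1e-12:
                centroids[j] = tuple(c / norm for c in mean)
    return centroids, labels
```

A classifier trained on the resulting cluster labels never sees the RA discontinuity at 0°/360°, which is the property the abstract relies on.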

[56] State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading cs.CVPDF

Yuanze Hu, Gen Li, Yuqin Lan, Qingchen Yu, Zhichao Yang

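TL;DR: This paper studies the performance bottleneck of multimodal large language models (MLLMs) on dial-based measurement reading. Existing models not only fall short in accuracy but also degrade sharply under viewpoint and illumination changes, indicating that they over-rely on superficial appearance cues and ignore the intrinsic geometry of dial states. The authors propose TriSCA, a framework that improves state consistency at three levels: state-distance-aware representation alignment, metadata-grounded observation-to-state supervision, and state-aware objective alignment.

Details

Motivation: MLLMs perform poorly on dial-based measurement reading. In particular, they fail to recognize the same underlying state consistently when appearance changes, because they ignore the task's intrinsic state geometry and rely on surface cues.

Result: Extensive ablation studies and evaluations on controlled clock and gauge benchmarks, together with an external real-world benchmark, demonstrate that TriSCA is effective and improves both accuracy and state consistency on dial reading.

Insight: The contribution is diagnosing the root problem, that MLLMs ignore state geometry in dial reading, and proposing a tri-level state-consistent alignment framework (TriSCA) that aligns representation, supervision, and objective so the model learns the intrinsic state structure rather than relying on volatile appearance features.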

Abstract: Multimodal large language models (MLLMs) have achieved impressive progress on general multimodal tasks, yet they remain brittle on dial-based measurement reading. In this paper, we study this problem through controlled benchmarks and feature-space probing, and show that current MLLMs not only achieve unsatisfactory accuracy on dial-based readout, but also suffer sharp performance drops under viewpoint and illumination changes even when the underlying dial state remains fixed. Our probing analysis further reveals that same-state samples under appearance variation are not consistently clustered, while neighboring states fail to preserve the local structure implied by continuous dial values. These findings suggest that existing MLLMs largely ignore the intrinsic state geometry of dial measurement tasks and instead rely on superficial appearance cues. Motivated by this diagnosis, we propose TriSCA, a tri-level state-consistent alignment framework for dial-based measurement reading. Specifically, TriSCA consists of state-distance-aware representation alignment, metadata-grounded observation-to-state supervision, and state-aware objective alignment. Extensive ablation studies and evaluation experiments on controlled clock and gauge benchmarks, together with evaluation on an external real-world benchmark, demonstrate the effectiveness of our method.

[57] SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection cs.CV | cs.AIPDF

Paul Julius Kühn, Mika Pommeranz, Arjan Kuijper, Saptarshi Neil Sinha

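TL;DR: This paper presents SynSur, an end-to-end pipeline for synthetic industrial surface defect generation and detection, targeting the scarcity of labeled defect data that bottlenecks learning-based industrial defect detection. The pipeline combines vision-language-model-based prompts, LoRA-adapted diffusion, mask-guided inpainting, and sample filtering with automatic label derivation to produce realistic synthetic defect samples. Evaluation on a dataset of pitting defects on ball screw drives and on a subset of a mobile phone screen surface defect segmentation dataset shows that synthetic-only training does not replace real data, but that synthetic defects combined with real data preserve performance and yield modest gains in selected training regimes.

Details

Motivation: The bottleneck in industrial defect detection is usually not model capacity but the scarcity of labeled defect data: defects are rare, annotation is expensive, and balanced training sets are hard to collect.

Result: Evaluation uses the ball screw drive pitting dataset (BSData) and a subset of the mobile phone screen surface defect segmentation dataset (MSD). Experiments with YOLOv26, YOLOX, and LW-DETR show that synthetic-only training does not replace real data. Combined with real data, synthetic defects preserve performance and yield modest gains in selected BSData training regimes. The MSD transfer study shows that the overall pipeline structure carries over to a second industrial inspection domain.

Insight: The contribution is a complete, analyzable end-to-end pipeline for synthetic defect generation and annotation, with a systematic assessment of how its key stages (prompt construction, LoRA selection, and sample filtering with DreamSim and CLIPScore) affect the realism and usefulness of the generated samples. The central finding is that diffusion-based industrial defect synthesis is most valuable for strengthening scarce real datasets rather than replacing them, which underlines the importance of domain-specific adaptation and annotation-quality control.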

Abstract: Learning-based industrial defect detection is often limited not by model capacity, but by the scarcity of labeled defect data: defects are rare, annotations are expensive, and collecting balanced training sets is slow. We present an end-to-end pipeline for synthetic defect generation and annotation, combining Vision-Language-Model-based prompts, LoRA-adapted diffusion, mask-guided inpainting, and sample filtering with automatic label derivation, and demonstrate the potential of combining real data with realistic synthetic samples to overcome data scarcity. The evaluation is conducted on BSData, a challenging dataset of pitting defects on ball screw drives, and then on a subset of the Mobile phone screen surface defect segmentation dataset (MSD) to test cross-domain transfer. Beyond downstream detector performance, we analyze key stages of the pipeline, including prompt construction, LoRA selection, and sample filtering with DreamSim and CLIPScore, to understand which synthetic samples are both realistic and useful. Experiments with YOLOv26, YOLOX, and LW-DETR show that synthetic-only training does not replace real data. When combined with real data, synthetic defects can preserve performance and yield modest gains in selected BSData training regimes. The MSD transfer study shows that the overall pipeline structure carries over to a second industrial inspection domain, while also highlighting the importance of domain-specific adaptation and annotation-quality control. Overall, the paper provides an end-to-end assessment of diffusion-based industrial defect synthesis and shows that its strongest value lies in strengthening scarce real datasets rather than substituting for them.

[58] CurEvo: Curriculum-Guided Self-Evolution for Video Understanding cs.CV | cs.LGPDF

Guiyi Zeng, Junqing Yu, Yi-Ping Phoebe Chen, Xu Chen, Wei Yang

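TL;DR: This paper proposes CurEvo, a curriculum-guided self-evolution framework that brings curriculum learning into the self-evolution process to make autonomous improvement of video understanding models more structured and progressive. The framework dynamically regulates task difficulty, refines evaluation criteria, and balances data diversity, forming a feedback loop aligned with model capability and turning weakly controlled self-evolution into a structured learning process.

Details

Motivation: Existing self-evolving video understanding frameworks lack structured guidance during iterative learning, which leads to weakly controlled optimization and uncontrolled difficulty progression. Curriculum learning is introduced to provide more systematic guidance.

Result: On four VideoQA benchmarks and across seven backbones, CurEvo consistently improves both benchmark accuracy and evaluator-based semantic score, validating curriculum-guided self-evolution for video understanding.

Insight: The novelty is combining curriculum learning with self-evolution: a multi-dimensional adaptive QA framework jointly evolves question generation and answer evaluation, keeping curriculum progression coherent and measurable and moving from weakly controlled to structured learning.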

Abstract: Recent advances in self-evolution video understanding frameworks have demonstrated the potential of autonomous learning without human annotations. However, existing methods often suffer from weakly controlled optimization and uncontrolled difficulty progression, as they lack structured guidance throughout the iterative learning process. To address these limitations, we propose CurEvo, a curriculum-guided self-evolution framework that introduces curriculum learning into self-evolution to achieve more structured and progressive model improvement. CurEvo dynamically regulates task difficulty, refines evaluation criteria, and balances data diversity according to model competence, forming a curriculum-guided feedback loop that aligns learning complexity with model capability. Built upon this principle, we develop a multi-dimensional adaptive QA framework that jointly evolves question generation and answer evaluation across perception, recognition, and understanding dimensions, ensuring coherent and measurable curriculum progression. Through this integration, CurEvo transforms weakly controlled self-evolution into a more structured learning process for autonomous video understanding. Across seven backbones, CurEvo consistently improves both benchmark accuracy and evaluator-based semantic score on four VideoQA benchmarks, validating the effectiveness of curriculum-guided self-evolution for video understanding.

[59] Learning Sparse BRDF Measurement Samples from Image cs.CV | cs.GRPDF

Wen Cao

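TL;DR: This paper proposes a method for learning sparse BRDF measurement samples from images, aiming to reconstruct material appearance efficiently from a small number of measurements. The method combines a set encoder, a pretrained hypernetwork-based BRDF reconstructor, and a differentiable renderer. During sampler training the reconstructor is kept fixed and measurement locations are optimized with gradients from BRDF-space and rendered-image losses. On the MERL dataset, the method improves low-budget reconstruction quality at 8 and 16 measurements over baselines, while PCA-based methods remain strong at larger budgets.

Details

Motivation: Accurate BRDF acquisition is essential for realistic rendering, but dense gonioreflectometer measurements are slow and expensive. The paper therefore studies how to select a small number of the most useful BRDF measurements for reconstructing material appearance under a learned reflectance prior.

Result: Experiments on the MERL dataset show that the method improves low-budget reconstruction quality at 8 and 16 measurements compared with neural reconstruction baselines, while PCA-based methods retain an advantage at larger measurement budgets.

Insight: The novelty is separating sample selection from prior fitting: by combining a set encoder, a pretrained reconstructor, and a differentiable renderer, gradients optimize the measurement locations and encourage the sampler to choose directions that are informative under the learned material distribution. The paper also analyzes the effect of image-space supervision, co-optimization, and image-only latent fitting for unseen materials.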

Abstract: Accurate BRDF acquisition is important for realistic rendering, but dense gonioreflectometer measurements are slow and expensive. We study how to select a small number of BRDF measurements that are most useful for reconstructing material appearance under a learned reflectance prior. Our method combines a set encoder for sparse coordinate-value observations, a pretrained hypernetwork-based BRDF reconstructor, and a differentiable renderer. During sampler training, the reconstructor is kept fixed and gradients from BRDF-space and rendered-image losses are used to optimize measurement locations. This separates sample selection from prior fitting and encourages the sampler to choose directions that are informative under the learned material distribution. Experiments on the MERL dataset show that the proposed sampler improves low-budget reconstruction quality at 8 and 16 measurements compared with neural reconstruction baselines, while PCA-based methods remain strong at larger budgets. We further analyze the effect of image-space supervision, co-optimization, and image-only latent fitting for unseen materials.

[60] GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents cs.CVPDF

GLM-V Team, Wenyi Hong, Xiaotao Gu, Ziyang Pan

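TL;DR: GLM-5V-Turbo is a native foundation model for multimodal agents that integrates multimodal perception as a core component of reasoning, planning, tool use, and execution rather than as an auxiliary interface to a language model.

Details

Motivation: As foundation models are deployed in real environments, agentic capability depends not only on language reasoning but also on perceiving, interpreting, and acting over multimodal contexts such as images, videos, webpages, documents, and GUIs, which calls for a native foundation model built around multimodal perception.

Result: The model performs strongly on multimodal coding, visual tool use, and framework-based agentic tasks while preserving competitive text-only coding capability.

Insight: The novelty is treating multimodal perception as a core component of agent reasoning instead of an add-on interface, realized through improvements in model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. The practical lessons highlight the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.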

Abstract: We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.

[61] TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection cs.CVPDF

Ahmed Abdullah, Nikolas Ebert, Oliver Wasenmüller

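TL;DR: This paper systematically evaluates a range of vision foundation models (VFMs) on AI-generated image (AIGI) detection and finds that modern VFMs clearly outperform the original CLIP-ViT. The authors propose tunable attention pooling (TAP), a simple redesign of the classifier head that aggregates patch-token features. Combined with the latest VFMs, TAP yields substantial gains across several AIGI detection benchmarks and sets a new state of the art on two challenging in-the-wild benchmarks.

Details

Motivation: Many state-of-the-art AIGI detection methods build on CLIP-ViT, but the potential of the numerous VFMs released since CLIP, with their different architectures and training paradigms, remains largely unexplored for this task. The paper aims to evaluate their out-of-the-box performance comprehensively and to explore how to exploit their features better.

Result: Benchmarking across multiple VFM families shows that the best model exceeds the original CLIP by more than 12% in detection accuracy and beats established approaches. Integrating TAP with the latest VFMs brings substantial gains on several AIGI detection benchmarks and establishes a new SOTA on two challenging benchmarks for in-the-wild detection of AI-generated and AI-inpainted images.

Insight: The core contributions are the first systematic benchmark of modern VFMs on AIGI detection and TAP, a simple and effective module that redesigns the classifier head to aggregate patch-token features with learnable attention, making fuller use of modern VFM representations. It offers a lightweight feature-aggregation recipe for applying pretrained foundation models to downstream tasks.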

Abstract: Recent methods demonstrate that large-scale pretrained models, such as CLIP vision transformers, effectively detect AI-generated images (AIGIs) from unseen generative models when used as feature extractors. Many state-of-the-art methods for AI-generated image detection build upon the original CLIP-ViT to enhance this generalization. Since CLIP’s release, numerous vision foundation models (VFMs) have emerged, incorporating architectural improvements and different training paradigms. Despite these advances, their potential for AIGI detection and AI image forensics remains largely unexplored. In this work, we present a comprehensive benchmark across multiple VFM families, covering diverse pretraining objectives, input resolutions, and model scales. We systematically evaluate their out-of-the-box performance for detecting fully-generated AI-images and AI-inpainted images, and discover that the best model outperforms the original CLIP by more than 12% in accuracy, beating established approaches in the process. To fully leverage the features of a modern VFM, we propose a simple redesign of the classifier head by utilizing tunable attention pooling (TAP), which aggregates output tokens into a refined global representation. Integrating TAP with the latest VFMs yields substantial performance gains across several AIGI detection benchmarks, establishing a new state-of-the-art on two challenging benchmarks for in-the-wild detection of AI-generated and -inpainted images.
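As a rough illustration of aggregating output tokens into one global representation, the sketch below implements generic single-query attention pooling: a query vector (learnable in practice) scores each token, the scores pass through a softmax, and the tokens are averaged with those weights. The abstract does not specify TAP's exact design, so the single-query form, the scaling, and the function name are assumptions.

```python
import math


def attention_pool(tokens, query):
    """Pool a list of token vectors into one vector with softmax attention.

    tokens: list of d-dimensional lists (e.g. patch-token features).
    query: d-dimensional list; in a trained model this would be a
    learnable parameter (an assumption here).
    Returns (pooled_vector, attention_weights).
    """
    d = len(query)
    scores = [sum(t_i * q_i for t_i, q_i in zip(t, query)) / math.sqrt(d)
              for t in tokens]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d)]
    return pooled, weights
```

With a zero query the weights are uniform and the result equals mean pooling, so attention pooling strictly generalizes the average-pooling baseline.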

[62] MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification cs.CV | cs.AIPDF

Zuzheng Kuang, Honghao Chang, Boqiang Liang, Haoqian Wang, Lijun He

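TL;DR: MemOVCD is a training-free open-vocabulary change detection framework that detects semantic changes in bi-temporal remote sensing images through cross-temporal memory reasoning and global-local adaptive rectification. It recasts bi-temporal change detection as a two-frame tracking problem, aggregates semantic evidence with weighted bidirectional propagation, smooths abrupt appearance changes with histogram-aligned transition frames, and fuses local and global-view predictions with a global-local adaptive rectification strategy to improve spatial consistency.

Details

Motivation: Existing methods typically process each timestamp independently or interact only at the final comparison stage, so temporal coupling during semantic reasoning is insufficient and genuine semantic changes are hard to separate from non-semantic appearance differences. In addition, patch-dominant inference on high-resolution images weakens global semantic continuity and produces fragmented change regions.

Result: Experiments on five benchmarks show that MemOVCD achieves strong performance on two change detection tasks, validating its effectiveness and generalization under diverse open-vocabulary settings.

Insight: The novelties are recasting change detection as a tracking problem to couple semantics across time, introducing weighted bidirectional propagation and histogram-aligned transition frames to stabilize memory, and balancing detail preservation against spatial consistency with global-local adaptive rectification. These ideas may transfer to other temporal vision tasks.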

Abstract: Open-vocabulary change detection aims to identify semantic changes in bi-temporal remote sensing images without predefined categories. Recent methods combine foundation models such as SAM, DINO and CLIP, but typically process each timestamp independently or interact only at the final comparison stage. Such paradigms suffer from insufficient temporal coupling during semantic reasoning, which limits their ability to distinguish genuine semantic changes from non-semantic appearance discrepancies. In addition, patch-dominant inference on high-resolution images often weakens global semantic continuity and produces fragmented change regions. To address these issues, we propose MemOVCD, a training-free open-vocabulary change detection framework based on cross-temporal memory reasoning and global-local adaptive rectification. Specifically, we reformulate bi-temporal change detection as a two-frame tracking problem and introduce weighted bidirectional propagation to aggregate semantic evidence from both temporal directions. To stabilize memory propagation across large temporal gaps, we construct histogram-aligned transition frames to smooth abrupt appearance changes. Moreover, a global-local adaptive rectification strategy adaptively fuses local and global-view predictions, improving spatial consistency while preserving fine-grained details. Experiments on five benchmarks demonstrate that MemOVCD achieves favorable performance on two change detection tasks, validating its effectiveness and generalization under diverse open-vocabulary settings.
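One plausible building block for a histogram-aligned transition frame is classic histogram matching, which remaps the intensities of one image so that their distribution follows that of another. The sketch below shows that standard technique on flat lists of integer pixel values; how MemOVCD actually constructs its transition frames is not described in the abstract, so this is an assumption for illustration only.

```python
def match_histogram(source, reference, levels=256):
    """Remap `source` intensities so their CDF follows that of `reference`.

    source, reference: flat lists of ints in [0, levels).
    Returns a new list with the same length and pixel order as `source`.
    """
    def cdf(pixels):
        hist = [0] * levels
        for p in pixels:
            hist[p] += 1
        acc, out = 0, []
        for h in hist:
            acc += h
            out.append(acc / len(pixels))
        return out

    src_cdf, ref_cdf = cdf(source), cdf(reference)
    mapping, j = [], 0
    for i in range(levels):
        # Both CDFs are non-decreasing, so the pointer j only moves forward.
        while j < levels - 1 and ref_cdf[j] < src_cdf[i]:
            j += 1
        mapping.append(j)
    return [mapping[p] for p in source]
```

Applied per channel, this keeps the spatial content of the first acquisition while borrowing the tonal statistics of the second, which is one way to soften an abrupt appearance change between two dates.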

[63] Virtual-reality based patient-specific simulation of spine surgical procedures: A fast, highly automated and high-fidelity system for surgical education and planning cs.CVPDF

Raj Kumar Ranabhat, Tayler D Ross, Tony Jiao, Jeremie Larouche, Joel Finkelstein

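TL;DR: This study develops a fast, highly automated, high-fidelity patient-specific spine surgery simulation system based on virtual reality (VR) and artificial intelligence (AI). Using computed tomography (CT) and magnetic resonance imaging (MRI) data, the system automatically builds 3D anatomical models and simulates spinal decompression procedures, including laminectomy, disc resection, and foraminotomy, in a virtual operating room for surgical education and pre-operative planning.

Details

Motivation: In traditional surgical training, direct operating room exposure is limited, and existing VR training is usually based on standardized scenarios that are not tailored to individual cases. The study uses AI-based computer vision methods to generate patient-specific simulations from CT and MRI data, providing a safe, immersive, and personalized training environment.

Result: The system generates high-fidelity patient-specific 3D models efficiently (about 2.5 minutes per case, N = 15). Segmentation accuracy is high, with a Dice Similarity Coefficient (DSC) of 0.95 (±0.03) for vertebral bone and 0.895 (±0.02) for soft tissue. For registration, the mean Target Registration Error (TRE) is 1.73 (±0.42) mm. Qualitative feedback from surgeons and trainees indicates improved spatial understanding, increased procedural confidence, and strong perceived educational value.

Insight: The novelty is combining AI-driven fusion and segmentation of multimodal medical images (CT/MRI) with VR simulation to obtain fast, automated patient-specific surgical simulation, reducing modelling time and cost and opening a new route to personalized pre-operative planning and surgical education. Viewed objectively, the high degree of workflow automation is the key factor that makes the conversion from medical images to an interactive VR simulation efficient and practical.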

Abstract: Surgical training involves didactic teaching, mentor-led learning, surgical skills laboratories, and direct exposure to surgery; however, increasing clinical pressures have limited operating room (OR) exposure. This work leverages virtual reality (VR) to provide a safe and immersive training environment. Existing VR training is often based on standardized scenarios not tailored to individual clinical cases. This study addresses this limitation using artificial intelligence (AI) based computer vision methods to generate patient-specific simulations from computed tomography (CT) and magnetic resonance imaging (MRI). This study focuses on patient-specific spinal decompression simulation for spinal stenosis in a virtual operating room. The objectives were (1) automatic creation of 3D anatomical models and (2) VR simulation of spinal decompression procedures including laminectomy, disc resection, and foraminotomy. Model construction required multimodal fusion (registration) of CT and MRI and segmentation of relevant structures. Segmentation was evaluated using the Dice Similarity Coefficient (DSC), and registration accuracy using Target Registration Error (TRE). Qualitative feedback was obtained from surgeons and trainees. High-fidelity patient-specific 3D models were generated efficiently (approximately 2.5 minutes per case, N = 15). Segmentation accuracy was high, with a DSC of 0.95 (+/- 0.03) for vertebral bone and 0.895 (+/- 0.02) for soft tissue structures. Registration accuracy showed a mean TRE of 1.73 (+/- 0.42) mm. Semi-structured interviews indicated improved spatial understanding, increased procedural confidence, and strong perceived educational value. This platform significantly reduced the time and costs of patient-specific modelling, thereby facilitating pre-operative planning, post-procedural assessments, and comprehensive surgical simulation.
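For readers unfamiliar with the two metrics reported above, the following sketch computes them from their standard definitions: DSC = 2|A ∩ B| / (|A| + |B|) on binary masks, and TRE as the mean Euclidean distance between registered landmarks and their targets. The flat-list mask format and the function names are illustrative assumptions, not the study's evaluation code.

```python
import math


def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient between two binary masks (flat lists)."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if size == 0 else 2.0 * intersection / size


def mean_target_registration_error(moved, targets):
    """Mean Euclidean distance between registered and target landmarks.

    moved, targets: equal-length lists of (x, y, z) points, e.g. in mm.
    """
    distances = [math.dist(a, b) for a, b in zip(moved, targets)]
    return sum(distances) / len(distances)
```

A DSC of 1.0 means perfect overlap and 0.0 means none, while TRE is expressed in the units of the landmark coordinates, which is why the result above is quoted in millimetres.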

[64] Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization cs.CVPDF

Mingbo Hong, Feng Liu, Caroline Gevaert, George Vosselman, Hao Cheng

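TL;DR: This paper proposes Bridge, a new domain generalization framework that combines causal inference with Vision Foundation Models (VFMs) to address the performance drop in object detection caused by the distribution gap between source and target domains. By learning low-rank bases for front-door adjustment, Bridge blocks the effect of confounders to mitigate spurious correlations, while refining representations by filtering out redundant and task-irrelevant components. The method integrates seamlessly with both discriminative (e.g., DINOv2/3, SAM) and generative (e.g., Stable Diffusion) VFMs.

Details

Motivation: Object detectors trained on a single source domain with limited data tend to rely on source-domain confounders (e.g., illumination, co-occurrence, style), which produces spurious correlations and prevents generalization to new domains. The paper addresses this problem.

Result: Extensive experiments on multiple domain generalization object detection datasets (Cross-Camera, Adverse Weather, Real-to-Artistic, Diverse Weather Datasets, and the newly augmented real-world UAV benchmark Diverse Weather DroneVehicle) show that the method outperforms previous state-of-the-art (SOTA) approaches.

Insight: The novelty is a causal inference framework built on low-rank bases (Bridge) that brings front-door adjustment into object detection to block confounding effects and that integrates flexibly with a range of VFMs, offering a new causality-driven approach to domain generalization.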

Abstract: Detectors often suffer from degraded performance, primarily due to the distributional gap between the source and target domains. This issue is especially evident in single-source domains with limited data, as models tend to rely on confounders (e.g., illumination, co-occurrence, and style) from the source domain, leading to spurious correlations that hinder generalization. To this end, this paper proposes a novel Basis-driven framework for domain generalization, namely Bridge, that incorporates causal inference into object detection. By learning the low-rank bases for front-door adjustment, Bridge blocks confounders’ effects to mitigate spurious correlations, while simultaneously refining representations by filtering redundant and task-irrelevant components. Bridge can be seamlessly integrated with both discriminative (e.g., DINOv2/3, SAM) and generative (e.g., Stable Diffusion) Vision Foundation Models (VFMs). Extensive experiments across multiple domain generalization object detection datasets, i.e., Cross-Camera, Adverse Weather, Real-to-Artistic, Diverse Weather Datasets, and Diverse Weather DroneVehicle (our newly augmented real-world UAV-based benchmark), underscore the superiority of our proposed method over previous state-of-the-art approaches. The project page is available at: https://mingbohong.github.io/Bridge/.
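For readers unfamiliar with the term, the textbook front-door adjustment formula is reproduced below. Here $X$ is the input, $Y$ the prediction target, and $M$ a mediator through which $X$ affects $Y$; these symbol names are chosen for illustration. How Bridge parameterizes the mediator with low-rank bases is not stated in the abstract, so this shows only the generic identity, not the authors' estimator.

```latex
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  \;=\; \sum_{m} P(m \mid x) \sum_{x'} P(Y \mid m, x')\, P(x')
```

The inner sum averages over all inputs $x'$ rather than conditioning on the observed one, which is what removes the influence of an unobserved confounder acting on both $X$ and $Y$.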

[65] Uncertainty-Aware Pedestrian Attribute Recognition via Evidential Deep Learning cs.CV

Zhuofan Lou, Shihang Zhang, Fangle Zhu, Shengjie Ye, Pingyu Wang

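TL;DR: This paper proposes UAPAR, an uncertainty-aware pedestrian attribute recognition framework based on Evidential Deep Learning (EDL). According to the authors, it is the first to bring EDL to the PAR task: a Region-Aware Evidence Reasoning module captures fine-grained local features and estimates epistemic uncertainty, and an uncertainty-guided dual-stage curriculum learning strategy mitigates label noise during training. Experiments on several standard datasets show that UAPAR identifies unreliable predictions effectively while keeping recognition performance competitive.

Details

Motivation: Conventional deterministic PAR methods cannot assess how reliable their predictions are on low-quality samples. The goal is to make the system more robust in complex real-world scenarios.

Result: Extensive experiments on the PA100K, PETA, RAPv1, and RAPv2 datasets show that UAPAR achieves competitive or superior performance, and qualitative analysis confirms that its uncertainty estimates are predictive of challenging or erroneous samples.

Insight: The novelty is the first combination of EDL with a CLIP-based architecture for PAR, together with the Region-Aware Evidence Reasoning module and the uncertainty-guided curriculum learning strategy. Viewed objectively, the systematic pairing of uncertainty quantification with noise-robust training is a useful reference for improving the trustworthiness of visual recognition models in open environments.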

Abstract: We propose UAPAR, an Uncertainty-Aware Pedestrian Attribute Recognition framework. To the best of our knowledge, this is the first EDL-based uncertainty-aware framework for pedestrian attribute recognition (PAR). Unlike conventional deterministic methods, which fail to assess prediction reliability on low-quality samples, UAPAR effectively identifies unreliable predictions and thus enhances system robustness in complex real-world scenarios. To achieve this, UAPAR incorporates Evidential Deep Learning (EDL) into a CLIP-based architecture. Specifically, a Region-Aware Evidence Reasoning module employs cross-attention and spatial prior masks to capture fine-grained local features, which are further processed by an evidence head to estimate attribute-wise epistemic uncertainty. To further enhance training robustness, we develop an uncertainty-guided dual-stage curriculum learning strategy to alleviate the adverse effects of severe label noise during training. Extensive experiments on the PA100K, PETA, RAPv1, and RAPv2 datasets demonstrate that UAPAR achieves competitive or superior performance. Furthermore, qualitative results confirm that the proposed framework generates uncertainty estimates that are predictive of challenging or erroneous samples.
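As a rough illustration of how an evidence head can yield an uncertainty score, the sketch below uses the common subjective-logic formulation of EDL: non-negative evidence defines Dirichlet parameters, and uncertainty is the number of classes divided by the total Dirichlet strength. This is an assumption about the general technique; UAPAR's actual evidence head, loss, and per-attribute setup are not specified in the abstract, and the function name is hypothetical.

```python
import numpy as np

def edl_uncertainty(evidence):
    """Map non-negative evidence of shape (..., K) to expected probabilities and uncertainty.

    Assumed formulation: alpha = evidence + 1, S = sum(alpha), u = K / S.
    """
    evidence = np.asarray(evidence, dtype=float)
    alpha = evidence + 1.0                      # Dirichlet concentration parameters
    strength = alpha.sum(axis=-1, keepdims=True)
    prob = alpha / strength                     # expected class probabilities
    num_classes = evidence.shape[-1]
    uncertainty = num_classes / strength[..., 0]
    return prob, uncertainty
```

With zero evidence the uncertainty is 1 (the model knows nothing), and it shrinks toward 0 as evidence accumulates, which is the behaviour one would want for flagging low-quality samples.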

[66] Graph-based Semantic Calibration Network for Unaligned UAV RGBT Image Semantic Segmentation and A Large-scale Benchmark cs.CV

Fangqiang Fan, Zhicheng Zhao, Xiaoliang Ma, Chenglong Li, Jin Tang

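TL;DR: This paper proposes the Graph-based Semantic Calibration Network (GSCNet) for UAV RGB-Thermal (RGBT) image semantic segmentation. It targets cross-modal spatial misalignment caused by sensor parallax and platform vibration, as well as semantic confusion among fine-grained ground-object categories under top-down views. The paper also builds URTF, which the authors describe as the largest and most fine-grained benchmark to date for unaligned UAV RGBT image semantic segmentation.

Details

Motivation: UAV RGBT image semantic segmentation faces two coupled challenges: cross-modal spatial misalignment caused by sensor parallax and platform vibration, and severe semantic confusion among fine-grained ground-object categories under top-down aerial views.

Result: Extensive experiments on the proposed URTF benchmark show that GSCNet significantly outperforms existing state-of-the-art (SOTA) methods, with notable gains on fine-grained categories.

Insight: The contributions include: (1) a Feature Decoupling and Alignment Module (FDAM) that decouples each modality's features into shared structural and private perceptual components and performs deformable alignment in the shared subspace, giving robust spatial correction with less interference from modality-specific appearance; (2) a Semantic Graph Calibration Module (SGCM) that explicitly encodes the hierarchical taxonomy and co-occurrence regularities of ground-object categories in UAV scenes as a structured category graph, and injects these priors into prediction through graph-attention reasoning to calibrate predictions for visually similar and rare categories. The large-scale, fine-grained URTF dataset with realistic misalignment is a further significant contribution.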

Abstract: Fine-grained RGBT image semantic segmentation is crucial for all-weather unmanned aerial vehicle (UAV) scene understanding. However, UAV RGBT semantic segmentation faces two coupled challenges: cross-modal spatial misalignment caused by sensor parallax and platform vibration, and severe semantic confusion among fine-grained ground objects under top-down aerial views. To address these issues, we propose a Graph-based Semantic Calibration Network (GSCNet) for unaligned UAV RGBT image semantic segmentation. Specifically, we design a Feature Decoupling and Alignment Module (FDAM) that decouples each modality into shared structural and private perceptual components and performs deformable alignment in the shared subspace, enabling robust spatial correction with reduced modality appearance interference. Moreover, we propose a Semantic Graph Calibration Module (SGCM) that explicitly encodes the hierarchical taxonomy and co-occurrence regularities among ground-object categories in UAV scenes into a structured category graph, and incorporates these priors into graph-attention reasoning to calibrate predictions of visually similar and rare categories. In addition, we construct the Unaligned RGB-Thermal Fine-grained (URTF) benchmark, to the best of our knowledge, the largest and most fine-grained benchmark for unaligned UAV RGBT image semantic segmentation, containing over 25,000 image pairs across 61 categories with realistic cross-modal misalignment. Extensive experiments on URTF demonstrate that GSCNet significantly outperforms state-of-the-art methods, with notable gains on fine-grained categories. The dataset is available at https://github.com/mmic-lcl/Datasets-and-benchmark-code.

[67] Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction cs.CV

David Novikov, Eilon Vaknin, Narek Tumanyan, Mark Sheinin

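TL;DR: This paper proposes a new method for capturing and reconstructing a volumetric representation of a high-speed scene using unaugmented low-speed cameras. High-speed scene dynamics are encoded by illuminating the scene with a rapid, sequential color-coded sequence, which allows simultaneous multi-view capture in which high-speed temporal information is encoded in the spatial intensity and color variations of the captured images. To build the high-speed volumetric representation of the dynamic scene, the authors develop a new dynamic Gaussian Splatting-based approach that decodes the temporal information from the images.

Details

Motivation: Conventional cameras are bandwidth-limited to 30-60 FPS, which makes it hard to capture 3D representations of high-speed dynamic scenes. Existing computational imaging methods usually require modifying the camera hardware or adding mechanically moving parts, which restricts them to single-view high-speed capture. This paper aims to overcome these limits and achieve multi-view high-speed volumetric reconstruction with low-speed cameras.

Result: The method is evaluated on simulated scenes and in a real-world multi-camera imaging setup, demonstrating first-of-a-kind high-speed volumetric scene reconstructions and validating the approach.

Insight: The novelty is encoding temporal information into spatial intensity and color variations through a color-coded illumination sequence, which avoids hardware modification, and decoding it with a dynamic Gaussian Splatting method, enabling low-cost, multi-view, high-speed volumetric reconstruction.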

Abstract: The task of capturing and rendering 3D dynamic scenes from 2D images has become increasingly popular in recent years. However, most conventional cameras are bandwidth-limited to 30-60 FPS, restricting these methods to static or slowly evolving scenes. While overcoming bandwidth limitations is difficult for general scenes, recent years have seen a flurry of computational imaging methods that yield high-speed videos using conventional cameras for specific applications (e.g., motion capture and particle image velocimetry). However, most of these methods require modifications to a camera’s optics or the addition of mechanically moving components, limiting them to a single-view high-speed capture. Consequently, these methods cannot be readily used to capture a 3D representation of rapid scene motion. In this paper, we propose a novel method to capture and reconstruct a volumetric representation of a high-speed scene using only unaugmented low-speed cameras. Instead of modifying the hardware or optics of each individual camera, we encode high-speed scene dynamics by illuminating the scene with a rapid, sequential color-coded sequence. This results in simultaneous multi-view capture of the scene, where high-speed temporal information is encoded in the spatial intensity and color variations of the captured images. To construct a high-speed volumetric representation of the dynamic scene, we develop a novel dynamic Gaussian Splatting-based approach that decodes the temporal information from the images. We evaluate our approach on simulated scenes and real-world experiments using a multi-camera imaging setup, showing first-of-a-kind high-speed volumetric scene reconstructions.

[68] World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning cs.CV

Wanyue Zhang, Wenxiang Wu, Wang Xu, Jiaxin Luo, Helu Zhi

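TL;DR: This paper proposes World2VLM, a training framework that distills spatial imagination from a generative world model into a vision-language model (VLM) to strengthen its dynamic spatial reasoning. Given an initial observation and a camera trajectory, the world model synthesizes geometrically aligned future views, from which structured supervision for forward and inverse spatial reasoning is generated. The method outperforms both the base model and inference-time coupling methods on multiple benchmarks.

Details

Motivation: Existing VLMs perform poorly on dynamic spatial reasoning tasks that require imagining how a scene evolves under egocentric motion. The paper addresses this while avoiding the limitations of existing approaches: scaling up synthetic data lacks explicit modeling of state transitions, and coupling a world model at inference time incurs high computational overhead.

Result: On several spatial reasoning benchmarks, including SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube, World2VLM improves consistently over the base model and surpasses test-time world-model-coupled methods, which require expensive inference-time generation.

Insight: The novelty is turning the world model from an inference-time tool into a training-time teacher: distillation lets the VLM internalize spatial imagination, improving dynamic spatial reasoning scalably and efficiently without costly generation at inference time.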

Abstract: Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion-conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills spatial imagination from a generative world model into a vision-language model. Given an initial observation and a parameterized camera trajectory, we use a view-consistent world model to synthesize geometrically aligned future views and derive structured supervision for both forward (action-to-outcome) and inverse (outcome-to-action) spatial reasoning. We post-train the VLM with a two-stage recipe on a compact dataset generated by this pipeline and evaluate it on multiple spatial reasoning benchmarks. World2VLM delivers consistent improvements over the base model across diverse benchmarks, including SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube. It also outperforms the test-time world-model-coupled methods while eliminating the need for expensive inference-time generation. Our results suggest that world models can serve not only as inference-time tools, but also as effective training-time teachers, enabling VLMs to internalize spatial imagination in a scalable and efficient manner.

[69] ProcFunc: Function-Oriented Abstractions for Procedural 3D Generation in Python cs.CV

Alexander Raistrick, Karhan Kayan, Jack Nugent, David Yan, Lingjie Mei

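TL;DR: ProcFunc is a Blender-based Python library for procedural 3D generation. It provides easy-to-use functions that streamline creating, combining, analyzing, and executing procedural generation code. It supports generating large-scale, diverse training data through combinatorial composition of semantic components, and lets vision-language models (VLMs) edit procedural material and geometry code with fewer coding errors. As an example use case, the authors build a procedural generator of indoor rooms, including new compositional procedural materials, and demonstrate its detail, runtime efficiency, and diversity for 3D synthetic data generation.

Details

Motivation: Procedural 3D generation code is complex and error-prone to write, and there is demand for large-scale, diverse training data. The paper addresses both by providing function-oriented abstractions that simplify the workflow.

Result: The paper demonstrates ProcFunc on an indoor room generator, achieving detailed, efficient, and diverse 3D synthetic data generation, but does not mention specific benchmarks or quantitative comparisons with existing methods.

Insight: The novelty is abstracting procedural 3D generation into an easy-to-use Python function library that supports composition of semantic components and VLM-assisted code editing, lowering the technical barrier and improving generation efficiency. The approach could inform the development of other 3D generation tools.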

Abstract: We introduce ProcFunc, a library for Blender-based procedural 3D generation in Python. ProcFunc provides a library of easy-to-use Python functions, which streamline creating, combining, analyzing, and executing procedural generation code. ProcFunc makes it easy to create large-scale diverse training data, by combinatorial compositions of semantic components. VLMs can use ProcFunc to edit procedural material and geometry code and can create new procedural code with significantly fewer coding errors. Finally, as an example use case, we use ProcFunc to develop a new procedural generator of indoor rooms, which includes a collection of new compositional procedural materials. We demonstrate the detail, runtime efficiency, and diversity of this room generator, as well as its use for 3D synthetic data generation. Please visit https://github.com/princeton-vl/procfunc for source code.

Wanrong Zheng, Yunhao Ge, Laurent Itti

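TL;DR: This paper proposes Three-Step Nav, a hierarchical global-local planner for zero-shot Vision-and-Language Navigation (VLN). It follows a three-view protocol: first “look forward” to extract global landmarks and sketch a coarse plan, then “look now” to align the current visual observation with the next sub-goal for fine-grained guidance, and finally “look backward” to audit the entire trajectory and correct accumulated drift before stopping. The method needs no gradient updates or task-specific fine-tuning, plugs into existing VLN pipelines, and achieves state-of-the-art zero-shot performance on the R2R-CE and RxR-CE datasets.

Details

Motivation: Current zero-shot VLN agents built on multimodal large language models still tend to drift off course, stop prematurely, and achieve low overall success rates. The paper counters these failures with a hierarchical planning strategy.

Result: Achieves state-of-the-art zero-shot performance on the R2R-CE and RxR-CE datasets.

Insight: The novelty is a training-free, three-step hierarchical planning framework (global landmark extraction, local alignment, and trajectory auditing) that combines forward, current, and backward views to systematically address drift and incomplete planning in navigation. It is a lightweight, plug-in enhancement to the planning strategy.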

Abstract: Breakthrough progress in vision-based navigation through unknown environments has been achieved by using multimodal large language models (MLLMs). These models can plan a sequence of motions by evaluating the current view at each time step against the task and goal given to the agent. However, current zero-shot Vision-and-Language Navigation (VLN) agents powered by MLLMs still tend to drift off course, halt prematurely, and achieve low overall success rates. We propose Three-Step Nav to counteract these failures with a three-view protocol: First, “look forward” to extract global landmarks and sketch a coarse plan. Then, “look now” to align the current visual observation with the next sub-goal for fine-grained guidance. Finally, “look backward” audits the entire trajectory to correct accumulated drift before stopping. Requiring no gradient updates or task-specific fine-tuning, our planner drops into existing VLN pipelines with minimal overhead. Three-Step Nav achieves state-of-the-art zero-shot performance on the R2R-CE and RxR-CE datasets. Our code is available at https://github.com/ZoeyZheng0/3-step-Nav.

cs.DB

[71] CacheRAG: A Semantic Caching System for Retrieval-Augmented Generation in Knowledge Graph Question Answering cs.DB | cs.CL

Yushi Sun, Lei Chen

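TL;DR: This paper proposes CacheRAG, a semantic caching system for Knowledge Graph Question Answering (KGQA). It targets the schema hallucinations and limited retrieval coverage that arise because existing LLM-driven KGQA systems act as stateless planners. Through three design principles (a schema-agnostic user interface, diversity-optimized cache retrieval, and bounded heuristic expansion), the system turns stateless planners into continual learners and improves performance significantly.

Details

Motivation: Existing LLM-based KGQA systems act as stateless planners that generate retrieval plans in isolation and cannot exploit historical query patterns, much like a database system that optimizes every query from scratch without a plan cache. This leads to schema hallucinations and limited retrieval coverage.

Result: Extensive experiments on multiple benchmarks show that CacheRAG significantly outperforms state-of-the-art baselines, for example +13.2% accuracy and +17.5% truthfulness on the CRAG dataset.

Insight: The contributions include: (1) a schema-agnostic user interface built on an Intermediate Semantic Representation (ISR), which lets non-expert users interact purely in natural language; (2) a two-layer hierarchical index combined with Maximal Marginal Relevance (MMR) that optimizes the diversity of cache retrieval and mitigates reasoning homogeneity; (3) bounded heuristic expansion operators with strict complexity guarantees, which significantly improve retrieval recall without risking unbounded API execution. Viewed objectively, the system adapts database caching ideas to the LLM context and provides a system-level optimization framework for KGQA.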

Abstract: The integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) has significantly advanced Knowledge Graph Question Answering (KGQA). However, existing LLM-driven KGQA systems act as stateless planners, generating retrieval plans in isolation without exploiting historical query patterns: analogous to a database system that optimizes every query from scratch without a plan cache. This fundamental design flaw leads to schema hallucinations and limited retrieval coverage. We propose CacheRAG, a systematic cache-augmented architecture for LLM-based KGQA that transforms stateless planners into continual learners. Unlike traditional database plan caching (which optimizes for frequency), CacheRAG introduces three novel design principles tailored for LLM contexts: (1) Schema-agnostic user interface: A two-stage semantic parsing framework via Intermediate Semantic Representation (ISR) enables non-expert users to interact purely in natural language, while a Backend Adapter grounds the LLM with local schema context to compile executable physical queries safely. (2) Diversity-optimized cache retrieval: A two-layer hierarchical index (Domain $\rightarrow$ Aspect) coupled with Maximal Marginal Relevance (MMR) maximizes structural variety in cached examples, effectively mitigating reasoning homogeneity. (3) Bounded heuristic expansion: Deterministic depth and breadth subgraph operators with strict complexity guarantees significantly enhance retrieval recall without risking unbounded API execution. Extensive experiments on multiple benchmarks demonstrate that CacheRAG significantly outperforms state-of-the-art baselines (e.g., +13.2% accuracy and +17.5% truthfulness on the CRAG dataset).
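Maximal Marginal Relevance, which the abstract names as the diversity mechanism for cache retrieval, can be sketched generically as a greedy selection that trades relevance to the query against similarity to items already chosen. The function name, the use of cosine similarity over embedding vectors, and the `lam` default are assumptions for illustration; CacheRAG's two-layer (Domain, Aspect) index and its actual scoring are not reproduced here.

```python
import numpy as np

def mmr_select(query_vec, cand_vecs, k, lam=0.7):
    """Greedy MMR: pick k candidate indices balancing relevance and novelty.

    score(i) = lam * sim(query, i) - (1 - lam) * max_j sim(i, j) over selected j.
    """
    q = np.asarray(query_vec, dtype=float)
    c = np.asarray(cand_vecs, dtype=float)
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)  # unit vectors: dot product = cosine
    relevance = c @ q
    selected, remaining = [], list(range(len(c)))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max((float(c[i] @ c[j]) for j in selected), default=0.0)
            score = lam * relevance[i] - (1.0 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the selection reduces to plain top-k by relevance; lowering `lam` pushes near-duplicate cached examples out in favour of structurally different ones.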

eess.IV

[72] Adaptive Transform Coding for Semantic Compression eess.IV | cs.CV | cs.IT | eess.SP

Andriy Enttsel, Vincent Corlay

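TL;DR: This paper proposes an adaptive transform-coding method for semantic-feature compression. Motivated by the conditional rate-distortion function of a Gaussian mixture model, it selects mode-dependent transforms and quantizers according to the inferred source component, enabling more efficient coding of heterogeneous feature distributions.

Details

Motivation: Visual data compression is shifting from human-centered reconstruction to machine-oriented representation coding, in which an image is mapped to a compact semantic embedding that is compressed and transmitted for downstream inference. Existing methods are inefficient at coding heterogeneous feature distributions.

Result: Evaluated on features extracted from widely used vision backbones and foundation models, the proposed method outperforms or is competitive with state-of-the-art neural compression methods while preserving flexibility and interpretability.

Insight: The novelty is applying the conditional rate-distortion theory of Gaussian mixture models to semantic-feature compression and designing mode-adaptive transform and quantization strategies, which handle heterogeneous feature distributions more efficiently while combining performance with interpretability.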

Abstract: Visual data compression is shifting from human-centered reconstruction to machine-oriented representation coding. In this setting, an image is often mapped to a compact semantic embedding, which is then compressed and transmitted for downstream inference. We propose an adaptive transform-coding method for semantic-feature compression motivated by the conditional rate-distortion function of a Gaussian mixture model. The scheme uses mode-dependent transforms and quantizers selected according to the inferred source component, enabling more efficient coding of heterogeneous feature distributions. Evaluations on features from widely used vision backbones and foundation models show that the proposed method outperforms or is competitive with state-of-the-art neural compression methods while preserving flexibility and interpretability.
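The idea of mode-dependent transform coding can be illustrated with a toy sketch: infer which mixture component a feature vector most likely came from, decorrelate it with that component's own Karhunen-Loeve transform, and quantize the coefficients. Everything below is an assumption for illustration (equal component priors, a single uniform step size instead of per-mode quantizers, hypothetical function names); it is not the authors' scheme.

```python
import numpy as np

def fit_mode_transforms(means, covs):
    """Per-component KLT: the eigenvectors of each component's covariance."""
    transforms = []
    for mu, cov in zip(means, covs):
        _, vecs = np.linalg.eigh(np.asarray(cov, dtype=float))
        transforms.append((np.asarray(mu, dtype=float), vecs))
    return transforms

def select_mode(x, means, covs):
    """Most likely Gaussian component for x, assuming equal priors."""
    scores = []
    for mu, cov in zip(means, covs):
        d = x - np.asarray(mu, dtype=float)
        cov = np.asarray(cov, dtype=float)
        _, logdet = np.linalg.slogdet(cov)
        scores.append(-0.5 * (d @ np.linalg.solve(cov, d) + logdet))
    return int(np.argmax(scores))

def encode(x, means, covs, transforms, step=0.1):
    """Return (mode index, integer coefficients) for one feature vector."""
    x = np.asarray(x, dtype=float)
    k = select_mode(x, means, covs)
    mu, basis = transforms[k]
    coeffs = basis.T @ (x - mu)
    return k, np.round(coeffs / step).astype(int)

def decode(k, q, transforms, step=0.1):
    """Invert the mode-specific transform after dequantization."""
    mu, basis = transforms[k]
    return mu + basis @ (q * step)
```

Because each basis is orthonormal, the reconstruction error equals the quantization error in the transform domain, so it is bounded by `step / 2` per coefficient.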

cs.IR

[73] When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models cs.IR | cs.AI | cs.CL

Dongxin Guo, Jikun Wu, Siu Ming Yiu

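TL;DR: This paper proposes ReaLM-Retrieve, a reasoning-aware retrieval framework that addresses the mismatch between large reasoning models (such as DeepSeek-R1) and existing retrieval-augmented generation (RAG) systems over when evidence is injected during reasoning. With step-level uncertainty detection, an intervention policy that learns when to retrieve, and an efficiency-optimized integration mechanism, it achieves higher answer accuracy with fewer retrieval calls on several multi-hop reasoning benchmarks.

Details

Motivation: Current RAG systems supply context before reasoning begins, whereas large reasoning models need evidence injected during multi-step reasoning chains. This is a fundamental mismatch, which the paper sets out to resolve.

Result: On MuSiQue, HotpotQA, and 2WikiMultiHopQA, ReaLM-Retrieve improves answer F1 by 10.1% absolute on average over standard RAG, reduces retrieval calls by 47% compared with fixed-interval methods such as IRCoT, and reaches 71.2% F1 on MuSiQue, which requires 2-4 hop reasoning, with an average of only 1.8 retrieval calls per question. For retrieval quality, it reaches 81.3% Recall@5 and establishes a new SOTA efficiency-accuracy trade-off.

Insight: The novelty is dynamically aligning retrieval timing with the reasoning process: (1) detecting knowledge gaps at reasoning-step granularity rather than at token or sentence level; (2) learning the best retrieval intervention policy; (3) optimizing the integration mechanism to reduce overhead. This offers a new design pattern for RAG systems on reasoning-intensive tasks.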

Abstract: Large reasoning models such as DeepSeek-R1 and OpenAI o1 generate extended chains of thought spanning thousands of tokens, yet their integration with retrieval-augmented generation (RAG) remains fundamentally misaligned. Current RAG systems optimize for providing context before reasoning begins, while reasoning models require evidence injection during multi-step inference chains. We introduce ReaLM-Retrieve, a reasoning-aware retrieval framework that addresses this mismatch through three key innovations: (1) a step-level uncertainty detector that identifies knowledge gaps at reasoning-step granularity rather than token or sentence level; (2) a retrieval intervention policy that learns when external evidence maximally benefits ongoing reasoning; and (3) an efficiency-optimized integration mechanism that reduces per-retrieval overhead by 3.2x compared to naive integration. Experiments on MuSiQue, HotpotQA, and 2WikiMultiHopQA demonstrate that ReaLM-Retrieve achieves on average 10.1% absolute improvement in answer F1 over standard RAG (range: 9.0-11.8% across the three benchmarks) while reducing retrieval calls by 47% compared to fixed-interval approaches like IRCoT (all improvements significant at p<0.01, paired bootstrap). On the challenging MuSiQue benchmark requiring 2-4 hop reasoning, our method achieves 71.2% F1 with an average of only 1.8 retrieval calls per question. Analysis shows that ReaLM-Retrieve also improves retrieval quality itself, achieving 81.3% Recall@5 with consistently higher precision and MRR than fixed-interval baselines on supporting evidence, establishing new state-of-the-art efficiency-accuracy trade-offs for reasoning-intensive retrieval tasks.

cs.SE

[74] SWE-Edit: Rethinking Code Editing for Efficient SWE-Agent cs.SE | cs.CL

Yikai Zhang, Jiaxin Pei, Kenan Li, Maoquan Wang, Jin Pan

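TL;DR: This paper proposes SWE-Edit, a new code-editing framework for software engineering agents (SWE-Agent). It decouples code editing into specialized Viewer and Editor subagents, addressing the performance degradation that occurs when code inspection, modification planning, and edit execution are coupled in a single context window. The work also studies what makes an effective editing model, training Qwen3-8B to select editing modes adaptively, which improves editing efficiency.

Details

Motivation: Current LLM-based software engineering agents suffer from a fundamental context coupling problem in code editing: the standard editing interface mixes code inspection, modification planning, and edit execution in one context window, so irrelevant information accumulates and agent performance degrades.

Result: On SWE-bench Verified, SWE-Edit improves the resolved rate by 2.1% while reducing inference cost by 17.9%. The adaptive mode selection also yields better editing efficiency than single-format baselines.

Insight: The core novelty is decoupling code editing into a main agent that focuses on reasoning and dedicated subagents (Viewer and Editor) that handle context-intensive operations, which improves context management. In addition, training a model to select editing modes adaptively (rather than relying only on the find-and-replace format) and proposing a code-editing benchmark that predicts downstream agent performance give practical guidance for choosing editing models.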

Abstract: Large language model agents have achieved remarkable progress on software engineering tasks, yet current approaches suffer from a fundamental context coupling problem: the standard code editing interface conflates code inspection, modification planning, and edit execution within a single context window, forcing agents to interleave exploratory viewing with strictly formatted edit generation. This causes irrelevant information to accumulate and degrades agent performance. To address this, we propose SWE-Edit, which decomposes code editing into two specialized subagents: a Viewer that extracts task-relevant code on demand, and an Editor that executes modifications from high-level plans, allowing the main agent to focus on reasoning while delegating context-intensive operations to clean context windows. We further investigate what makes an effective editing model: observing that the prevalent find-and-replace format is error-prone, we train Qwen3-8B with GRPO to adaptively select editing modes, yielding improved editing efficiency over single-format baselines. On SWE-bench Verified, SWE-Edit improves resolved rate by 2.1% while reducing inference cost by 17.9%. We additionally propose a code editing benchmark that reliably predicts downstream agentic performance, providing practical guidance for editing model selection. Our code is publicly available at https://github.com/microsoft/SWE-Edit.

cs.LG

[75] Entropy Centroids as Intrinsic Rewards for Test-Time Scaling cs.LG | cs.AI | cs.CL

Wenshuo Zhao, Qi Zhu, Xingshan Zeng, Fei Mi, Lifeng Shang

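TL;DR: This paper proposes the "entropy centroid", an intrinsic reward for scaling test-time compute of large language models. The method samples multiple responses and selects the one with the lowest entropy centroid; without relying on an external reward model, it improves performance on mathematics, code generation, logical reasoning, and agentic tasks.

Details

Motivation: Existing test-time scaling methods usually rely on an external reward model to select responses, which requires training a strong reward model and adds computational overhead. As an alternative, earlier work explored intrinsic signals such as confidence and entropy, but these are noisy under naive aggregation. The paper observes that high-entropy tokens tend to cluster into consecutive groups during inference, which gives a more stable measure of model uncertainty than individual tokens.

Result: Experiments on mathematics, code generation, logical reasoning, and agentic tasks, with model scales from 14B to 480B, show that the proposed Lowest Centroid method consistently outperforms existing baselines and delivers stable gains as model size increases.

Insight: The main novelty is using the temporal structure of model uncertainty as an intrinsic reward, formalized through the High Entropy Phase (HEP) and the Entropy Centroid. The entropy centroid is the weighted average position of the HEPs along the trajectory; a lower value (early exploration followed by confident generation) tends to correspond to higher response quality, which provides a new, stable signal for selecting responses without an external reward.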

Abstract: An effective way to scale up test-time compute of large language models is to sample multiple responses and then select the best one, as in Grok Heavy and Gemini Deep Think. Existing selection methods often rely on external reward models, which requires training a strong reward model and introduces additional computation overhead. As an alternative, previous approaches have explored intrinsic signals, such as confidence and entropy, but these signals are noisy with naive aggregation. In this work, we observe that high-entropy tokens tend to cluster into consecutive groups during inference, providing a more stable notion of model uncertainty than individual tokens. Together, these clusters reveal temporal patterns of model uncertainty throughout the inference process. Motivated by this observation, we propose to use the temporal structure of uncertainty as an intrinsic reward. To this end, we first formalize the basic unit of segment-level uncertainty as the High Entropy Phase (HEP), a variable-length segment that begins at a high-entropy token and ends when consecutive low-entropy tokens appear. We then define the Entropy Centroid, inspired by the concept of the center of mass in physics, as the weighted average position of all HEPs along the trajectory. Intuitively, a lower centroid indicates early exploration followed by confident generation, which we find often corresponds to higher response quality. Based on this insight, we propose the Lowest Centroid method, which selects the response with the lowest entropy centroid among multiple candidates. Experiments on mathematics, code generation, logical reasoning, and agentic tasks, across model scales ranging from 14B to 480B, show that Lowest Centroid consistently outperforms existing baselines and delivers stable gains as model size increases. Code is available at https://github.com/hkust-nlp/entropy-centroid.
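The selection rule described in the abstract can be sketched from per-token entropies alone. The details below are assumptions, since the abstract does not fix them: the entropy `threshold`, the number of consecutive low-entropy tokens (`patience`) that closes a phase, weighting each phase by its summed entropy, and using the normalized phase midpoint as its position. The authors' repository should be consulted for the actual definitions.

```python
def high_entropy_phases(entropies, threshold=1.0, patience=2):
    """Split a trajectory into High Entropy Phases as (start, end) index pairs.

    A phase opens at a token with entropy > threshold and closes once
    `patience` consecutive low-entropy tokens follow (assumed rule).
    """
    phases, start, last_high, low_run = [], None, None, 0
    for i, h in enumerate(entropies):
        if h > threshold:
            if start is None:
                start = i
            last_high, low_run = i, 0
        elif start is not None:
            low_run += 1
            if low_run >= patience:
                phases.append((start, last_high))
                start, low_run = None, 0
    if start is not None:
        phases.append((start, last_high))
    return phases

def entropy_centroid(entropies, threshold=1.0, patience=2):
    """Entropy-weighted average of normalized phase positions, in [0, 1]."""
    phases = high_entropy_phases(entropies, threshold, patience)
    if not phases:
        return 0.0  # assumption: no uncertain phase counts as the lowest centroid
    span = max(len(entropies) - 1, 1)
    num = den = 0.0
    for s, e in phases:
        weight = sum(entropies[s:e + 1])
        num += weight * ((s + e) / 2) / span
        den += weight
    return num / den

def lowest_centroid(candidates, **kwargs):
    """Index of the candidate response with the lowest entropy centroid."""
    return min(range(len(candidates)), key=lambda i: entropy_centroid(candidates[i], **kwargs))
```

A response that is uncertain early and confident afterwards gets a centroid near 0, while one that becomes uncertain near the end gets a centroid near 1, so `lowest_centroid` prefers the former.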

[76] Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control cs.LG | cs.CL | stat.ML

Bolian Li, Yifan Wang, Yi Ding, Anamika Lochab, Ananth Grama

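TL;DR: This paper proposes Entrocraft, a simple rejection-sampling method that realizes a user-customized entropy schedule by biasing the advantage distribution, addressing the performance saturation common in reinforcement learning for large language models. The method needs no objective regularization and is agnostic to the advantage estimator. Theoretically, it relates per-step entropy change to the advantage distribution, and experiments find that a linear annealing entropy schedule works best.

Details

Motivation: Existing reinforcement learning algorithms tend to saturate when training large language models, typically because of entropy collapse. Existing methods try to prevent the collapse through regularization or clipping, but their entropy curves are unstable in the long term, which hinders performance gains.

Result: In experiments, Entrocraft significantly improves generalization, output diversity, and long-term training. It enables a 4B model to outperform an 8B baseline, sustains improvement for up to 4x longer before plateauing, and raises pass@K by 50% over the baseline.

Insight: The novelty is Entrocraft, a simple and general method for controlling the entropy curve: a customized entropy schedule stabilizes training and avoids the long-term instability of existing methods. The paper also studies entropy schedules systematically and finds linear annealing to be best, offering a new perspective on exploration in reinforcement learning.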

Abstract: Reinforcement learning (RL) has unlocked complex reasoning abilities in large language models (LLMs). However, most RL algorithms suffer from performance saturation, preventing further gains as RL training scales. This problem can be characterized by the collapse of entropy, a key diagnostic for exploration in RL. Existing attempts have tried to prevent entropy collapse through regularization or clipping, but their resulting entropy curves often exhibit instability in the long term, which hinders performance gains. In this paper, we introduce Entrocraft, a simple rejection-sampling approach that realizes any user-customized entropy schedule by biasing the advantage distributions. Entrocraft requires no objective regularization and is advantage-estimator-agnostic. Theoretically, we relate per-step entropy change to the advantage distribution under minimal assumptions, which explains the behavior of existing RL and entropy-preserving methods. Entrocraft also enables a systematic study of entropy schedules, where we find that linear annealing, which starts high and decays to a slightly lower target, performs best. Empirically, Entrocraft addresses performance saturation, significantly improving generalization, output diversity, and long-term training. It enables a 4B model to outperform an 8B baseline, sustains improvement for up to 4x longer before plateauing, and raises pass@K by 50% over the baseline.
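The linear annealing schedule mentioned in the abstract ("starts high and decays to a slightly lower target") can be written as a one-line target function. The function name, arguments, and any concrete entropy values are hypothetical; the abstract gives no numbers, and the rejection-sampling step that steers training toward this target is not sketched here because its details are not stated.

```python
def linear_entropy_target(step, total_steps, start, end):
    """Target policy entropy at a given training step under linear annealing.

    Interpolates from `start` at step 0 to `end` at `total_steps`,
    then holds `end` for any later step.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)
```

A training loop would compare the measured policy entropy against this target at each step and bias which samples are kept accordingly, which is the role the abstract assigns to Entrocraft.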

[77] Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding cs.LG | cs.CL

Hayate Iso, Tiyasa Mitra, Sudipta Mondal, Rasoul Shafipour, Venmugil Elango

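TL;DR: This paper studies speculative decoding as a lossless acceleration primitive for autoregressive rollout generation in reinforcement learning (RL) post-training. Speculative decoding is integrated into the NeMo-RL framework with support for synchronous and asynchronous pipelines; it improves rollout throughput by 1.8x on a reasoning post-training workload, and simulation projects up to 2.5x end-to-end training speedup under asynchronous RL.

Details

Motivation: RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, making rollout acceleration a central systems challenge. Existing methods typically raise throughput by changing the rollout or optimization regime (for example, off-policy execution, replay, or lower-precision generation), whereas this paper explores a lossless acceleration method that preserves the target model's output distribution.

Result: On a synchronous RL reasoning post-training workload at 8B scale, speculative decoding improves rollout throughput by 1.8x. A high-fidelity performance simulator projects up to 2.5x end-to-end training speedup at 235B scale when combined with asynchronous RL.

Insight: The novelty is integrating speculative decoding, a technique traditionally applied after the RL phase, systematically into the RL training loop as a lossless primitive for accelerating rollouts. This provides a practical path for deploying state-of-the-art speculative decoding (such as pretrained MTP heads, small external draft models, or techniques like Eagle3) inside RL training, with support for both synchronous and asynchronous pipelines.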

Abstract: RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, making rollout acceleration a central systems challenge. Many existing efficiency methods improve throughput by changing the rollout or optimization regime, for example, through off-policy execution, replay, or lower-precision generation. We study speculative decoding as a lossless acceleration primitive for RL rollouts that preserves the target model’s output distribution. We implement speculative decoding in NeMo-RL with a vLLM backend, supporting both synchronous and asynchronous pipelines and enabling speculation during RL rollouts. This benefit is realizable across speculation mechanisms, such as pretrained MTP heads, small external draft models or even techniques such as Eagle3, which are traditionally applied after the RL phase. This yields a deployment path for state-of-the-art speculative decoding inside RL training. In a reasoning post-training workload at 8B scale under synchronous RL, speculative decoding improves rollout throughput by 1.8x. Using a high-fidelity performance simulator, we project that combining speculative decoding with asynchronous RL yields up to 2.5x end-to-end training speedup at 235B scale.
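The claim that speculative decoding is lossless rests on its standard accept/reject rule, which can be sketched for a single token. This is the generic textbook rule, not the NeMo-RL or vLLM implementation; the function name and the toy distributions in the usage note are assumptions for illustration.

```python
import numpy as np

def speculative_step(p_target, q_draft, draft_token, rng):
    """Verify one token proposed by a draft model.

    Accept the draft token with probability min(1, p/q); on rejection,
    resample from the normalized residual max(p - q, 0). The returned token
    is then distributed exactly according to p_target.
    """
    p = np.asarray(p_target, dtype=float)
    q = np.asarray(q_draft, dtype=float)
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()  # positive whenever a rejection is possible
    return int(rng.choice(len(p), p=residual)), False
```

For example, with `p = [0.7, 0.2, 0.1]` and `q = [0.2, 0.5, 0.3]`, drawing the draft token from `q` and passing it through `speculative_step` yields tokens whose empirical frequencies approach `p`, which is why the rollout distribution seen by the RL optimizer is unchanged.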

[78] Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models cs.LG | cs.AI | cs.CV | cs.DC | cs.NI

Cyril Shih-Huan Hsu, Wig Yuan-Cheng Cheng, Chrysa Papagianni

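TL;DR: This paper proposes a progressive semantic communication framework for edge-cloud vision-language model (VLM) inference. A Meta AutoEncoder compresses visual tokens into adaptive, progressively refinable representations, giving a controllable trade-off between communication cost and semantic fidelity in bandwidth-constrained networks and allowing plug-and-play deployment with existing VLMs without additional fine-tuning.

Details

Motivation: Deploying VLMs on resource-constrained edge devices is hard because of their large compute and memory demands, while fully offloading inference to the cloud incurs high latency in bandwidth-limited environments because raw visual data must be transmitted. Existing edge-cloud collaborative architectures transmit fixed-size representations, lack adaptability to dynamic network conditions, and do not fully exploit semantic redundancy.

Result: On an end-to-end edge-cloud system consisting of an embedded NXP i.MX95 platform and a GPU server, experiments at 1 Mbps uplink show that the proposed progressive scheme significantly reduces network latency compared with full-edge and full-cloud solutions while maintaining high semantic consistency even under high compression.

Insight: The novelty is using a Meta AutoEncoder to generate adaptive, progressively refinable visual representations, which gives a flexible trade-off between communication cost and semantic fidelity and supports plug-and-play integration with off-the-shelf VLMs. Viewed objectively, the core contribution is a semantic compression mechanism that adapts to network conditions and exploits the semantic redundancy of visual data, providing a scalable solution for edge-cloud collaborative inference.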

Abstract: Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource-constrained embedded platforms. Conversely, fully offloading inference to the cloud is often impractical in bandwidth-limited environments, where transmitting raw visual data introduces substantial latency overhead. While recent edge-cloud collaborative architectures attempt to partition VLM workloads across devices, they typically rely on transmitting fixed-size representations, lacking adaptability to dynamic network conditions and failing to fully exploit semantic redundancy. In this paper, we propose a progressive semantic communication framework for edge-cloud VLM inference, using a Meta AutoEncoder that compresses visual tokens into adaptive, progressively refinable representations, enabling plug-and-play deployment with off-the-shelf VLMs without additional fine-tuning. This design allows flexible transmission at different information levels, providing a controllable trade-off between communication cost and semantic fidelity. We implement a full end-to-end edge-cloud system comprising an embedded NXP i.MX95 platform and a GPU server, communicating over bandwidth-constrained networks. Experimental results show that, at 1 Mbps uplink, the proposed progressive scheme significantly reduces network latency compared to full-edge and full-cloud solutions, while maintaining high semantic consistency even under high compression. The implementation code will be released upon publication at https://github.com/open-ep/ProSemComVLM.

cs.AI

[79] Grounding vs. Compositionality: On the Non-Complementarity of Reasoning in Neuro-Symbolic Systems cs.AI | cs.CV | cs.LG | cs.LO

Mahnoor Shahid, Hannes Rothe

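TL;DR: By introducing the Iterative Logic Tensor Network (iLTN), a fully differentiable architecture, this paper presents what the authors describe as the first systematic empirical analysis of the relationship between symbol grounding and compositional reasoning in neuro-symbolic systems, challenging the assumption that successful symbol grounding automatically yields compositional reasoning. The study shows that a model trained only on a grounding objective fails to generalize, whereas the iLTN trained jointly on perceptual grounding and multi-step reasoning achieves high accuracy on multiple zero-shot tasks.

Details

Motivation: Modern neural networks have a fundamental weakness in compositional generalization, which limits their robustness and applicability in domains requiring out-of-distribution reasoning. A central but unverified assumption in neuro-symbolic AI is that compositional reasoning emerges naturally as a byproduct of successful symbol grounding. The paper challenges this assumption by disentangling the contributions of grounding and reasoning.

Result: On generalization tasks probing novel entities, unseen relations, and complex rule compositions, the model trained only on grounding fails to generalize, while the full iLTN trained jointly on grounding and multi-step reasoning achieves high zero-shot accuracy on all tasks.

Insight: The core novelty is the explicit demonstration that symbol grounding, while necessary, is insufficient for compositional generalization: reasoning is not an emergent property but a distinct capability that requires an explicit learning objective. This challenges a foundational assumption of neuro-symbolic AI, and the paper offers an operational, fully differentiable multi-step reasoning architecture (iLTN) for learning both capabilities jointly.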

Abstract: Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requiring out-of-distribution reasoning. A central, yet unverified, assumption in neuro-symbolic AI is that compositional reasoning will emerge as a byproduct of successful symbol grounding. This work presents the first systematic empirical analysis to challenge this assumption by disentangling the contributions of grounding and reasoning. To operationalize this investigation, we introduce the Iterative Logic Tensor Network ($i$LTN), a fully differentiable architecture designed for multi-step deduction. Using a formal taxonomy of generalization – probing for novel entities, unseen relations, and complex rule compositions – we demonstrate that a model trained solely on a grounding objective fails to generalize. In contrast, our full $i$LTN, trained jointly on perceptual grounding and multi-step reasoning, achieves high zero-shot accuracy across all tasks. Our findings provide conclusive evidence that symbol grounding, while necessary, is insufficient for generalization, establishing that reasoning is not an emergent property but a distinct capability that requires an explicit learning objective.

cs.CR

[80] LATTICE: Evaluating Decision Support Utility of Crypto Agents cs.CR | cs.AI | cs.CL

Aaron Chan, Tengfei Li, Tianyi Xiao, Angela Chen, Junyi Du

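TL;DR: The paper proposes LATTICE, a benchmark for evaluating the decision support utility of crypto agents in realistic user-facing scenarios. It defines six evaluation dimensions, proposes 16 task types, and scores agent outputs automatically with LLM judges, filling the gap left by existing benchmarks that focus on reasoning-based or outcome-based evaluation and overlook an agent's ability to assist user decision-making.

Details

Motivation: Existing crypto agent benchmarks focus mainly on reasoning-based or outcome-based evaluation and do not assess an agent's ability to assist user decision-making. LATTICE aims to fill this gap.

Result: Using LATTICE, the paper evaluates six real-world crypto copilot products on 1,200 diverse queries. Most of the tested copilots achieve comparable aggregate scores but differ more significantly at the dimension and task levels.

Insight: The novelty lies in a scalable, auditable LLM-judge evaluation framework. The work highlights the importance of orchestration and UI/UX design for agent quality and reveals meaningful trade-offs in decision support quality: users with different priorities may need different copilots rather than relying on aggregate rankings alone.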

Abstract: We introduce LATTICE, a benchmark for evaluating the decision support utility of crypto agents in realistic user-facing scenarios. Prior crypto agent benchmarks mainly focus on reasoning-based or outcome-based evaluation, but do not assess agents’ ability to assist user decision-making. LATTICE addresses this gap by: (1) defining six evaluation dimensions that capture key decision support properties; (2) proposing 16 task types that span the end-to-end crypto copilot workflow; and (3) using LLM judges to automatically score agent outputs based on these dimensions and tasks. Crucially, the dimensions and tasks are designed to be evaluable at scale using LLM judges, without relying on ground truth from expert annotators or external data sources. In lieu of these dependencies, LATTICE’s LLM judge rubrics can be continually audited and updated given new dimensions, tasks, criteria, and human feedback, thus promoting reliable and extensible evaluation. While other benchmarks often compare foundation models sharing a generic agent framework, we use LATTICE to assess production-level agents used in actual crypto copilot products, reflecting the importance of orchestration and UI/UX design in determining agent quality. In this paper, we evaluate six real-world crypto copilots on 1,200 diverse queries and report breakdowns across dimensions, tasks, and query categories. Our experiments show that most of the tested copilots achieve comparable aggregate scores, but differ more significantly on dimension-level and task-level performance. This pattern suggests meaningful trade-offs in decision support quality: users with different priorities may be better served by different copilots than the aggregate rankings alone would indicate. To support reproducible research, we open-source all LATTICE code and data used in this paper.
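
The observation that comparable aggregate scores can hide dimension-level differences is easy to see with a toy calculation. The sketch below is illustrative only: the copilot names, the dimension names `dim_1` to `dim_3`, the scores, and the weighting scheme are all hypothetical and are not taken from LATTICE, which defines six dimensions that the abstract does not enumerate.

```python
from statistics import mean

# Hypothetical judge scores per dimension (not data from the paper).
scores = {
    "copilot_a": {"dim_1": 9.0, "dim_2": 5.0, "dim_3": 7.0},
    "copilot_b": {"dim_1": 6.0, "dim_2": 8.0, "dim_3": 7.0},
}


def aggregate(dim_scores):
    """Unweighted mean over dimensions, i.e. a simple aggregate score."""
    return mean(dim_scores.values())


def best_for(all_scores, weights):
    """Return the copilot with the highest weighted mean under user priorities."""
    total_weight = sum(weights.values())

    def weighted(dim_scores):
        return sum(weights[d] * s for d, s in dim_scores.items()) / total_weight

    return max(all_scores, key=lambda name: weighted(all_scores[name]))


# Both copilots tie on the aggregate score (7.0 each) ...
tie = aggregate(scores["copilot_a"]) == aggregate(scores["copilot_b"])

# ... yet users with different priorities are served best by different copilots.
pick_1 = best_for(scores, {"dim_1": 3, "dim_2": 1, "dim_3": 1})  # "copilot_a"
pick_2 = best_for(scores, {"dim_1": 1, "dim_2": 3, "dim_3": 1})  # "copilot_b"
```

The weighted mean is just one simple way to model user priorities; the point is that an aggregate ranking alone would not distinguish the two systems.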

cs.RO

[81] Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising cs.RO | cs.AI | cs.CV

Jun Guo, Qiwei Li, Peiyan Li, Zilong Chen, Nan Sun

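TL;DR: This paper proposes X-WAM, a unified 4D world model that integrates real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework. The method leverages the visual priors of pretrained video diffusion models, imagines the future world by predicting multi-view RGB-D videos, and obtains spatial information efficiently through a lightweight structural adaptation. It also proposes Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency, using an asynchronous denoising schedule at inference for efficient real-time execution while preserving high-fidelity video generation.

Details

Motivation: To address the limitations of existing unified world models (e.g., UWM), which model only 2D pixel space and fail to balance action efficiency with world modeling quality, with the goal of unifying real-time robotic action execution and high-fidelity 4D world synthesis.

Result: X-WAM achieves average success rates of 79.2% and 90.7% on the RoboCasa and RoboTwin 2.0 benchmarks, respectively, while surpassing existing methods on visual and geometric metrics and delivering high-fidelity 4D reconstruction and generation.

Insight: The contributions include: (1) a lightweight structural adaptation that builds a dedicated depth prediction branch by replicating the final few blocks of the pretrained Diffusion Transformer; (2) Asynchronous Noise Sampling (ANS), which uses an asynchronous denoising schedule at inference to balance efficiency and quality, and samples from the joint distribution of timesteps during training to align with the inference distribution.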

Abstract: We propose X-WAM, a Unified 4D World Model that unifies real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework, addressing the critical limitations of prior unified world models (e.g., UWM) that only model 2D pixel-space and fail to balance action efficiency and world modeling quality. To leverage the strong visual priors of pretrained video diffusion models, X-WAM imagines the future world by predicting multi-view RGB-D videos, and obtains spatial information efficiently through a lightweight structural adaptation: replicating the final few blocks of the pretrained Diffusion Transformer into a dedicated depth prediction branch for the reconstruction of future spatial information. Moreover, we propose Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency. ANS applies a specialized asynchronous denoising schedule during inference, which rapidly decodes actions with fewer steps to enable efficient real-time execution, while dedicating the full sequence of steps to generate high-fidelity video. Rather than entirely decoupling the timesteps during training, ANS samples from their joint distribution to align with the inference distribution. Pretrained on over 5,800 hours of robotic data, X-WAM achieves 79.2% and 90.7% average success rate on RoboCasa and RoboTwin 2.0 benchmarks, while producing high-fidelity 4D reconstruction and generation surpassing existing methods in both visual and geometric metrics.
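
The idea of an asynchronous denoising schedule, where actions finish denoising in fewer steps than video, can be sketched with two noise levels that advance at different rates inside one loop. This is an illustration under stated assumptions rather than X-WAM's actual schedule: the function `async_schedule`, the linear noise levels, and the step counts are hypothetical, since the abstract does not specify them.

```python
def async_schedule(video_steps, action_steps):
    """Return a list of (step, t_video, t_action) noise levels in [0, 1].

    Both modalities start fully noisy (1.0). The action noise level reaches
    0.0 after `action_steps` steps, while the video noise level needs the
    full `video_steps`. Linear spacing is an assumption for illustration.
    """
    if not 0 < action_steps <= video_steps:
        raise ValueError("need 0 < action_steps <= video_steps")
    schedule = []
    for k in range(video_steps + 1):
        t_video = 1.0 - k / video_steps
        t_action = max(0.0, 1.0 - k / action_steps)
        schedule.append((k, t_video, t_action))
    return schedule


sched = async_schedule(video_steps=8, action_steps=2)

# First step at which the action is fully denoised and could be executed.
first_clean_action = next(k for k, _, t_a in sched if t_a == 0.0)  # 2
```

In this toy schedule the action is ready after 2 of 8 steps, while the video is still at noise level 0.75 and continues to be refined, which mirrors the stated goal of fast action decoding alongside full-quality video generation.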

cs.HC

[82] Beyond Screenshots: Evaluating VLMs’ Understanding of UI Animations cs.HC | cs.CL

Chen Liang, Xirui Jiang, Naihao Deng, Eytan Adar, Anhong Guo

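TL;DR: Current research on vision-language models (VLMs) for user interface understanding focuses mainly on static screenshots and ignores dynamic animations. This paper introduces the AniMINT dataset and systematically evaluates how well VLMs understand UI animations, finding that they are reliable at basic motion perception but fall short on high-level semantic interpretation.

Details

Motivation: To address the research gap in how well existing VLMs understand dynamic UI animations. Animations are a core functional modality through which modern interfaces communicate state and feedback, so understanding them is essential for AI agents to operate reliably.

Result: Evaluated on the authors' AniMINT dataset (300 densely annotated UI animation videos), VLMs reliably detect primitive motion, but their high-level animation interpretation is inconsistent and shows substantial gaps relative to human performance.

Insight: The novelty lies in building the first dataset dedicated to UI animation understanding and in using an analysis of motion, context, and perceptual cues to reveal key bottlenecks affecting VLM performance, pointing to directions for future improvement.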

Abstract: AI agents operating on user interfaces must understand how interfaces communicate state and feedback to act reliably. As a core communicative modality, animations are increasingly used in modern interfaces, serving critical functional purposes beyond mere aesthetics. Thus, understanding UI animation is essential for comprehensive interface interpretation. However, recent studies of Vision Language Models (VLMs) for UI understanding have focused primarily on static screenshots, leaving it unclear how well these models handle dynamic UI animations. To address this gap, we created AniMINT, a novel dataset of 300 densely annotated UI animation videos. We systematically evaluate state-of-the-art VLMs on UI animation understanding, including their abilities to perceive the animation effects, identify animation purposes, and interpret animation meaning. Our results show that VLMs can reliably detect primitive motion. However, their high-level animation interpretation remains inconsistent, with substantial gaps relative to human performance. Finally, we use Motion, Context, and Perceptual Cues (MCPC) to probe factors affecting VLM performance, revealing key bottlenecks and directions for future improvement.

Table of Contents

cs.CL

[1] Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing cs.CL

[2] MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese cs.CL | cs.IR

[3] CogRAG+: Cognitive-Level Guided Diagnosis and Remediation of Memory and Reasoning Deficiencies in Professional Exam QA cs.CL

[4] LLMs Generate Kitsch cs.CL

[5] Anchored Confabulation: Partial Evidence Non-Monotonically Amplifies Confident Hallucination in LLMs cs.CL

[6] A Systematic Comparison of Prompting and Multi-Agent Methods for LLM-based Stance Detection cs.CL

[7] Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens cs.CL

[8] Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI cs.CL | cs.AI | cs.IR

[9] Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation cs.CL | cs.AI

[10] Multimodal LLMs are not all you need for Pediatric Speech Language Pathology cs.CL

[11] SAGE: A Strategy-Aware Graph-Enhanced Generation Framework For Online Counseling cs.CL

[12] From Black-Box Confidence to Measurable Trust in Clinical AI: A Framework for Evidence, Supervision, and Staged Autonomy cs.CL | cs.AI | cs.CY

[13] HalluCiteChecker: A Lightweight Toolkit for Hallucinated Citation Detection and Verification in the Era of AI Scientists cs.CL | cs.AI | cs.DL

[14] ClawGym: A Scalable Framework for Building Effective Claw Agents cs.CL | cs.AI | cs.LG

[15] Select to Think: Unlocking SLM Potential with Local Sufficiency cs.CL

cs.CV

[16] Report of the 5th PVUW Challenge: Towards More Diverse Modalities in Pixel-Level Understanding cs.CV

[17] RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments cs.CV

[18] FruitProM-V2: Robust Probabilistic Maturity Estimation and Detection of Fruits and Vegetables cs.CV | cs.AI | cs.RO

[19] Sample Selection Using Multi-Task Autoencoders in Federated Learning with Non-IID Data cs.CV | cs.LG

[20] Privacy-Preserving Clothing Classification using Vision Transformer for Thermal Comfort Estimation cs.CV | cs.CR

[21] FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing cs.CV | cs.HC | cs.IR | cs.MM

[22] ViBE: Visual-to-M/EEG Brain Encoding via Spatio-Temporal VAE and Distribution-Aligned Projection cs.CV

[23] HOI-aware Adaptive Network for Weakly-supervised Action Segmentation cs.CV

[24] DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation cs.CV | cs.AI

[25] EnerGS: Energy-Based Gaussian Splatting with Partial Geometric Priors cs.CV

[26] Camera-RFID Fusion for Robust Asset Tracking in Forested Environments cs.CV

[27] MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution cs.CV | cs.AI

[28] Beyond Shortcuts: Mitigating Visual Illusions in Frozen VLMs via Qualitative Reasoning cs.CV

[29] OmniTrend: Content-Context Modeling for Scalable Social Popularity Prediction cs.CV

[30] GaitKD: A Universal Decoupled Distillation Framework for Efficient Gait Recognition cs.CV

[31] Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding cs.CV

[32] Semantic Foam: Unifying Spatial and Semantic Scene Decomposition cs.CV

[33] MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution cs.CV | cs.AI

[34] CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation cs.CV | cs.AI

[35] Motion-Driven Multi-Object Tracking of Model Organisms in Space Science Experiments cs.CV

[36] SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness cs.CV

[37] Which Face and Whose Identity? Solving the Dual Challenge of Deepfake Proactive Forensics in Multi-Face Scenarios cs.CV

[38] ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance cs.CV | cs.AI

[39] CO-EVO: Co-evolving Semantic Anchoring and Style Diversification for Federated DG-ReID cs.CV | cs.LG

[40] Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning cs.CV | cs.LG | math.AT

[41] A Multimodal Pre-trained Network for Integrated EEG-Video Seizure Detection cs.CV

[42] Decoupled Prototype Matching with Vision Foundation Models for Few-Shot Industrial Object Detection cs.CV

[43] Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection cs.CV

[44] Delineating Knowledge Boundaries for Honest Large Vision-Language Models cs.CV | cs.AI

[45] Are Data Augmentation and Segmentation Always Necessary? Insights from COVID-19 X-Rays and a Methodology Thereof cs.CV

[46] Attribution-Guided Multimodal Deepfake Detection via Cross-Modal Forensic Fingerprints cs.CV

[47] Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation cs.CV

[48] $\text{PKS}^4$: Parallel Kinematic Selective State Space Scanners for Efficient Video Understanding cs.CV

[49] A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows cs.CV

[50] Cross-Domain Transfer of Hyperspectral Foundation Models cs.CV

[51] Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners cs.CV | cs.LG

[52] Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models cs.CV

[53] DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation cs.CV

[54] AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision cs.CV

[55] Star-Fusion: A Multi-modal Transformer Architecture for Discrete Celestial Orientation via Spherical Topology cs.CV | cs.AI

[56] State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading cs.CV

[57] SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection cs.CV | cs.AI

[58] CurEvo: Curriculum-Guided Self-Evolution for Video Understanding cs.CV | cs.LG

[59] Learning Sparse BRDF Measurement Samples from Image cs.CV | cs.GR

[60] GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents cs.CV

[61] TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection cs.CV

[62] MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification cs.CV | cs.AI

[63] Virtual-reality based patient-specific simulation of spine surgical procedures: A fast, highly automated and high-fidelity system for surgical education and planning cs.CV

[64] Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization cs.CV

[65] Uncertainty-Aware Pedestrian Attribute Recognition via Evidential Deep Learning cs.CV

[66] Graph-based Semantic Calibration Network for Unaligned UAV RGBT Image Semantic Segmentation and A Large-scale Benchmark cs.CV

[67] Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction cs.CV

[68] World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning cs.CV

[69] ProcFunc: Function-Oriented Abstractions for Procedural 3D Generation in Python cs.CV

[70] Three-Step Nav: A Hierarchical Global-Local Planner for Zero-Shot Vision-and-Language Navigation cs.CV | cs.RO

cs.DB

[71] CacheRAG: A Semantic Caching System for Retrieval-Augmented Generation in Knowledge Graph Question Answering cs.DB | cs.CL

eess.IV

[72] Adaptive Transform Coding for Semantic Compression eess.IV | cs.CV | cs.IT | eess.SP

cs.IR

[73] When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models cs.IR | cs.AI | cs.CL

cs.SE