Table of Contents
- cs.CL [Total: 28]
- cs.CV [Total: 80]
- cs.IR [Total: 1]
- cs.AI [Total: 2]
- cs.CR [Total: 1]
- eess.IV [Total: 2]
- cs.RO [Total: 3]
- physics.optics [Total: 1]
- cs.LG [Total: 4]
cs.CL
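[1] Visuospatial Perspective Taking in Multimodal Language Models cs.CL | cs.AI
Jonathan Prunty, Seraphina Zhang, Patrick Quinn, Jianxun Lian, Xing Xie
TL;DR: This paper evaluates the visuospatial perspective-taking (VPT) ability of multimodal language models by adapting two tasks from human studies, the Director Task and the Rotating Figure Task. It finds that current models show pronounced deficits in Level 2 VPT, which requires inhibiting one's own perspective to adopt another's.
Details
Motivation: As multimodal language models are increasingly deployed in social and collaborative settings, evaluating their perspective-taking ability becomes crucial. Existing benchmarks rely mainly on text or static scene understanding, so visuospatial perspective-taking remains underexplored.
Result: Experiments show that multimodal language models perform poorly on Level 2 VPT tasks, exposing key limitations in representing and reasoning about alternative perspectives.
Insight: The paper turns human psychology tasks (the Director Task and the Rotating Figure Task) into a benchmark for model perspective-taking, highlights models' weakness at switching perspectives in collaborative settings, and points to a direction for future model design.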
Abstract: As multimodal language models (MLMs) are increasingly used in social and collaborative settings, it is crucial to evaluate their perspective-taking abilities. Existing benchmarks largely rely on text-based vignettes or static scene understanding, leaving visuospatial perspective-taking (VPT) underexplored. We adapt two evaluation tasks from human studies: the Director Task, assessing VPT in a referential communication paradigm, and the Rotating Figure Task, probing perspective-taking across angular disparities. Across tasks, MLMs show pronounced deficits in Level 2 VPT, which requires inhibiting one’s own perspective to adopt another’s. These results expose critical limitations in current MLMs’ ability to represent and reason about alternative perspectives, with implications for their use in collaborative contexts.
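[2] DISCO: Document Intelligence Suite for COmparative Evaluation cs.CL | cs.AI | cs.CV
Kenza Benkirane, Dan Goldwater, Martin Asenov, Aneiss Ghodsi
TL;DR: DISCO is a suite for comparative evaluation of document intelligence. It evaluates OCR pipelines and vision-language models separately on parsing and question answering across diverse document types, including handwritten text, multilingual scripts, medical forms, infographics, and long documents. Performance varies substantially with task and document characteristics: OCR is more reliable on handwriting and long documents, VLMs do better on multilingual text and visually rich layouts, and task-aware prompting has mixed effects.
Details
Motivation: Document intelligence needs both accurate text extraction and reliable reasoning over content. By comparing how different approaches perform on different document types, the paper aims to give a basis for choosing a document processing strategy.
Result: Across diverse document types, OCR pipelines are more reliable on handwriting and long documents, while VLMs perform better on multilingual text and visually rich layouts. Task-aware prompting helps on some document types and hurts on others. The findings give empirical guidance based on document structure and reasoning demands.
Insight: The contribution is a comprehensive comparative evaluation suite that stresses choosing an approach according to document complexity. It shows that OCR and VLMs are complementary across scenarios and that the effect of prompt engineering depends on document type, which supports data-driven strategy selection in practice.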
Abstract: Document intelligence requires accurate text extraction and reliable reasoning over document content. We introduce DISCO, a Document Intelligence Suite for COmparative Evaluation, that evaluates optical character recognition (OCR) pipelines and vision-language models (VLMs) separately on parsing and question answering across diverse document types, including handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents. Our evaluation shows that performance varies substantially across tasks and document characteristics, underscoring the need for complexity-aware approach selection. OCR pipelines are generally more reliable for handwriting and for long or multi-page documents, where explicit text grounding supports text-heavy reasoning, while VLMs perform better on multilingual text and visually rich layouts. Task-aware prompting yields mixed effects, improving performance on some document types while degrading it on others. These findings provide empirical guidance for selecting document processing strategies based on document structure and reasoning demands.
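[3] Training a Large Language Model for Medical Coding Using Privacy-Preserving Synthetic Clinical Data cs.CL | cs.AI
John Cook, Michael Wyatt, Peng Wei, Iris Chin, Santosh Gupta
TL;DR: This paper fine-tunes Llama 3-70B on privacy-preserving synthetic clinical data derived from electronic health records to automate medical coding (ICD-10-CM and CPT code assignment). After fine-tuning, exact-code-match F1 rises from a zero-shot baseline of 0.18 to above 0.70, performance holds up on complex clinical categories, and medical comprehension is retained.
Details
Motivation: Automated medical coding (such as ICD-10-CM and CPT code assignment) faces heterogeneous records, nuanced coding guidelines, and long-tail distributions, and existing foundation models perform poorly zero-shot. The paper asks whether an open-weight large model can be fine-tuned on privacy-preserving synthetic clinical data to reach expert-level medical coding without exposing protected health information.
Result: On exact code match, the zero-shot baseline without fine-tuning scores an F1 of 0.18. After fine-tuning Llama 3-70B on synthetic data, F1 exceeds 0.70, a large absolute gain on both ICD-10-CM and CPT. The model stays strong on complex categories that require multi-step clinical reasoning and code composition (such as the Advanced Illness and Frailty classes) and shows no degradation on medical comprehension tasks.
Insight: The contribution is to use privacy-preserving synthetic data, generated from EHR-grounded templates and coding policies, to efficiently adapt a general-purpose large language model to specialized medical coding. This offers a practical path for training task-specific coding agents safely and iteratively without exposing real patient data, and shows that synthetic data can teach a model complex, policy-sensitive domain knowledge.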
Abstract: Improving the accuracy and reliability of medical coding reduces clinician burnout and supports revenue cycle processes, freeing providers to focus more on patient care. However, automating the assignment of ICD-10-CM and CPT codes from clinical documentation remains a challenge due to heterogeneous records, nuanced coding guidelines, and long-tail distributions. Large language models have been proposed to help or automate specific medical coding tasks. However, foundation models are not explicitly trained for medical coding and zero-shot coding has yielded poor results. We investigate whether a modern open-weight foundation model can be adapted for an expert-level medical coding task using privacy-preserving synthetic training data derived from electronic health records. We fine-tune Llama 3-70B on pairs of clinical notes and gold codes generated from EHR-grounded templates and coding policies, then evaluate exact-code prediction for ICD-10-CM and CPT. A zero-shot baseline with the unadapted model achieved an F1 score of 0.18 for exact code match. After fine-tuning on the synthetic corpus, exact-match F1 exceeded 0.70, representing a large absolute gain across both code systems. Notably, performance remained high on complex categories that often require multi-step clinical reasoning and code composition, including Advanced Illness and Frailty classes, and the model retained its performance on medical comprehension tasks. These results indicate that synthetic, policy-aware data can efficiently teach a general-purpose large language model to support precise medical coding without exposing protected health information. The approach offers a practical path for training coding agents safely and iteratively on specific tasks that represent real-world populations.
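The exact-code-match F1 reported above can be illustrated with a minimal sketch. The abstract does not say whether F1 is micro- or macro-averaged, so micro-averaging over notes is an assumption here, and the function name and the example code sets are made up for illustration only.

```python
from typing import Iterable, Set


def exact_match_f1(gold: Iterable[Set[str]], pred: Iterable[Set[str]]) -> float:
    """Micro-averaged F1: a predicted code counts only if it equals a gold code exactly."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # codes predicted and present in gold
        fp += len(p - g)   # codes predicted but not in gold
        fn += len(g - p)   # gold codes the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical notes: two gold code sets and two predicted code sets.
gold = [{"E11.9", "I10"}, {"99213"}]
pred = [{"E11.9", "I11.9"}, {"99213"}]
score = exact_match_f1(gold, pred)  # 2 true positives, 1 false positive, 1 false negative
```

Under this definition a near miss such as `I11.9` for `I10` earns no credit, which is why exact-match scores are a strict measure for coding tasks.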
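[4] MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens cs.CL | cs.AI | cs.IR
Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li
TL;DR: This paper proposes MSA (Memory Sparse Attention), an end-to-end trainable memory model framework that targets the high computational cost, precision degradation, and growing latency that large language models face on ultra-long contexts (such as 100M tokens). Using scalable sparse attention, document-wise RoPE, KV cache compression, and Memory Parallel, MSA achieves linear complexity in training and inference, degrades by less than 9% when scaling to 100M tokens, and runs inference on 2 A800 GPUs.
Details
Motivation: Existing ways of extending an LLM's effective context length (hybrid linear attention, fixed-size memory states, RAG, or agent systems) suffer from severe precision degradation, latency that grows rapidly with context, an inability to modify memory content dynamically, or a lack of end-to-end optimization. These limits hold back complex scenarios such as large-corpus summarization, Digital Twins, and long-history agent reasoning.
Result: On long-context benchmarks, MSA significantly surpasses frontier LLMs, state-of-the-art RAG systems, and leading memory agents. Performance degrades by less than 9% when scaling from 16K to 100M tokens, and 100M-token inference runs on 2 A800 GPUs.
Insight: The core innovations are: 1) a scalable sparse attention mechanism with linear complexity; 2) document-wise RoPE (Rotary Position Embedding), which improves the stability of position encoding; 3) KV cache compression combined with Memory Parallel, which sharply lowers hardware requirements; 4) Memory Interleaving, which supports complex multi-hop reasoning across scattered memory segments. Together they decouple memory capacity from reasoning and give general-purpose models a scalable, lifetime-scale memory foundation.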
Abstract: Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically limited to 1M tokens. Existing approaches, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage methods like RAG or agent systems, attempt to extend this limit. However, they often suffer from severe precision degradation and rapidly increasing latency as context length grows, an inability to dynamically modify memory content, or a lack of end-to-end optimization. These bottlenecks impede complex scenarios like large-corpus summarization, Digital Twins, and long-history agent reasoning, while limiting memory capacity and slowing inference. We present Memory Sparse Attention (MSA), an end-to-end trainable, efficient, and massively scalable memory model framework. Through core innovations including scalable sparse attention and document-wise RoPE, MSA achieves linear complexity in both training and inference while maintaining exceptional stability, exhibiting less than 9% degradation when scaling from 16K to 100M tokens. Furthermore, KV cache compression, combined with Memory Parallel, enables 100M-token inference on 2xA800 GPUs. We also propose Memory Interleaving to facilitate complex multi-hop reasoning across scattered memory segments. MSA significantly surpasses frontier LLMs, state-of-the-art RAG systems, and leading memory agents in long-context benchmarks. These results demonstrate that by decoupling memory capacity from reasoning, MSA provides a scalable foundation to endow general-purpose models with intrinsic, lifetime-scale memory.
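[5] Cluster-R1: Large Reasoning Models Are Instruction-following Clustering Agents cs.CL | cs.AI
Peijun Qing, Puneet Mathur, Nedim Lipka, Varun Manjunatha, Ryan Rossi
TL;DR: The paper proposes a new approach to instruction-following clustering. It reframes clustering as a generative task and trains large reasoning models as autonomous clustering agents, addressing two limitations at once: general-purpose embedding models cannot follow user instructions, and instruction-tuned embedders cannot infer latent corpus structure.
Details
Motivation: General-purpose embedding models recognize semantic similarity but cannot capture the text characteristics specified by a user instruction. Instruction-tuned embedders can align embeddings with textual instructions but cannot autonomously infer latent corpus structure, such as the optimal number of clusters.
Result: On the ReasonCluster benchmark, which comprises 28 diverse tasks spanning daily dialogue, legal cases, and financial reports, the approach consistently outperforms strong embedding-based baselines and large reasoning model baselines across datasets and clustering scenarios, yielding more faithful and interpretable instruction-based clustering.
Insight: The contribution is to recast instruction-following clustering as a generative task and use the reasoning ability of large reasoning models to interpret high-level clustering instructions and infer latent groupings, which improves faithfulness and interpretability. Through reasoning-driven training, the method combines semantic understanding with structure inference and offers a new paradigm for instruction-based clustering.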
Abstract: General-purpose embedding models excel at recognizing semantic similarities but fail to capture the characteristics of texts specified by user instructions. In contrast, instruction-tuned embedders can align embeddings with textual instructions yet cannot autonomously infer latent corpus structures, such as determining the optimal number of clusters. To address both limitations, we reframe instruction-following clustering as a generative task and train large reasoning models (LRMs) as autonomous clustering agents. Our reasoning-driven training pipeline enables LRMs to interpret high-level clustering instructions and then infer the corresponding latent groupings. To evaluate this paradigm, we introduce ReasonCluster, a comprehensive benchmark comprising 28 diverse tasks spanning daily dialogue, legal cases, and financial reports. Experiments across diverse datasets and clustering scenarios show that our approach consistently outperforms strong embedding-based methods and LRM baselines, demonstrating that explicit reasoning fosters more faithful and interpretable instruction-based clustering.
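[6] Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages cs.CL | cs.AI | cs.CV
Shaharukh Khan, Ali Faraz, Abhinav Ravi, Mohd Nauman, Mohd Sarfraz
TL;DR: This paper introduces the Chitrakshara dataset series, a large-scale multilingual multimodal dataset for Indian languages that addresses their underrepresentation in existing vision-language models. The series comprises Chitrakshara-IL, a large-scale interleaved pretraining dataset (193M images, 30B text tokens, and 50M multilingual documents), and Chitrakshara-Cap for image captioning (44M image-text pairs and 733M tokens).
Details
Motivation: Multimodal research has focused mainly on single-image reasoning, with limited exploration of multi-image scenarios, and most vision-language models are trained primarily on English datasets, leaving Indian languages underrepresented.
Result: The paper reports no model experiments. Instead, it details the data collection pipeline and presents a quality and diversity analysis that assesses the dataset's representativeness across Indian languages and its potential for developing more culturally inclusive VLMs.
Insight: The contribution is the first large-scale multilingual multimodal dataset covering 11 Indian languages, built by crawling Common Crawl and applying careful filtering and processing. It is a valuable resource for multi-image understanding and cross-cultural multimodal research and can support fairer AI models.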
Abstract: Multimodal research has predominantly focused on single-image reasoning, with limited exploration of multi-image scenarios. Recent models have sought to enhance multi-image understanding through large-scale pretraining on interleaved image-text datasets. However, most Vision-Language Models (VLMs) are trained primarily on English datasets, leading to inadequate representation of Indian languages. To address this gap, we introduce the Chitrakshara dataset series, covering 11 Indian languages sourced from Common Crawl. It comprises (1) Chitrakshara-IL, a large-scale interleaved pretraining dataset with 193M images, 30B text tokens, and 50M multilingual documents, and (2) Chitrakshara-Cap, which includes 44M image-text pairs with 733M tokens. This paper details the data collection pipeline, including curation, filtering, and processing methodologies. Additionally, we present a comprehensive quality and diversity analysis to assess the dataset’s representativeness across Indic languages and its potential for developing more culturally inclusive VLMs.
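[7] Qworld: Question-Specific Evaluation Criteria for LLMs cs.CL | cs.AI
Shanghua Gao, Yuchang Su, Pengwei Sui, Curtis Ginder, Marinka Zitnik
TL;DR: This paper proposes Qworld, a method that generates question-specific evaluation criteria for open-ended questions. Using a recursive expansion tree, it decomposes a question into scenarios, perspectives, and fine-grained binary criteria, covering the evaluation dimensions implied by each question. On HealthBench, Qworld covers 89% of expert criteria and generates 79% novel criteria validated by experts, who rate its criteria as more insightful and fine-grained. Applied to 11 frontier LLMs on HealthBench and Humanity's Last Exam, Qworld reveals capability differences in long-term impact, equity, error handling, and interdisciplinary reasoning that coarse rubrics cannot distinguish.
Details
Motivation: Existing LLM evaluation methods fall short on open-ended questions because response quality depends heavily on the question's context, and binary scores and static rubrics cannot capture these context-dependent requirements. Existing methods typically define criteria at the dataset level or generate them in a single pass, which limits how well they explore the evaluation space implied by each question.
Result: On HealthBench, Qworld covers 89% of expert-authored criteria and generates 79% novel criteria validated by human experts. Experts rate Qworld criteria higher in insight and granularity than those of prior methods. Evaluating 11 frontier LLMs on HealthBench and Humanity's Last Exam, Qworld reveals differences in model capability along dimensions such as long-term impact, equity, error handling, and interdisciplinary reasoning.
Insight: The contribution is to formalize criteria generation as structured coverage of the evaluation axes implied by a question, using a recursive expansion tree for hierarchical and horizontal decomposition, so that evaluation adapts to each question rather than relying on fixed task-level criteria. The method provides a dynamic, fine-grained, and interpretable evaluation framework that exposes subtle capability differences on complex open-ended tasks.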
Abstract: Evaluating large language models (LLMs) on open-ended questions is difficult because response quality depends on the question’s context. Binary scores and static rubrics fail to capture these context-dependent requirements. Existing methods define criteria at the dataset level or generate them in a single pass, which limits their ability to explore the evaluation space implied by each question. We introduce One-Question-One-World (Qworld), a method that generates question-specific evaluation criteria using a recursive expansion tree. Given a question, Qworld decomposes it into scenarios, perspectives, and fine-grained binary criteria through structured hierarchical and horizontal expansion. The resulting criteria specify what a high-quality answer must address for that question. On HealthBench, Qworld covers 89% of expert-authored criteria and generates 79% novel criteria validated by human experts. Experts rate Qworld criteria higher in insight and granularity than those produced by prior methods. When applied to 11 frontier LLMs on HealthBench and Humanity’s Last Exam, Qworld reveals capability differences in dimensions such as long-term impact, equity, error handling, and interdisciplinary reasoning that coarse rubrics do not distinguish. By formulating criteria generation as structured coverage of question-implied evaluation axes, Qworld enables evaluation that adapts to each question rather than relying on fixed task-level criteria.
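[8] Do 3D Large Language Models Really Understand 3D Spatial Relationships? cs.CL | cs.RO
Xianzheng Ma, Tao Sun, Shuai Chen, Yash Bhalgat, Jindong Gu
TL;DR: This paper questions whether existing 3D large language models (3D-LLMs) truly understand 3D spatial relationships. It shows that a language model fine-tuned only on text question-answer pairs performs comparably to or even better than existing 3D-LLMs on the SQA3D benchmark, which indicates a flaw in the benchmark. The authors propose Real-3DQA, a more rigorous benchmark that filters out easy questions and systematically assesses 3D reasoning. Experiments confirm that existing models struggle with spatial relationships once simple cues are removed. The authors also propose a 3D-reweighted training objective that pushes the model to rely more on 3D visual cues and substantially improves spatial reasoning performance.
Details
Motivation: Existing 3D-LLMs claim to understand 3D worlds, especially spatial relationships among objects, but the authors find that the SQA3D benchmark may not detect whether a model exploits textual shortcuts instead of 3D-aware reasoning. More robust benchmarks and training strategies are needed to advance genuine 3D vision-language understanding.
Result: On the proposed Real-3DQA benchmark, experiments confirm that existing 3D-LLMs struggle with spatial relationships once simple cues are removed, while the proposed 3D-reweighted training objective substantially improves performance on spatial reasoning tasks.
Insight: The contribution is to expose the limitations of an existing 3D evaluation benchmark and to propose both a more rigorous benchmark, Real-3DQA (with a structured taxonomy), and a 3D-reweighted training objective that forces the model to rely on 3D visual cues. Both are useful for advancing genuine 3D scene understanding.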
Abstract: Recent 3D Large-Language Models (3D-LLMs) claim to understand 3D worlds, especially spatial relationships among objects. Yet, we find that simply fine-tuning a language model on text-only question-answer pairs can perform comparably or even surpass these methods on the SQA3D benchmark without using any 3D input. This indicates that the SQA3D benchmark may not be able to detect if the model exploits textual shortcuts rather than engages in 3D-aware reasoning. To address this issue, we introduce Real-3DQA, a more rigorous evaluation benchmark that filters out easy-to-guess questions and introduces a structured taxonomy to assess various aspects of 3D reasoning. Experiments on Real-3DQA confirm that existing 3D-LLMs struggle with spatial relationships once simple cues are removed. We further propose a 3D-reweighted training objective that guides the model to rely more on 3D visual clues, substantially enhancing 3D-LLMs' performance in spatial reasoning tasks. Our findings underscore the need for robust benchmarks and tailored training strategies to advance genuine 3D vision-language understanding. Project page: https://real-3dqa.github.io/.
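[9] Large Language Models Unpack Complex Political Opinions through Target-Stance Extraction cs.CL
Özgür Togay, Florian Kunneman, Javier Garcia-Bernardo, Anastasia Giachanou
TL;DR: This paper studies how well large language models (LLMs) perform Target-Stance Extraction (TSE), a task that combines target identification and stance detection to enable fine-grained analysis of online political discussion. The authors build a dataset of 1,084 Reddit posts and evaluate a range of proprietary and open-source LLMs under zero-shot, few-shot, and context-augmented prompting. The best models perform comparably to highly trained human annotators and remain robust on difficult posts with low inter-annotator agreement.
Details
Motivation: Computational analyses often reduce political discourse to coarse partisan labels, overlooking the complex interplay of beliefs about policies, figures, and issues. Online political conversations are especially nuanced and wide-ranging, which makes it hard to identify automatically the target of discussion and the opinion expressed toward it. The paper explores whether LLMs can meet this challenge through TSE.
Result: On a dataset built from r/NeutralPolitics and covering 138 distinct political targets, the best LLMs perform comparably to highly trained human annotators and remain robust on challenging posts with low inter-annotator agreement.
Insight: The contribution is to apply TSE to the analysis of complex political opinions and to evaluate systematically what LLMs can do on the task. The study shows that LLMs can extract complex political opinions with minimal supervision, offering a scalable tool for computational social science and political text analysis, particularly for fine-grained, multi-target political discourse.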
Abstract: Political polarization emerges from a complex interplay of beliefs about policies, figures, and issues. However, most computational analyses reduce discourse to coarse partisan labels, overlooking how these beliefs interact. This is especially evident in online political conversations, which are often nuanced and cover a wide range of subjects, making it difficult to automatically identify the target of discussion and the opinion expressed toward it. In this study, we investigate whether Large Language Models (LLMs) can address this challenge through Target-Stance Extraction (TSE), a recent natural language processing task that combines target identification and stance detection, enabling more granular analysis of political opinions. For this, we construct a dataset of 1,084 Reddit posts from r/NeutralPolitics, covering 138 distinct political targets, and evaluate a range of proprietary and open-source LLMs using zero-shot, few-shot, and context-augmented prompting strategies. Our results show that the best models perform comparably to highly trained human annotators and remain robust on challenging posts with low inter-annotator agreement. These findings demonstrate that LLMs can extract complex political opinions with minimal supervision, offering a scalable tool for computational social science and political text analysis.
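[10] Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks cs.CL | cs.AI
Fatih Uenal
TL;DR: This paper introduces Swiss-Bench SBP-002, a trilingual benchmark for Swiss legal and regulatory tasks with 395 expert-crafted items spanning three regulatory domains, seven task types, and three languages. The author evaluates ten frontier models from March 2026. All of them perform poorly on the benchmark, the highest correct rate is only 38.2%, and the models fall into three performance tiers.
Details
Motivation: Existing benchmarks do not evaluate frontier models on applied Swiss regulatory compliance tasks. The paper fills this gap and provides an empirical reference for assessing model capability in Swiss-specific legal and regulatory domains under zero-retrieval conditions.
Result: Under zero-retrieval conditions, the ten frontier models perform poorly overall; the top-ranked model (Qwen 3.5 Plus) reaches only 38.2% correct. Performance falls into three tiers: Tier A (35-38% correct), Tier B (26-29%), and Tier C (13-21%). Task difficulty varies widely: legal translation and case analysis yield high correct rates (69-72%), while regulatory Q&A, hallucination detection, and gap analysis stay very low (<9%). Among the evaluated models, an open-weight model leads the ranking, and several open-weight models match or outperform closed-source ones.
Insight: The contribution is the first benchmark focused on applied, multilingual, multi-domain Swiss regulatory compliance tasks (Swiss-Bench SBP-002), evaluated with a structured three-dimension scoring framework and a blind panel of LLM judges aggregated by majority vote. The study shows that current frontier LLMs still have a substantial capability gap on complex, specialized regulatory tasks, with especially poor results on certain task types (such as regulatory Q&A). It also challenges the assumption that closed-source models generally beat open-weight ones and offers a new methodology and benchmark dataset for domain-specific model evaluation.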
Abstract: While recent work has benchmarked large language models on Swiss legal translation (Niklaus et al., 2025) and academic legal reasoning from university exams (Fan et al., 2025), no existing benchmark evaluates frontier model performance on applied Swiss regulatory compliance tasks. I introduce Swiss-Bench SBP-002, a trilingual benchmark of 395 expert-crafted items spanning three Swiss regulatory domains (FINMA, Legal-CH, EFK), seven task types, and three languages (German, French, Italian), and evaluate ten frontier models from March 2026 using a structured three-dimension scoring framework assessed via a blind three-judge LLM panel (GPT-4o, Claude Sonnet 4, Qwen3-235B) with majority-vote aggregation and weighted kappa = 0.605, with reference answers validated by an independent human legal expert on a 100-item subset (73% rated Correct, 0% Incorrect, perfect Legal Accuracy). Results reveal three descriptive performance clusters: Tier A (35-38% correct), Tier B (26-29%), and Tier C (13-21%). The benchmark proves difficult: even the top-ranked model (Qwen 3.5 Plus) achieves only 38.2% correct, with 47.3% incorrect and 14.4% partially correct. Task type difficulty varies widely: legal translation and case analysis yield 69-72% correct rates, while regulatory Q&A, hallucination detection, and gap analysis remain below 9%. Within this roster (seven open-weight, three closed-source), an open-weight model leads the ranking, and several open-weight models match or outperform their closed-source counterparts. These findings provide an initial empirical reference point for assessing frontier model capability on Swiss regulatory tasks under zero-retrieval conditions.
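The majority-vote aggregation over a three-judge panel can be sketched as follows. The abstract does not describe how a three-way disagreement among the judges is resolved, so returning `None` in that case is an assumption, and the label strings are illustrative.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of judges, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    # A strict majority needs more than half of the votes (2 of 3 for a three-judge panel).
    return label if count > len(labels) / 2 else None


# Hypothetical verdicts from three judges on two items.
agreed = majority_vote(["correct", "correct", "incorrect"])
split = majority_vote(["correct", "partially correct", "incorrect"])
```

With three possible labels and three judges, a complete split is possible, so any real pipeline needs an explicit tie-breaking rule.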
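[11] PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay cs.CL | cs.AI
Rohan Khetan, Ashna Khetan
TL;DR: The paper proposes PoliticsBench, a new benchmark framework based on multi-turn roleplay for evaluating political value bias in large language models (LLMs). Testing eight prominent LLMs, the study finds that seven lean left while Grok leans right, and it reveals differences in how the models reason.
Details
Motivation: Existing benchmarks of LLM social bias focus mainly on gender and racial stereotypes. Political bias, when included, is measured coarsely, without a careful account of the specific values that shape sociopolitical leanings. The study uses multi-turn roleplay to measure latent political value bias in LLMs, and its effect on objectivity, at a finer grain.
Result: Tests of eight LLMs (Claude, Deepseek, Gemini, GPT, Grok, Llama, Qwen Base, Qwen Instruction-Tuned) on PoliticsBench show that seven lean left and Grok leans right. The left-leaning models strongly exhibit liberal traits and moderately exhibit conservative ones. Alignment scores vary only slightly across roleplay stages, with no particular pattern.
Insight: The contribution is the first psychometric evaluation of political values in LLMs through multi-stage, free-text interaction, together with a fine-grained benchmark framework that captures specific political value orientations. It provides a new tool and methodological perspective for understanding and quantifying more complex social biases in LLMs.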
Abstract: While Large Language Models (LLMs) are increasingly used as primary sources of information, their potential for political bias may impact their objectivity. Existing benchmarks of LLM social bias primarily evaluate gender and racial stereotypes. When political bias is included, it is typically measured at a coarse level, neglecting the specific values that shape sociopolitical leanings. This study investigates political bias in eight prominent LLMs (Claude, Deepseek, Gemini, GPT, Grok, Llama, Qwen Base, Qwen Instruction-Tuned) using PoliticsBench: a novel multi-turn roleplay framework adapted from the EQ-Bench-v3 psychometric benchmark. We test whether commercially developed LLMs display a systematic left-leaning bias that becomes more pronounced in later stages of multi-stage roleplay. Through twenty evolving scenarios, each model reported its stance and determined its course of action. Scoring these responses on a scale of ten political values, we explored the values underlying chatbots’ deviations from unbiased standards. Seven of our eight models leaned left, while Grok leaned right. Each left-leaning LLM strongly exhibited liberal traits and moderately exhibited conservative ones. We discovered slight variations in alignment scores across stages of roleplay, with no particular pattern. Though most models used consequence-based reasoning, Grok frequently argued with facts and statistics. Our study presents the first psychometric evaluation of political values in LLMs through multi-stage, free-text interactions.
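[12] Language Model Planners do not Scale, but do Formalizers? cs.CL
Owen Jiang, Cassie Huang, Ashish Sabharwal, Li Zhang
TL;DR: This paper examines LLMs on planning problems and finds that LLM formalizers scale better than LLM planners, retaining high accuracy in complex domains such as BlocksWorld. Because formalizer performance still degrades with problem complexity, the paper proposes a divide-and-conquer technique to improve robustness. For "unraveling problems", where an exponential gap separates the description from the formal language (such as PDDL), it introduces the LLM-as-higher-order-formalizer paradigm: the LLM generates a program generator, which decouples token output from the combinatorial explosion of the underlying formalization and search space.
Details
Motivation: Prior work shows that LLMs perform poorly on complex planning problems. This paper asks whether LLM formalizers, which generate solver-oriented programs, face the same problem, and looks for ways to improve their scalability and robustness.
Result: In the classic BlocksWorld domain (state space of size up to 10^165), LLM formalizers greatly outperform LLM planners, and some models retain perfect accuracy. Divide-and-conquer greatly improves the robustness of smaller LLM formalizers.
Insight: The contribution is to show systematically that LLM formalizers scale better than planners, to propose divide-and-conquer for robustness, and to introduce the LLM-as-higher-order-formalizer paradigm for unraveling problems, where generating a program generator decouples token output from combinatorial explosion. This offers a new way to handle complex formalization tasks.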
Abstract: Recent work shows overwhelming evidence that LLMs, even those trained to scale their reasoning trace, perform unsatisfactorily on planning problems that are too complex. Whether the same conclusion holds for LLM formalizers that generate solver-oriented programs remains unknown. We systematically show that LLM formalizers greatly out-scale LLM planners, some retaining perfect accuracy in the classic BlocksWorld domain with a huge state space of size up to $10^{165}$. While performance of smaller LLM formalizers degrades with problem complexity, we show that a divide-and-conquer formalizing technique can greatly improve their robustness. Finally, we introduce unraveling problems where one line of problem description realistically corresponds to exponentially many lines of formal language such as the Planning Domain Definition Language (PDDL), greatly challenging LLM formalizers. We tackle this challenge by introducing a new paradigm, namely LLM-as-higher-order-formalizer, where an LLM generates a program generator. This decouples token output from the combinatorial explosion of the underlying formalization and search space.
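The idea of a program generator can be illustrated with a toy sketch: a short function whose source stays the same size while the PDDL it emits grows with the problem. This is not the paper's method or one of its unraveling problems (the output here grows only linearly); the function name, the `tower-N` problem name, and the use of the common BlocksWorld predicates `on`, `ontable`, `clear`, and `handempty` are assumptions for illustration.

```python
def blocksworld_problem(n: int) -> str:
    """Emit a PDDL problem: n blocks start on the table, and the goal is a single tower."""
    if n < 2:
        raise ValueError("need at least two blocks to state a stacking goal")
    blocks = [f"b{i}" for i in range(1, n + 1)]
    # Initial state: the hand is empty and every block sits clear on the table.
    init = ["(handempty)"]
    for b in blocks:
        init += [f"(ontable {b})", f"(clear {b})"]
    # Goal: b1 on b2, b2 on b3, and so on down the tower.
    goal = [f"(on {blocks[i]} {blocks[i + 1]})" for i in range(n - 1)]
    return "\n".join([
        f"(define (problem tower-{n})",
        "  (:domain blocksworld)",
        f"  (:objects {' '.join(blocks)})",
        f"  (:init {' '.join(init)})",
        f"  (:goal (and {' '.join(goal)})))",
    ])


problem = blocksworld_problem(3)
```

A model that writes a generator like this emits a fixed number of tokens regardless of `n`, whereas a model that writes the PDDL directly must emit every fact itself.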
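[13] BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents cs.CL | cs.CY
Praveen Kumar Myakala, Manan Agrawal, Rahul Manche
TL;DR: The paper proposes BeliefShift, a longitudinal benchmark for evaluating belief dynamics of large language models across multi-session conversations, with three tracks: belief consistency, contradiction detection, and evidence-driven revision. It contains 2,400 human-annotated multi-session interaction trajectories on topics such as health and politics. The authors evaluate seven models, including GPT-4o, reveal a trade-off between personalization and resistance to belief drift, and introduce four new evaluation metrics.
Details
Motivation: Existing benchmarks treat user information as static facts to be stored and retrieved, ignoring the fact that users change their views over long conversations and that phenomena such as opinion drift, over-alignment, and confirmation bias arise. A new benchmark is needed to evaluate how LLM agents handle changing beliefs.
Result: Seven models (including GPT-4o and Claude 3.5 Sonnet) are evaluated under zero-shot and retrieval-augmented generation settings. The results show a clear trade-off: models that personalize aggressively resist belief drift poorly, while factually grounded models miss legitimate belief updates.
Insight: The contribution is the first longitudinal benchmark focused on belief dynamics in LLMs, along with four new evaluation metrics (such as Belief Revision Accuracy and Drift Coherence Score) that quantify model behavior when opinions change in realistic settings, filling a gap left by static evaluation.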
Abstract: LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved. That’s the wrong model. People change their minds, and over extended interactions, phenomena like opinion drift, over-alignment, and confirmation bias start to matter a lot. BeliefShift introduces a longitudinal benchmark designed specifically to evaluate belief dynamics in multi-session LLM interactions. It covers three tracks: Temporal Belief Consistency, Contradiction Detection, and Evidence-Driven Revision. The dataset includes 2,400 human-annotated multi-session interaction trajectories spanning health, politics, personal values, and product preferences. We evaluate seven models including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, LLaMA-3, and Mistral-Large under zero-shot and retrieval-augmented generation (RAG) settings. Results reveal a clear trade-off: models that personalize aggressively resist drift poorly, while factually grounded models miss legitimate belief updates. We further introduce four novel evaluation metrics: Belief Revision Accuracy (BRA), Drift Coherence Score (DCS), Contradiction Resolution Rate (CRR), and Evidence Sensitivity Index (ESI).
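[14] Dialogue to Question Generation for Evidence-based Medical Guideline Agent Development cs.CL | cs.LG
Zongliang Ji, Ziyang Zhang, Xincheng Tan, Matthew Thompson, Anna Goldenberg
TL;DR: This study examines the feasibility of using large language models (LLMs) as ambient assistants that automatically generate questions grounded in evidence-based guidelines during physician-patient conversations. It uses Gemini 2.5, compares a zero-shot prompt with a multi-stage reasoning strategy, and evaluates on 80 transcripts of real clinical encounters. Although general-purpose LLMs are not yet fully reliable, they can already produce clinically meaningful, guideline-relevant questions, which suggests potential to reduce physicians' cognitive burden and bring evidence-based medicine to the point of care.
Details
Motivation: In fast-paced primary care, short consultations, heavy patient loads, and lengthy guideline documents make it hard for physicians to consult and follow evidence-based guidelines in real time.
Result: On a benchmark of 80 transcripts of real clinical encounters, six experienced physicians contributed over 90 hours of structured review. The model can generate clinically meaningful and guideline-relevant questions, but general-purpose LLMs are not yet fully reliable.
Insight: The contribution is to shift the use of LLMs from conventional question answering to question generation, with the aim of scaffolding physician reasoning and integrating guideline-based practice into brief consultations. The study supports the feasibility of using an LLM as an ambient assistant that generates targeted questions on the fly, opening a technical path toward evidence-based medical guideline agents.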
Abstract: Evidence-based medicine (EBM) is central to high-quality care, but remains difficult to implement in fast-paced primary care settings. Physicians face short consultations, increasing patient loads, and lengthy guideline documents that are impractical to consult in real time. To address this gap, we investigate the feasibility of using large language models (LLMs) as ambient assistants that surface targeted, evidence-based questions during physician-patient encounters. Our study focuses on question generation rather than question answering, with the aim of scaffolding physician reasoning and integrating guideline-based practice into brief consultations. We implemented two prompting strategies, a zero-shot baseline and a multi-stage reasoning variant, using Gemini 2.5 as the backbone model. We evaluated on a benchmark of 80 de-identified transcripts from real clinical encounters, with six experienced physicians contributing over 90 hours of structured review. Results indicate that while general-purpose LLMs are not yet fully reliable, they can produce clinically meaningful and guideline-relevant questions, suggesting significant potential to reduce cognitive burden and make EBM more actionable at the point of care.
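[15] OmniACBench: A Benchmark for Evaluating Context-Grounded Acoustic Control in Omni-Modal Models cs.CL
Seunghee Kim, Bumkyu Park, Kyudan Jung, Joosung Lee, Soyoon Kim
TL;DR: This paper presents OmniACBench, a benchmark for context-grounded acoustic control in omni-modal models: given a spoken instruction, a text script, and an image, can the model read the script aloud with an appropriate tone and manner? The benchmark contains 3,559 verified instances covering six acoustic features: speech rate, phonation, pronunciation, emotion, global accent, and timbre. Extensive experiments on eight models show clear limitations in this setting despite strong results on earlier textual-output evaluations. The main bottleneck is integrating multimodal context to generate faithful speech.
Details
Motivation: Existing testbeds for omni-modal models assess multimodal understanding mainly through textual outputs, which leaves it unclear whether a model can properly speak its answer. A dedicated evaluation of acoustic control is needed.
Result: Experiments on eight models with OmniACBench show limited performance on context-grounded acoustic control, reveal a gap relative to textual-output evaluations, and identify three common failure modes.
Insight: The contribution is the first benchmark dedicated to acoustic control in omni-modal models. It stresses that integrating multimodal context is key to faithful speech generation and systematically analyzes three bottlenecks (weak direct control, failed implicit inference, and failed multimodal grounding), giving direction for models that can verbalize responses effectively.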
Abstract: Most testbeds for omni-modal models assess multimodal understanding via textual outputs, leaving it unclear whether these models can properly speak their answers. To study this, we introduce OmniACBench, a benchmark for evaluating context-grounded acoustic control in omni-modal models. Given a spoken instruction, a text script, and an image, a model must read the script aloud with an appropriate tone and manner. OmniACBench comprises 3,559 verified instances covering six acoustic features: speech rate, phonation, pronunciation, emotion, global accent, and timbre. Extensive experiments on eight models reveal their limitations in the proposed setting, despite their strong performance on prior textual-output evaluations. Our analyses show that the main bottleneck lies not in processing individual modalities, but in integrating multimodal context for faithful speech generation. Moreover, we identify three common failure modes (weak direct control, failed implicit inference, and failed multimodal grounding), providing insights for developing models that can verbalize responses effectively.
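[16] From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents cs.CL
Sirui Xia, Yikai Zhang, Aili Chen, Siye Wu, Siyu Yuan
TL;DR: The paper proposes POISE, a closed-loop framework for automated discovery of policy optimization algorithms for language models. It maintains a structured, genealogically linked archive that connects algorithm proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms such as analytic-variance scaling and validity masking.
Details
Motivation: Discovering better policy optimization algorithms for language models is still a costly manual process that requires repeated mechanism-level modification and validation. The paper addresses the problem of automating this search, which means searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations.
Result: In mathematical reasoning experiments, the best algorithm variant found by POISE improves the weighted Overall score from 47.8 to 52.5 (+4.6) and raises AIME25 pass@32 from 26.7% to 43.3%, demonstrating that automated discovery of policy optimization algorithms is feasible.
Insight: The core contribution is a closed-loop automated discovery framework (POISE) that supports evidence-driven iteration, with a structured archive that enables lineage tracking and knowledge reuse across algorithms. The method formalizes algorithm discovery as a searchable, interpretable closed-loop system and applies it successfully to find new LLM-RL mechanisms with real performance gains.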
Abstract: Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism-level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically linked archive linking proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms, including analytic-variance scaling and validity masking. The best variant improves weighted Overall from 47.8 to 52.5 (+4.6) and increases AIME25 pass@32 from 26.7% to 43.3%, demonstrating the feasibility of automated policy optimization discovery while supporting interpretable design principles.
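For readers unfamiliar with the pass@k metric quoted above, here is a minimal sketch of the widely used unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), for a problem with n sampled solutions of which c are correct. The abstract does not state which estimator or how many samples per problem the authors used, so this is an illustration of the metric in general and not of their evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement from n, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 whenever fewer than k incorrect samples exist,
    # in which case any draw of k samples must contain a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 4 samples drawn, 1 of them correct, scored at k = 2.
estimate = pass_at_k(n=4, c=1, k=2)
```

A benchmark-level pass@k is then the mean of this per-problem value over all problems.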
[17] The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More cs.CL | cs.AI | cs.GT | cs.LG | cs.MAPDF
Lingjiao Chen, Chi Zhang, Yeye He, Ion Stoica, Matei Zaharia
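TL;DR: This paper presents the first systematic study of the gap between the listed API prices of reasoning language models (RLMs) and their actual inference costs. It identifies a "pricing reversal" phenomenon: in 21.8% of model-pair comparisons, the model with the lower listed price ends up with the higher total cost, with reversals reaching up to 28x. For example, Gemini 3 Flash is listed 78% cheaper than GPT-5.2, yet its actual cost across all tasks is 22% higher. The root cause is the large heterogeneity in thinking token consumption, and repeated runs of the same query vary by up to 9.7x in thinking tokens, which makes cost prediction very difficult.
Details
Motivation: Developers and consumers usually choose reasoning language models based on listed API prices, but it is unclear whether those prices accurately reflect actual inference costs. The paper systematically examines the relationship between listed price and actual cost to expose potential mismatches.
Result: Eight frontier RLMs are evaluated on 9 tasks covering competition math, science QA, code generation, and multi-domain reasoning. Pricing reversals are widespread, and removing thinking token costs raises the rank correlation (Kendall’s τ) between price and cost rankings from 0.563 to 0.873. The noise floor for cost prediction is set by the high variability of thinking token consumption (up to 9.7x).
Insight: The novelty lies in the first systematic quantification of the disconnect between listed RLM prices and actual costs, showing that heterogeneous thinking token consumption is the key driver of pricing reversal. This underscores the importance of cost-aware model selection and transparent per-request cost monitoring, and opens new directions for research on API pricing transparency and cost prediction.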
Abstract: Developers and consumers increasingly choose reasoning language models (RLMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RLMs across 9 diverse tasks covering competition math, science QA, code generation, and multi-domain reasoning. We uncover the pricing reversal phenomenon: in 21.8% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash’s listed price is 78% cheaper than GPT-5.2’s, yet its actual cost across all tasks is 22% higher. We trace the root cause to vast heterogeneity in thinking token consumption: on the same query, one model may use 900% more thinking tokens than another. In fact, removing thinking token costs reduces ranking reversals by 70% and raises the rank correlation (Kendall’s $\tau$) between price and cost rankings from 0.563 to 0.873. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.
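A small worked example may help show how a reversal arises. The sketch below uses entirely hypothetical prices and token counts (they are not taken from the paper) and assumes that thinking tokens are billed at the output-token rate.

```python
def total_cost(price_in, price_out, prompt_tokens, thinking_tokens, answer_tokens):
    """Cost in dollars; prices are per million tokens.

    Assumption: thinking tokens are billed at the output-token rate.
    """
    billed_out = thinking_tokens + answer_tokens
    return (price_in * prompt_tokens + price_out * billed_out) / 1_000_000

# Hypothetical models: B has the lower listed price but thinks far longer.
cost_a = total_cost(price_in=2.0, price_out=10.0, prompt_tokens=1_000,
                    thinking_tokens=2_000, answer_tokens=500)
cost_b = total_cost(price_in=0.5, price_out=3.0, prompt_tokens=1_000,
                    thinking_tokens=20_000, answer_tokens=500)
reversal = cost_b > cost_a  # the "cheaper" model costs more on this query
```

With these made-up numbers, model B costs about $0.062 per query against $0.027 for model A, even though B's listed output price is less than a third of A's.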
[18] Thinking with Tables: Enhancing Multi-Modal Tabular Understanding via Neuro-Symbolic Reasoning cs.CLPDF
Kun-Yang Yu, Zhi Zhou, Shi-Yu Tian, Xiao-Wen Yang, Zi-Yi Jia
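TL;DR: This paper proposes Thinking with Tables (TWT) to address the challenges multimodal large language models face in tabular-vision understanding. TWT uses a code-based neuro-symbolic reasoning mechanism that interacts with an external environment to handle the structural variability, feature dependencies, and task heterogeneity of tables. On eight datasets, TWT outperforms existing baselines by an average of 10% in accuracy and matches or surpasses proprietary commercial SOTA models.
Details
Motivation: Multimodal large language models perform well on images and text, but tabular data, a critical real-world modality, remains relatively underexplored. The paper targets three core challenges: high structural variability and data incompleteness, complex feature dependencies, and heterogeneous problem-solving pipelines across downstream tasks.
Result: On eight representative datasets, TWT outperforms existing baselines by an average of 10% in accuracy and reaches performance comparable to, or better than, proprietary commercial SOTA LLMs on tabular-vision multimodal understanding tasks.
Insight: The core novelty is a program-aided, code-based neuro-symbolic reasoning mechanism that interacts with an environment to carry out key operations such as information extraction and element modeling, addressing the inherent challenges of tabular data systematically. Viewed objectively, combining neural perception with the logical rigor of symbolic reasoning offers a reusable paradigm for complex, heterogeneous table understanding tasks.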
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities across modalities such as images and text. However, tabular data, despite being a critical real-world modality, remains relatively underexplored in multimodal learning. In this paper, we focus on the task of Tabular-Vision Multi-Modal Understanding (TVMU) and identify three core challenges: (1) high structural variability and data incompleteness in tables, (2) implicit and complex feature dependencies, and (3) significant heterogeneity in problem-solving pipelines across downstream tasks. To address these issues, we propose Thinking with Tables (TWT). TWT employs a program-aided code-based neuro-symbolic reasoning mechanism that facilitates key operations, such as information extraction and element modeling, by interacting with external environments. We evaluate TWT on eight representative datasets. Experimental results demonstrate that TWT consistently outperforms existing baselines by an average of 10% in accuracy, achieving performance comparable to, or even surpassing, proprietary commercial SOTA LLMs on TVMU tasks. Models and codes are available at https://github.com/kunyang-YU/Thinking-with-Tables
[19] CVPD at QIAS 2026: RAG-Guided LLM Reasoning for Al-Mawarith Share Computation and Heir Allocation cs.CLPDF
Wassim Swaileh, Mohammed-En-Nadhir Zighem, Hichem Telli, Salah Eddine Bekhouche, Abdellah Zakaria Sellam
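TL;DR: This paper presents a retrieval-augmented generation (RAG) pipeline for multi-stage legal reasoning in Islamic inheritance law (Ilm al-Mawarith), covering heir identification, share computation, and allocation. The method combines rule-grounded synthetic data generation, hybrid retrieval with reranking, and schema-constrained output validation, and it achieves the top result on the QIAS 2026 blind-test leaderboard.
Details
Motivation: Reasoning about Islamic inheritance is a complex multi-stage task involving many rules and adjustments, with variants across legal schools and codifications. Models must therefore reason with high precision and reliability under explicit legal configurations.
Result: The proposed system ranks first on the official QIAS 2026 blind-test leaderboard with a MIR-E score of 0.935, indicating high reliability on Arabic legal reasoning tasks.
Insight: The novelty is an end-to-end RAG pipeline for a specific domain (Islamic inheritance law). Its core idea is to use a symbolic calculator to generate high-quality synthetic data with complete reasoning chains to strengthen retrieval, combined with hybrid retrieval and schema-constrained validation, which markedly improves the accuracy and reliability of complex rule-based reasoning.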
Abstract: Islamic inheritance (Ilm al-Mawarith) is a multi-stage legal reasoning task requiring the identification of eligible heirs, resolution of blocking rules (hajb), assignment of fixed and residual shares, handling of adjustments such as awl and radd, and generation of a consistent final distribution. The task is further complicated by variations across legal schools and civil-law codifications, requiring models to operate under explicit legal configurations. We present a retrieval-augmented generation (RAG) pipeline for this setting, combining rule-grounded synthetic data generation, hybrid retrieval (dense and BM25) with cross-encoder reranking, and schema-constrained output validation. A symbolic inheritance calculator is used to generate a large high-quality synthetic corpus with full intermediate reasoning traces, ensuring legal and numerical consistency. The proposed system achieves a MIR-E score of 0.935 and ranks first on the official QIAS 2026 blind-test leaderboard. Results demonstrate that retrieval-grounded, schema-aware generation significantly improves reliability in high-precision Arabic legal reasoning tasks.
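The awl adjustment mentioned above can be illustrated with a commonly cited textbook case: a husband (fixed share 1/2) and two full sisters (fixed share 2/3) together claim 7/6 of the estate, so every share is reduced proportionally. The sketch below is not the authors' symbolic calculator; the function name is hypothetical, and it ignores blocking rules, residuary heirs, radd, and school-specific variations.

```python
from fractions import Fraction

def apply_awl(fixed_shares):
    """Scale fixed shares down proportionally when they sum to more than 1."""
    total = sum(fixed_shares.values())
    if total <= 1:
        return dict(fixed_shares)  # shares fit the estate; no awl needed
    return {heir: share / total for heir, share in fixed_shares.items()}

shares = apply_awl({
    "husband": Fraction(1, 2),
    "two full sisters": Fraction(2, 3),
})
```

Exact rational arithmetic avoids rounding drift, which matters when the final distribution must sum to exactly one. Here the shares become 3/7 and 4/7.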
[20] ConceptKT: A Benchmark for Concept-Level Deficiency Prediction in Knowledge Tracing cs.CLPDF
Yu-Chen Kang, Yu-Chien Tang, An-Zi Yen
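TL;DR: This paper introduces ConceptKT, a benchmark for concept-level deficiency prediction in knowledge tracing. Using an annotated dataset, it evaluates the diagnostic abilities of large language models and large reasoning models, and it explores strategies that select historical records by conceptual alignment and semantic similarity to improve both correctness prediction and concept-level deficiency identification.
Details
Motivation: Traditional knowledge tracing systems focus on binary correctness prediction and cannot diagnose the conceptual misunderstandings behind errors. The paper aims to provide fine-grained diagnostic feedback that supports personalized instruction and effective remediation.
Result: Experiments show that selecting response histories based on conceptual alignment and semantic similarity improves performance on both correctness prediction and concept-level deficiency identification.
Insight: The novelty lies in introducing the concept-level deficiency prediction task and building a corresponding annotated dataset. By probing the diagnostic abilities of large models under in-context learning, the paper proposes history selection strategies that strengthen fine-grained analysis in knowledge tracing.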
Abstract: Knowledge Tracing (KT) is a critical technique for modeling student knowledge to support personalized learning. However, most KT systems focus on binary correctness prediction and cannot diagnose the underlying conceptual misunderstandings that lead to errors. Such fine-grained diagnostic feedback is essential for designing targeted instruction and effective remediation. In this work, we introduce the task of concept-level deficiency prediction, which extends traditional KT by identifying the specific concepts a student is likely to struggle with on future problems. We present ConceptKT, a dataset annotated with labels that capture both the concepts required to solve each question and the missing concepts underlying incorrect responses. We investigate in-context learning approaches to KT and evaluate the diagnostic capabilities of various Large Language Models (LLMs) and Large Reasoning Models (LRMs). Different strategies for selecting informative historical records are explored. Experimental results demonstrate that selecting response histories based on conceptual alignment and semantic similarity leads to improved performance on both correctness prediction and concept-level deficiency identification.
[21] GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents cs.CL | cs.AI | cs.CVPDF
Yunzhe Wang, Runhui Xu, Kexin Zheng, Tianyi Zhang, Jayavibhav Niranjan Kogundi
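TL;DR: This paper introduces GameplayQA, a benchmarking framework for evaluating the perception and reasoning abilities of multimodal large language models acting as autonomous agents in 3D virtual environments. Built on densely annotated multiplayer 3D gameplay videos, it provides time-synced, concurrent captions of states, actions, and events, and it generates diagnostic QA pairs at three levels of cognitive complexity to test models on decision-dense, POV-synced multi-video understanding.
Details
Motivation: Existing benchmarks do not adequately evaluate the capabilities that multimodal large language models need as autonomous agents in 3D environments, such as perceiving rapid state changes, attributing actions to the correct entities, and reasoning about concurrent multi-agent behavior from a first-person perspective.
Result: Evaluating frontier multimodal large language models on GameplayQA reveals a substantial gap from human performance. Common failures involve temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game.
Insight: The novelty is a triadic system of Self, Other Agents, and the World that naturally decomposes multi-agent environments, together with a structured distractor taxonomy for fine-grained analysis of where models hallucinate. This provides a new evaluation tool and direction for research at the intersection of embodied AI, agentic perception, and world modeling.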
Abstract: Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agentic-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we refined 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.
[22] Improving Lean4 Autoformalization via Cycle Consistency Fine-tuning cs.CLPDF
Arsen Shebzukhov
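TL;DR: This paper proposes improving Lean4 autoformalization through cycle consistency fine-tuning. The author fine-tunes Qwen3.5-2B with LoRA on FineLeanCorpus for natural language to Lean4 formalization and compares three training regimes: supervised fine-tuning with curriculum learning, supervised fine-tuning without curriculum ordering, and reinforcement learning using group relative policy optimization with a cycle consistency reward. The reinforcement learning approach substantially outperforms supervised fine-tuning on the cycle consistency metric while preserving formalization quality.
Details
Motivation: Autoformalization, the automatic translation of natural-language mathematical text into a formal proof language such as Lean4, can accelerate AI-assisted mathematical research such as proof verification or proof search. The paper explores more effective fine-tuning strategies to improve the semantic fidelity of this translation.
Result: On an unseen subset of FineLeanCorpus and on PutnamBench, RL with the cycle consistency reward substantially outperforms both SFT variants in mean cycle consistency (0.669 vs. 0.513 on FLC; 0.561 vs. 0.422 on PutnamBench), while increasing cross-entropy loss by only 0.011 nats, with minimal impact on formalization quality. Curriculum learning yields no measurable benefit over shuffled training.
Insight: The main novelty is using cycle consistency as the reinforcement learning reward to improve semantic fidelity. The reward is computed over a loop from natural language to Lean4 and back to natural language, as the cosine similarity of off-the-shelf sentence embeddings. Viewed objectively, this effectively combines an unsupervised or self-supervised semantic consistency objective with conventional supervised fine-tuning and reinforcement learning, offering a new approach to improving autoformalization.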
Abstract: Autoformalization, the automatic translation of natural-language mathematical texts into a formal proof language such as Lean4, can help accelerate AI-assisted mathematical research, be it via proof verification or proof search. I fine-tune Qwen3.5-2B with LoRA for natural language to Lean4 formalization on FineLeanCorpus and consider three training regimes: supervised fine-tuning (SFT) with curriculum learning (difficulty 1 to 10), SFT without curriculum ordering, and reinforcement learning using group relative policy optimization (GRPO) with a cycle consistency reward. Cycle consistency measures how well the meaning of a statement is preserved through a NL to Lean4 to NL’ loop, computed as cosine similarity of off-the-shelf sentence embeddings. On an unseen subset of FineLeanCorpus (FLC) and on PutnamBench, RL substantially outperforms both SFT variants (mean cycle consistency 0.669 vs. 0.513 on FLC; 0.561 vs. 0.422 on PutnamBench), while increasing cross-entropy loss by only 0.011 nats, with minimal impact on formalization quality. Curriculum ordering provides no measurable benefit over shuffled training.
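The cycle consistency reward described above reduces to a cosine similarity between the embedding of the original statement and the embedding of its back-translation. The sketch below illustrates only that computation; `toy_embed` is a hypothetical bag-of-words stand-in for the off-the-shelf sentence-embedding model, which the abstract does not name, and the NL to Lean4 to NL’ translation steps are assumed to have happened already.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for a sentence-embedding model: word counts."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cycle_consistency(nl_original, nl_back_translated, embed=toy_embed):
    """Reward: similarity between the NL statement and its NL' round trip."""
    return cosine_similarity(embed(nl_original), embed(nl_back_translated))

same = cycle_consistency("the sum of two even numbers is even",
                         "the sum of two even numbers is even")
different = cycle_consistency("the sum of two even numbers is even",
                              "every prime greater than two is odd")
```

A faithful round trip scores near 1, while a back-translation that drifts in meaning scores lower, which is what makes the quantity usable as a reward signal.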
[23] Towards Reward Modeling for AI Tutors in Math Mistake Remediation cs.CLPDF
Kseniia Petukhova, Ekaterina Kochmar
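TL;DR: This paper addresses the difficulty of evaluating the pedagogical quality of AI tutors in math mistake remediation by proposing reward models based on human preference modeling. It builds a hierarchy of key pedagogical aspects, synthesizes minimally contrastive response pairs, and trains Bradley-Terry preference models. A model trained only on synthetic data reaches 0.69 accuracy on a human preference test, and combining weighted-sum ranking data with targeted synthetic groups raises this to 0.74, outperforming larger general-purpose reward models.
Details
Motivation: Existing natural language generation metrics cannot assess the quality of AI tutors along pedagogical dimensions such as identifying mistakes, scaffolding reasoning, and withholding the answer. A fine-grained pedagogical evaluation scheme is needed for math mistake remediation.
Result: On the MRBench human preference test set, the model trained only on synthetic data reaches 0.69 pairwise accuracy, and adding weighted-sum ranking data raises accuracy to 0.74. With only a 0.5B-parameter backbone, it outperforms larger general-purpose reward models.
Insight: Synthesizing minimally contrastive samples by decomposing pedagogical aspects, combined with automatically generated weighted-sum ranking data, yields data-efficient reward modeling of teaching quality. The hierarchical pedagogical evaluation framework offers a scalable methodology for fine-grained evaluation of educational AI.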
Abstract: Evaluating the pedagogical quality of AI tutors remains challenging: standard NLG metrics do not determine whether responses identify mistakes, scaffold reasoning, or avoid revealing the answers. For the task of mistake remediation, we derive a hierarchy of pedagogical aspects from human pairwise preferences on MRBench, and synthesize minimally contrastive response pairs that differ along key aspects (e.g., mistake identification and location, targetedness, scaffolding, actionability, clarity, and coherence). We develop and release Bradley-Terry preference models trained on weighted-sum rankings that we automatically create from MRBench, synthetic pairs, and data combinations. Using only synthetic data, our best model reaches 0.69 pairwise accuracy on a human preference test, and combining weighted-sum data with targeted synthetic groups improves accuracy to 0.74, outperforming larger general-purpose reward models while using only a 0.5B-parameter backbone.
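For readers unfamiliar with Bradley-Terry preference models, the sketch below shows the standard formulation: the probability that one response is preferred is the sigmoid of the difference between two scalar scores, and training minimizes the negative log-likelihood of observed preferences. The scores are made-up numbers, and the snippet does not reflect the authors' released models or their weighted-sum ranking construction.

```python
import math

def preference_probability(score_a, score_b):
    """Bradley-Terry: P(a preferred over b) = sigmoid(score_a - score_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def pairwise_loss(score_chosen, score_rejected):
    """Negative log-likelihood of the observed preference."""
    return -math.log(preference_probability(score_chosen, score_rejected))

def pairwise_accuracy(pairs):
    """Fraction of (chosen, rejected) score pairs ranked correctly."""
    return sum(chosen > rejected for chosen, rejected in pairs) / len(pairs)

p = preference_probability(1.2, 0.2)
loss = pairwise_loss(1.2, 0.2)
acc = pairwise_accuracy([(1.2, 0.2), (0.1, 0.4), (2.0, -1.0), (0.3, 0.0)])
```

Pairwise accuracy, the metric reported above, simply counts how often the model scores the human-preferred response higher.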
[24] When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools cs.CL | cs.AI | cs.CYPDF
Xingming Li, Runke Huang, Yanan Bao, Yuye Jin, Yuru Jiao
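TL;DR: This paper proposes Interaction2Eval, an LLM-based assessment framework for large-scale, scalable quality evaluation of teacher-child interaction (TCI) in Chinese preschools. The study builds TEPE-TCI-370h, the first large-scale dataset of naturalistic TCI in Chinese preschools, and shows that the framework reaches up to 88% agreement with human expert assessment when extracting structured quality indicators while making the assessment workflow 18x more efficient. This provides a technical basis for moving from annual manual audits to monthly AI-assisted monitoring.
Details
Motivation: Traditional expert-based assessment of teacher-child interaction quality is costly and time-consuming. In China's vast preschool system (36 million children, 250,000+ kindergartens), continuous quality monitoring is infeasible, so assessment is limited to infrequent spot checks that restrict timely intervention and improvement tracking. The paper investigates whether AI can serve as a scalable assessment teammate to address this challenge.
Result: On the TEPE-TCI-370h dataset, Interaction2Eval reaches up to 88% agreement with human expert assessment. In deployment validation across 43 classrooms, the assessment workflow becomes 18x more efficient, showing the potential to shift from annual expert audits to monthly AI-assisted monitoring with targeted human oversight.
Insight: The contributions are: (1) the first large-scale annotated dataset of naturalistic teacher-child interactions in Chinese preschools; (2) a specialized LLM framework that addresses domain-specific challenges such as child speech recognition, Mandarin homophone disambiguation, and rubric-based reasoning; (3) empirical evidence that AI-assisted assessment is feasible for improving efficiency and enabling continuous monitoring, laying the groundwork for a shift in how preschool education quality is assessed.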
Abstract: High-quality teacher-child interaction (TCI) is fundamental to early childhood development, yet traditional expert-based assessment faces a critical scalability challenge. In large systems like China’s, which serves 36 million children across 250,000+ kindergartens, the cost and time requirements of manual observation make continuous quality monitoring infeasible, relegating assessment to infrequent episodic audits that limit timely intervention and improvement tracking. In this paper, we investigate whether AI can serve as a scalable assessment teammate by extracting structured quality indicators and validating their alignment with human expert judgments. Our contributions include: (1) TEPE-TCI-370h (Tracing Effective Preschool Education), the first large-scale dataset of naturalistic teacher-child interactions in Chinese preschools (370 hours, 105 classrooms) with standardized ECQRS-EC and SSTEW annotations; (2) Interaction2Eval, a specialized LLM-based framework that addresses domain-specific challenges (child speech recognition, Mandarin homophone disambiguation, and rubric-based reasoning) and achieves up to 88% agreement; (3) deployment validation across 43 classrooms demonstrating an 18x efficiency gain in the assessment workflow, highlighting its potential for shifting from annual expert audits to monthly AI-assisted monitoring with targeted human oversight. This work not only demonstrates the technical feasibility of scalable, AI-augmented quality assessment but also lays the foundation for a new paradigm in early childhood education, one where continuous, inclusive, AI-assisted evaluation becomes the engine of systemic improvement and equitable growth.
[25] Mechanic: Sorrifier-Driven Formal Decomposition Workflow for Automated Theorem Proving cs.CLPDF
Ruichen Qiu, Yichuan Cao, Junqi Liu, Dakai Guo, Xiao-Shan Gao
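TL;DR: This paper proposes Mechanic, a novel agent system for automated theorem proving. It uses a formal decomposition strategy built on the sorry placeholder: when a proof attempt fails, the unresolved subgoals are precisely isolated and extracted into independent, self-contained contexts and solved separately. This avoids both the inefficiency of regenerating the whole proof and the attention degradation caused by the ever-longer contexts of repeated repairs.
Details
Motivation: Current LLM-based automated theorem provers rarely succeed on the first attempt at problems requiring complex mathematical reasoning and must repeatedly revise their proof strategies. Existing ways of handling failed attempts either discard the proof and regenerate it from scratch, which is inefficient, or repair it iteratively while keeping earlier progress, which makes the context grow and weakens the model's focus on the remaining unresolved subproblems.
Result: Experiments on challenging mathematical competition benchmarks, including IMO 2025 and Putnam 2025, show that the agent achieves significant advantages in proving efficiency.
Insight: The core novelty is using the sorry placeholder of the Lean theorem prover as a precise decomposition tool, preserving the verified proof structure while cleanly isolating unresolved subgoals. This gives automated theorem proving an efficient workflow that sits between full retry and in-place iterative repair, balancing compute usage against the model's attention span. Viewed objectively, systematically integrating the placeholder mechanism of formal proofs into the agent's decision loop is a clever meeting point of engineering and theory.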
Abstract: Recent advances in large language models (LLMs) and LLM-based agents have substantially improved the capabilities of automated theorem proving. However, for problems requiring complex mathematical reasoning, current systems rarely succeed on the first try and must repeatedly modify their proof strategies. Existing approaches for handling failed attempts typically either discard the entire proof and regenerate it from scratch or iteratively fix errors within the proof. The former is inefficient, as it may abandon mostly correct reasoning due to localized errors, while the latter, although preserving prior progress, leads to progressively longer contexts, which degrade the model’s ability to attend to the remaining unresolved subproblems. To address this dilemma, we propose Mechanic, a novel agent system that employs a sorry-driven formal decomposition strategy. By leveraging the sorry placeholder in Lean to precisely isolate unresolved subgoals while preserving the surrounding verified proof structure, Mechanic extracts each failed subproblem into a clean, self-contained context and resolves it independently. This avoids both the waste of full regeneration and the excessive context length induced by repeated repairs. Experimental results on challenging mathematical competition benchmarks, including IMO 2025 and Putnam 2025, demonstrate that our agent achieves significant advantages in proving efficiency.
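To make the role of the sorry placeholder concrete, here is a small Lean 4 illustration written for this digest; it is not taken from the paper, and the theorem names are hypothetical. The first theorem keeps a verified outer structure while two subgoals are left as sorry; the second shows one of those subgoals lifted into a standalone statement that can be proved in a fresh context.

```lean
-- Outer proof: the overall structure type-checks, with two subgoals deferred.
theorem add_comm_of_zero (a b : Nat) (h : b = 0) : a + b = b + a := by
  have h1 : a + b = a := by
    sorry  -- unresolved subgoal 1
  have h2 : b + a = a := by
    sorry  -- unresolved subgoal 2
  rw [h1, h2]

-- Subgoal 1 extracted as a self-contained statement and solved on its own.
theorem subgoal_one (a b : Nat) (h : b = 0) : a + b = a := by
  rw [h, Nat.add_zero]
```

Lean accepts the outer theorem with a warning that it depends on sorry, so the surrounding structure stays checked while each hole is handled separately.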
[26] Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? cs.CL | cs.LGPDF
Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim
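TL;DR: This paper finds that in mathematical reasoning, self-distillation can shorten a model's reasoning traces yet sometimes damages its reasoning ability and lowers performance. The root cause is that self-distillation suppresses the model's expression of uncertainty during reasoning (epistemic verbalization). This helps rapid in-domain optimization under limited task coverage but harms out-of-distribution (OOD) generalization.
Details
Motivation: Self-distillation is an effective post-training paradigm for LLMs that usually improves performance and shortens reasoning traces, yet performance degradation is observed in mathematical reasoning. The paper investigates the root cause of this degradation.
Result: Experiments on Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct show that self-distillation can cause performance drops of up to 40%. Conditioning the teacher on rich information suppresses uncertainty expression and thereby harms OOD performance.
Insight: The core novelty is attributing the degradation caused by self-distillation to its suppression of epistemic verbalization, the model's expression of uncertainty during reasoning. The key insight is that exposing an appropriate level of uncertainty is crucial for robust reasoning, so the optimization target should not be limited to reinforcing correct answer traces but should also cover the reasoning behavior itself.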
Abstract: Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model’s expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.
[27] Robust Multilingual Text-to-Pictogram Mapping for Scalable Reading Rehabilitation cs.CL | cs.HCPDF
Soufiane Jhilal, Martina Galletti
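TL;DR: This paper presents a multilingual, AI-driven text-to-pictogram system that supports reading rehabilitation for children with Special Educational Needs and Disabilities (SEND). The system automatically identifies key concepts in text and maps them to contextually relevant pictograms as visual scaffolding. Evaluated on English, French, Italian, Spanish, and Arabic, it shows high pictogram coverage, semantic accuracy, and latency suitable for real-time interaction.
Details
Motivation: Children with SEND face challenges in reading comprehension and often need intensive one-on-one reading support. To help therapists scale this support, the study develops an automated, multilingual visual scaffolding system.
Result: The system is evaluated on five typologically diverse languages (English, French, Italian, Spanish, and Arabic) and achieves high pictogram coverage and visual scaffolding density in all of them. Expert audits indicate that the automatically selected pictograms are semantically appropriate, with combined correct and acceptable ratings above 95% for the four European languages and about 90% for Arabic. System latency stays within interactive thresholds suitable for real-time educational use.
Insight: The novelty is a multilingual, AI-driven dynamic text-to-pictogram mapping system for scalable visual scaffolding in reading rehabilitation. Viewed objectively, it combines multilingual processing with a domain-specific pictogram repository and systematically validates technical viability, semantic safety, and acceptability across several languages and from a clinical perspective, providing empirical support for accessibility tools for neurodiverse learners.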
Abstract: Reading comprehension presents a significant challenge for children with Special Educational Needs and Disabilities (SEND), often requiring intensive one-on-one reading support. To assist therapists in scaling this support, we developed a multilingual, AI-powered interface that automatically enhances text with visual scaffolding. This system dynamically identifies key concepts and maps them to contextually relevant pictograms, supporting learners across languages. We evaluated the system across five typologically diverse languages (English, French, Italian, Spanish, and Arabic), through multilingual coverage analysis, expert clinical review by speech therapists and special education professionals, and latency assessment. Evaluation results indicate high pictogram coverage and visual scaffolding density across the five languages. Expert audits suggested that automatically selected pictograms were semantically appropriate, with combined correct and acceptable ratings exceeding 95% for the four European languages and approximately 90% for Arabic despite reduced pictogram repository coverage. System latency remained within interactive thresholds suitable for real-time educational use. These findings support the technical viability, semantic safety, and acceptability of automated multimodal scaffolding to improve accessibility for neurodiverse learners.
[28] MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination cs.CLPDF
Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou
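TL;DR: This paper proposes MARCH (Multi-Agent Reinforced Self-Check), a multi-agent reinforced self-check framework for reducing hallucination of large language models (LLMs) in retrieval-augmented generation (RAG) systems. Through a deliberately designed information asymmetry, it coordinates three specialized agents (Solver, Proposer, Checker) and trains them with multi-agent reinforcement learning to break self-confirmation bias and substantially lower hallucination rates.
Details
Motivation: Existing hallucination detection methods based on LLM-as-a-judge suffer from inherent confirmation bias: the verifier inadvertently reproduces the errors of the original generation. The paper aims to ensure rigorous factual alignment by enforcing information asymmetry.
Result: Extensive experiments on multiple hallucination benchmarks show that MARCH substantially reduces hallucination rates. In particular, an 8B-parameter LLM equipped with MARCH achieves performance competitive with powerful closed-source models.
Insight: The core novelty is a carefully designed information asymmetry scheme that breaks the cycle of self-confirmation bias: the Checker validates atomic propositions in isolation, without access to the Solver's original output. In addition, multi-agent reinforcement learning lets the agents co-evolve to optimize factual adherence, offering a scalable path toward factual self-improvement of LLMs.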
Abstract: Hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. While existing hallucination detection methods employ LLM-as-a-judge to verify LLM outputs against retrieved evidence, they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation. To address this, we introduce Multi-Agent Reinforced Self-Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry. MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, which the Proposer decomposes into claim-level verifiable atomic propositions. Crucially, the Checker validates these propositions against retrieved evidence in isolation, deprived of the Solver’s original output. This well-crafted information asymmetry scheme breaks the cycle of self-confirmation bias. By training this pipeline with multi-agent reinforcement learning (MARL), we enable the agents to co-evolve and optimize factual adherence. Extensive experiments across hallucination benchmarks demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B-parameter LLM equipped with MARCH achieves performance competitive with powerful closed-source models. MARCH paves a scalable path for factual self-improvement of LLMs through co-evolution. The code is at https://github.com/Qwen-Applications/MARCH.
cs.CV [Back]
[29] LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset cs.CV | cs.ROPDF
Royden Wagner, Omer Sahin Tas, Jaime Villa, Felix Hauser, Yinzhe Shen
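TL;DR: The paper introduces the KITScenes LongTail dataset, an end-to-end driving dataset focused on long-tail driving scenarios, aimed at the challenge of generalizing to rare scenarios in autonomous driving. It provides multi-view video, trajectories, high-level instructions, and detailed multilingual reasoning traces, supports in-context learning and few-shot generalization, and offers a new benchmark for evaluating the instruction following and semantic coherence of multimodal models such as VLMs and VLAs.
Details
Motivation: Generalizing to rare (long-tail) scenarios remains a fundamental challenge for autonomous driving in the real world, and existing datasets and benchmarks fall short in evaluating how well models adapt to complex, low-frequency events.
Result: The paper contributes a new dataset and benchmark for evaluating multimodal models in long-tail driving scenarios. Its evaluation goes beyond conventional safety and comfort metrics and focuses on instruction following and the semantic coherence of model outputs.
Insight: The novelty is a driving dataset with multilingual expert reasoning traces, a unique resource for studying how different forms of reasoning affect driving competence. It also facilitates the development of end-to-end driving models based on in-context learning and few-shot generalization.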
Abstract: In real-world domains such as self-driving, generalization to rare scenarios remains a fundamental challenge. To address this, we introduce a new dataset designed for end-to-end driving that focuses on long-tail driving events. We provide multi-view video data, trajectories, high-level instructions, and detailed reasoning traces, facilitating in-context learning and few-shot generalization. The resulting benchmark for multimodal models, such as VLMs and VLAs, goes beyond safety and comfort metrics by evaluating instruction following and semantic coherence between model outputs. The multilingual reasoning traces in English, Spanish, and Chinese are from domain experts with diverse cultural backgrounds. Thus, our dataset is a unique resource for studying how different forms of reasoning affect driving competence. Our dataset is available at: https://hf.co/datasets/kit-mrt/kitscenes-longtail
[30] M3T: Discrete Multi-Modal Motion Tokens for Sign Language Production cs.CVPDF
Alexandre Symeonidis-Herzig, Jianhe Low, Ozge Mercanoglu Sincan, Richard Bowden
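TL;DR: This paper proposes M3T, a model for sign language production. It integrates rich hand, face, and body motion representations through the SMPL-FX body model, discretizes them with modality-specific Finite Scalar Quantization VAEs, and trains an autoregressive transformer to generate multi-modal motion sequences.
Details
Motivation: Existing 3D sign language production systems struggle to integrate non-manual features such as mouthings, eyebrow movements, gaze, and head movements. The standard body model has a low-dimensional facial space, and when richer representations are adopted, standard discretization methods tend to suffer from codebook collapse, leaving most of the expression space unreachable.
Result: On three standard benchmarks (How2Sign, CSL-Daily, Phoenix14T), M3T achieves state-of-the-art sign language production quality. On NMFs-CSL it reaches 58.3% accuracy, well above the strongest comparable pose baseline (49.0%).
Insight: The novelty lies in proposing the SMPL-FX body model, which couples FLAME's rich expression space with the SMPL-X body, and in discretizing with modality-specific Finite Scalar Quantization VAEs to avoid codebook collapse. An auxiliary translation objective additionally encourages semantically grounded embeddings.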
Abstract: Sign language production requires more than hand motion generation. Non-manual features, including mouthings, eyebrow raises, gaze, and head movements, are grammatically obligatory and cannot be recovered from manual articulators alone. Existing 3D production systems face two barriers to integrating them: the standard body model provides a facial space too low-dimensional to encode these articulations, and when richer representations are adopted, standard discrete tokenization suffers from codebook collapse, leaving most of the expression space unreachable. We propose SMPL-FX, which couples FLAME’s rich expression space with the SMPL-X body, and tokenize the resulting representation with modality-specific Finite Scalar Quantization VAEs for body, hands, and face. M3T is an autoregressive transformer trained on this multi-modal motion vocabulary, with an auxiliary translation objective that encourages semantically grounded embeddings. Across three standard benchmarks (How2Sign, CSL-Daily, Phoenix14T) M3T achieves state-of-the-art sign language production quality, and on NMFs-CSL, where signs are distinguishable only by non-manual features, reaches 58.3% accuracy against 49.0% for the strongest comparable pose baseline.
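Finite Scalar Quantization replaces a learned codebook with a fixed grid: each latent dimension is bounded and rounded to a small number of levels, so no codebook entries can go unused. The sketch below shows this generic idea for a single latent vector with odd level counts; it is an illustrative simplification, not M3T's VAE, and it omits the straight-through gradient used during training. The level counts and input values are made up.

```python
import math

def fsq_quantize(z, levels):
    """Quantize one latent vector with Finite Scalar Quantization.

    Each dimension is squashed with tanh and rounded to one of `levels[i]`
    integer values (odd level counts only, so the grid is symmetric).
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) // 2
        codes.append(round(math.tanh(value) * half))
    return codes

def code_to_index(codes, levels):
    """Map per-dimension codes to a single token id (mixed-radix encoding)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        index = index * n_levels + (code + (n_levels - 1) // 2)
    return index

levels = [5, 5, 3]  # implicit codebook size: 5 * 5 * 3 = 75 tokens
codes = fsq_quantize([0.1, -2.0, 3.0], levels)
token = code_to_index(codes, levels)
```

Because the codebook is implicit in the grid, every token id in the range is reachable by construction, which is the property that sidesteps codebook collapse.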
[31] Ukrainian Visual Word Sense Disambiguation Benchmark cs.CV | cs.AIPDF
Yurii Laba, Yaryna Mohytych, Ivanna Rohulia, Halyna Kyryleyza, Hanna Dydyk-Meush
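TL;DR: This study builds a Visual Word Sense Disambiguation (Visual-WSD) benchmark for Ukrainian, which evaluates a model's ability to pick the most appropriate image for an ambiguous word from ten candidate images. The construction follows earlier benchmarks for English, Italian, and Farsi so that model performance can be compared across languages. Eight multilingual multimodal large language models are evaluated; all of them perform worse than the zero-shot CLIP-based baseline, and there is a significant performance gap between the Ukrainian and English tasks.
Details
Motivation: To build a Visual-WSD benchmark for Ukrainian that makes it possible to evaluate and compare multilingual multimodal models on this language and to include it in a broader cross-language evaluation framework.
Result: Eight multilingual multimodal large language models are tested on the Ukrainian Visual-WSD benchmark. All of them perform worse than the zero-shot CLIP baseline used for the English Visual-WSD task, and performance on the Ukrainian task is significantly lower than on English.
Insight: The novelty is the first Visual-WSD benchmark for Ukrainian and a systematic evaluation of existing multilingual multimodal models on this language. It reveals a significant performance gap between a lower-resource language and English on vision-language tasks and underlines the importance of cross-language evaluation.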
Abstract: This study presents a benchmark for evaluating the Visual Word Sense Disambiguation (Visual-WSD) task in Ukrainian. The main goal of the Visual-WSD task is to identify, with minimal contextual information, the most appropriate representation of a given ambiguous word from a set of ten images. To construct this benchmark, we followed a methodology similar to that proposed by (CITATION), who previously introduced benchmarks for the Visual-WSD task in English, Italian, and Farsi. This approach allows us to incorporate the Ukrainian benchmark into a broader framework for cross-language model performance comparisons. We collected the benchmark data semi-automatically and refined it with input from domain experts. We then assessed eight multilingual and multimodal large language models using this benchmark. All tested models performed worse than the zero-shot CLIP-based baseline model (CITATION) used by (CITATION) for the English Visual-WSD task. Our analysis revealed a significant performance gap in the Visual-WSD task between Ukrainian and English.
[32] Foundation Model Embeddings Meet Blended Emotions: A Multimodal Fusion Approach for the BLEMORE Challenge cs.CVPDF
Masoumeh Chapariniya, Aref Farhadipour, Sarah Ebling, Volker Dellwo, Teodora Vukovic
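TL;DR: This paper describes a system submitted to the FG 2026 BLEMORE Challenge, which focuses on blended emotion recognition with relative salience prediction. The system combines six encoder families through late probability fusion: an S4D-ViTMoE face encoder adapted with soft-label KL training, frozen layer-selective Wav2Vec2 audio features, finetuned body-language encoders (TimeSformer, VideoMAE), and Gemini Embedding 2.0 video embeddings, used for the first time in emotion recognition. Experiments show that selecting prosody-encoding layers (6–12) from frozen Wav2Vec2 outperforms end-to-end finetuning, that personalized expression styles are the primary bottleneck, and that task-adapted encoders receive more weight in the ensemble. The final 12-encoder system achieves Score = 0.279 on the test set, placing 6th.
Details
Motivation: To tackle multimodal fusion for blended emotion recognition with relative salience prediction in the BLEMORE Challenge, improving recognition performance by combining embeddings from several foundation models.
Result: On the BLEMORE Challenge test set, the system achieves Score = 0.279 (ACCP = 0.391, ACCS = 0.168), placing 6th. Gemini Embedding 2.0 video embeddings reach competitive presence accuracy (ACCP = 0.320) from only 2 seconds of input.
Insight: The contributions include the first use of Gemini Embedding 2.0 video embeddings in emotion recognition; the finding that specific prosody-encoding layers of frozen Wav2Vec2, rather than end-to-end finetuning, work better for non-verbal audio; the observation that personalized expression styles are the primary bottleneck for salience prediction; and the result that task-adapted encoders dominate the ensemble weights (62%).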
Abstract: We present our system for the BLEMORE Challenge at FG 2026 on blended emotion recognition with relative salience prediction. Our approach combines six encoder families through late probability fusion: an S4D-ViTMoE face encoder adapted with soft-label KL training, frozen layer-selective Wav2Vec2 audio features, finetuned body-language encoders (TimeSformer, VideoMAE), and – for the first time in emotion recognition – Gemini Embedding 2.0, a large multimodal model whose video embeddings produce competitive presence accuracy (ACCP = 0.320) from only 2 seconds of input. Three key findings emerge from our experiments: selecting prosody-encoding layers (6–12) from frozen Wav2Vec2 outperforms end-to-end finetuning (Score 0.207 vs. 0.161), as the non-verbal nature of BLEMORE audio makes phonetic layers irrelevant; the post-processing salience threshold $β$ varies from 0.05 to 0.43 across folds, revealing that personalized expression styles are the primary bottleneck; and task-adapted encoders collectively receive 62% of ensemble weight over general-purpose baselines. Our 12-encoder system achieves Score = 0.279 (ACCP = 0.391, ACCS = 0.168) on the test set, placing 6th.
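Late probability fusion, as used above, combines the class probabilities produced independently by each encoder rather than merging their features. The sketch below shows a plain weighted average under assumed inputs; the encoder outputs and weights are hypothetical, and the challenge system's actual weighting and salience post-processing (the threshold $β$) are not reproduced here.

```python
def late_fusion(prob_lists, weights):
    """Weighted average of per-encoder class probabilities."""
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]

# Hypothetical outputs of two encoders over three emotion classes.
face_probs = [0.6, 0.3, 0.1]
audio_probs = [0.2, 0.7, 0.1]
fused = late_fusion([face_probs, audio_probs], weights=[2.0, 1.0])
```

Because each encoder is trained and run independently, encoders can be added or reweighted without retraining the others.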
[33] Estimating Individual Tree Height and Species from UAV Imagery cs.CV | cs.AI | cs.LGPDF
Jannik Endres, Etienne Laliberté, David Rolnick, Arthur Ouaknine
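TL;DR: The paper introduces the BIRCH-Trees benchmark and the DINOvTree method for jointly estimating the height and species of individual trees from UAV RGB imagery. BIRCH-Trees is the first benchmark for this task and covers temperate forests, tropical forests, and boreal plantations. DINOvTree builds on a Vision Foundation Model (VFM) backbone with task-specific heads and is more parameter-efficient than existing methods.
Details
Motivation: Accurate estimation of forest biomass, a major carbon sink, depends on tree-level traits such as height and species. UAVs equipped with a single RGB camera are cost-effective and scalable, and the high-resolution imagery they capture opens a new route to mapping and measuring individual trees.
Result: In extensive evaluations on BIRCH-Trees, DINOvTree achieves the best overall results, with accurate height predictions and competitive species classification accuracy, while using only 54% to 58% of the parameters of the second-best approach.
Insight: The main contributions are: (1) BIRCH-Trees, the first UAV imagery benchmark for individual tree height and species estimation, covering several forest ecosystems; (2) DINOvTree, a unified framework that uses a VFM backbone with task-specific heads for multi-task learning and substantially improves parameter efficiency while maintaining strong performance.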
Abstract: Accurate estimation of forest biomass, a major carbon sink, relies heavily on tree-level traits such as height and species. Unoccupied Aerial Vehicles (UAVs) capturing high-resolution imagery from a single RGB camera offer a cost-effective and scalable approach for mapping and measuring individual trees. We introduce BIRCH-Trees, the first benchmark for individual tree height and species estimation from tree-centered UAV images, spanning three datasets: temperate forests, tropical forests, and boreal plantations. We also present DINOvTree, a unified approach using a Vision Foundation Model (VFM) backbone with task-specific heads for simultaneous height and species prediction. Through extensive evaluations on BIRCH-Trees, we compare DINOvTree against commonly used vision methods, including VFMs, as well as biological allometric equations. We find that DINOvTree achieves top overall results with accurate height predictions and competitive classification accuracy while using only 54% to 58% of the parameters of the second-best approach.
[34] MoCHA: Denoising Caption Supervision for Motion-Text Retrieval cs.CVPDF
Nikolai Warner, Cameron Ethan Taylor, Irfan Essa, Apaar Sadhwani
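TL;DR: MoCHA is a text canonicalization framework that reduces the embedding variance caused by annotation noise in motion-text retrieval, where the same motion is described by many different captions. Before encoding, it projects each caption onto its motion-recoverable content, producing tighter positive clusters and better-separated embeddings. As a preprocessing step, MoCHA is compatible with any retrieval architecture, and it sets a new state of the art on the HumanML3D and KIT-ML benchmarks.
Details
Motivation: Standard contrastive learning treats each caption as the single positive and ignores the distributional structure in which one motion has many valid descriptions. This inflates the variance of text embeddings for the same motion and weakens motion-text alignment.
Result: On HumanML3D and KIT-ML, the LLM variant of MoCHA improves T2M R@1 by 3.1 percentage points (to 13.9%) and 10.3 percentage points (to 24.3%), respectively, reaching the state of the art. The LLM-free T5 variant also achieves substantial gains. The method reduces within-motion text-embedding variance by 11-19% and greatly improves cross-dataset transfer (HumanML3D to KIT-ML by 94%, and the reverse direction by 52%).
Insight: The core idea is to "canonicalize" each caption into the semantic content that can be recovered from motion data alone, filtering out annotator-specific style and contextual noise that cannot be inferred from 3D joint coordinates. This suggests that denoising and standardizing the text side is an effective general principle for improving alignment quality and generalization when building multimodal representations.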
Abstract: Text-motion retrieval systems learn shared embedding spaces from motion-caption pairs via contrastive objectives. However, each caption is not a deterministic label but a sample from a distribution of valid descriptions: different annotators produce different text for the same motion, mixing motion-recoverable semantics (action type, body parts, directionality) with annotator-specific style and inferred context that cannot be determined from 3D joint coordinates alone. Standard contrastive training treats each caption as the single positive target, overlooking this distributional structure and inducing within-motion embedding variance that weakens alignment. We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding, producing tighter positive clusters and better-separated embeddings. Canonicalization is a general principle: even deterministic rule-based methods improve cross-dataset transfer, though learned canonicalizers provide substantially larger gains. We present two learned variants: an LLM-based approach (GPT-5.2) and a distilled FlanT5 model requiring no LLM at inference time. MoCHA operates as a preprocessing step compatible with any retrieval architecture. Applied to MoPa (MotionPatches), MoCHA sets a new state of the art on both HumanML3D (H) and KIT-ML (K): the LLM variant achieves 13.9% T2M R@1 on H (+3.1pp) and 24.3% on K (+10.3pp), while the LLM-free T5 variant achieves gains of +2.5pp and +8.1pp. Canonicalization reduces within-motion text-embedding variance by 11-19% and improves cross-dataset transfer substantially, with H to K improving by 94% and K to H by 52%, demonstrating that standardizing the language space yields more transferable motion-language representations.
[35] Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks cs.CVPDF
Morui Zhu, Yongqi Zhu, Song Fu, Qing Yang
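TL;DR: This paper proposes dCAP (dynamic Calibration and Articulated Perception), a framework that addresses the time-varying sensor poses caused by articulated trailer geometry and the fifth-wheel joint in autonomous trucks. The vision-based framework continuously estimates the 6-DoF relative pose between tractor and trailer cameras and is integrated with BEVFormer to improve 3D object detection. The authors also introduce STT4AT, a CARLA-based benchmark dataset.
Details
Motivation: Existing perception and calibration methods assume static baselines or rely on high-parallax, texture-rich scenes, so they cannot reliably handle the dynamic sensor pose changes of real-world articulated trucks. A robust method that continuously estimates the tractor-trailer relative pose is therefore needed.
Result: Experiments show that dCAP achieves stable and accurate perception on STT4AT, a simulated semi-trailer truck benchmark. Replacing static calibration with dynamically predicted extrinsics addresses the limitations of static calibration in autonomous trucking.
Insight: The novelty is a transformer architecture with cross-view and temporal attention that robustly aggregates spatial cues while maintaining temporal consistency, enabling accurate perception under rapid articulation and occlusion. The paper also builds a simulation benchmark dedicated to articulated trucks, supporting further research in this area.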
Abstract: Autonomous trucking poses unique challenges due to articulated tractor-trailer geometry, and time-varying sensor poses caused by the fifth-wheel joint and trailer flex. Existing perception and calibration methods assume static baselines or rely on high-parallax and texture-rich scenes, limiting their reliability under real-world settings. We propose dCAP (dynamic Calibration and Articulated Perception), a vision-based framework that continuously estimates the 6-DoF (degree of freedom) relative pose between tractor and trailer cameras. dCAP employs a transformer with cross-view and temporal attention to robustly aggregate spatial cues while maintaining temporal consistency, enabling accurate perception under rapid articulation and occlusion. Integrated with BEVFormer, dCAP improves 3D object detection by replacing static calibration with dynamically predicted extrinsics. To facilitate evaluation, we introduce STT4AT, a CARLA-based benchmark simulating semi-trailer trucks with synchronized multi-sensor suites and time-varying inter-rig geometry across diverse environments. Experiments demonstrate that dCAP achieves stable, accurate perception while addressing the limitations of static calibration in autonomous trucking. The dataset, development kit, and source code will be publicly released.
[36] Learning Cross-Joint Attention for Generalizable Video-Based Seizure Detection cs.CVPDF
Omar Zamzam, Takfarinas Medani, Chinmay Chinara, Richard Leahy
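TL;DR: This paper proposes a generalizable video-based seizure detection method built on a joint-centric attention model. It detects body joints and extracts joint-centered video clips to suppress background bias, tokenizes the clips with a Video Vision Transformer (ViViT), and learns cross-joint attention to model spatiotemporal interactions between body parts, capturing the coordinated movement patterns of seizures.
Details
Motivation: Existing video-based seizure detection methods generalize poorly to unseen subjects because of background bias and reliance on subject-specific appearance cues. This paper aims to improve cross-subject generalization by focusing on body dynamics.
Result: In extensive cross-subject experiments, the method consistently outperforms state-of-the-art CNN-, graph-, and transformer-based approaches on unseen subjects.
Insight: The novelty is a joint-centric attention model that improves generalization by suppressing background and focusing on interactions between joint dynamics. Viewed objectively, the cross-joint attention mechanism models seizure-specific coordinated movement effectively, which reduces overfitting and improves transferability.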
Abstract: Automated seizure detection from long-term clinical videos can substantially reduce manual review time and enable real-time monitoring. However, existing video-based methods often struggle to generalize to unseen subjects due to background bias and reliance on subject-specific appearance cues. We propose a joint-centric attention model that focuses exclusively on body dynamics to improve cross-subject generalization. For each video segment, body joints are detected and joint-centered clips are extracted, suppressing background context. These joint-centered clips are tokenized using a Video Vision Transformer (ViViT), and cross-joint attention is learned to model spatial and temporal interactions between body parts, capturing coordinated movement patterns characteristic of seizure semiology. Extensive cross-subject experiments show that the proposed method consistently outperforms state-of-the-art CNN-, graph-, and transformer-based approaches on unseen subjects.
[37] Re-Prompting SAM 3 via Object Retrieval: 3rd of the 5th PVUW MOSE Track cs.CVPDF
Mingqi Gao, Sijie Li, Jungong Han
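TL;DR: This technical report targets the MOSEv2 track of the PVUW 2026 Challenge and proposes a SAM 3-based automatic re-prompting framework that improves the robustness of complex semi-supervised video object segmentation under target disappearance and reappearance, severe transformation, and strong same-category distractors. The method uses the SAM 3 detector to find same-category candidates in later frames, retrieves reliable target anchors through DINOv3-based object-level matching against a transformation-aware target feature pool, and injects these anchors together with the first-frame mask into the SAM 3 tracker, enabling multi-anchor propagation rather than relying only on the initial prompt.
Details
Motivation: To address the insufficient tracking robustness of complex semi-supervised video object segmentation under target disappearance and reappearance, severe transformation, and strong same-category distractors.
Result: The method achieves a J&F of 51.17% on the MOSEv2 test set, ranking 3rd in the track.
Insight: The novelty lies in combining object detection with DINOv3-based feature matching, building a transformation-aware target feature pool for reliable anchor retrieval, and using multi-anchor re-prompting to make SAM 3 tracking more robust in complex scenes. Viewed objectively, pairing a general-purpose segmentation model (SAM) with general-purpose visual features (DINO) gives a simple and effective prompt-enhancement strategy for semi-supervised video segmentation.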
Abstract: This technical report explores the MOSEv2 track of the PVUW 2026 Challenge, which targets complex semi-supervised video object segmentation. Built on SAM 3, we develop an automatic re-prompting framework to improve robustness under target disappearance and reappearance, severe transformation, and strong same-category distractors. Our method first applies the SAM 3 detector to later frames to identify same-category object candidates, and then performs DINOv3-based object-level matching with a transformation-aware target feature pool to retrieve reliable target anchors. These anchors are injected back into the SAM 3 tracker together with the first-frame mask, enabling multi-anchor propagation rather than relying solely on the initial prompt. This simple design directly addresses several core challenges of MOSEv2. Our solution achieves a J&F of 51.17% on the test set, ranking 3rd in the MOSEv2 track.
[38] Sparse Autoencoders for Interpretable Medical Image Representation Learning cs.CV | cs.LGPDF
Philipp Wesp, Robbie Holland, Vasiliki Sideri-Lampretsa, Sergios Gatidis
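TL;DR: This study explores using Sparse Autoencoders (SAEs) to replace the abstract latent representations of medical vision foundation models (such as BiomedParse and DINOv3) with human-interpretable sparse features. Training SAEs on embeddings of 909,873 CT and MRI 2D image slices from the TotalSegmentator dataset, the study finds that the sparse features reconstruct the original embeddings with high fidelity, retain downstream performance after a 99.4% dimensionality reduction, and show semantic fidelity and interpretability in image retrieval and language-driven tasks.
Details
Motivation: To address the opacity of latent representations in medical vision foundation models (FMs), so that clinicians can understand and verify what the models encode, improving the interpretability and trustworthiness of medical AI systems.
Result: On the TotalSegmentator dataset, sparse features reconstruct the original embeddings with an $R^2$ of up to 0.941, and only 10 features (a 99.4% dimensionality reduction) recover up to 87.8% of downstream performance. Semantic fidelity is preserved in image retrieval, and LLMs enable automatic language interpretation and zero-shot language-driven image retrieval.
Insight: The novelty lies in applying sparse autoencoders to medical image representation learning, mapping an abstract latent space to interpretable sparse concepts, and using LLMs to bridge clinical language and visual representations. This offers a practical path toward concept-driven, interpretable medical vision systems.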
Abstract: Vision foundation models (FMs) achieve state-of-the-art performance in medical imaging. However, they encode information in abstract latent representations that clinicians cannot interrogate or verify. The goal of this study is to investigate Sparse Autoencoders (SAEs) for replacing opaque FM image representations with human-interpretable, sparse features. We train SAEs on embeddings from BiomedParse (biomedical) and DINOv3 (general-purpose) using 909,873 CT and MRI 2D image slices from the TotalSegmentator dataset. We find that learned sparse features: (a) reconstruct original embeddings with high fidelity ($R^2$ up to 0.941) and recover up to 87.8% of downstream performance using only 10 features (99.4% dimensionality reduction), (b) preserve semantic fidelity in image retrieval tasks, (c) correspond to specific concepts that can be expressed in language using large language model (LLM)-based auto-interpretation, and (d) bridge clinical language and abstract latent representations in zero-shot language-driven image retrieval. Our work indicates SAEs are a promising pathway towards interpretable, concept-driven medical vision systems. Code repository: https://github.com/pwesp/sail.
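To make the reconstruction-fidelity claim concrete, here is a toy sketch of a sparse autoencoder forward pass and the $R^2$ score. It is an illustration under stated assumptions: the top-k sparsity rule, the identity weights, and the helper names (`sae_encode`, `sae_decode`, `r2_score`) are hypothetical, and the abstract does not specify which SAE variant or $R^2$ definition the authors use.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def sae_encode(x, w_enc, b_enc, k):
    """ReLU encoder that keeps only the k largest activations (assumed rule)."""
    acts = [max(0.0, a + b) for a, b in zip(matvec(w_enc, x), b_enc)]
    order = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]


def sae_decode(z, w_dec, b_dec):
    """Linear decoder: rebuild the embedding from the sparse features."""
    return [a + b for a, b in zip(matvec(w_dec, z), b_dec)]


def r2_score(x, x_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(x) / len(x)
    ss_res = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    ss_tot = sum((a - mean) ** 2 for a in x)
    return 1.0 - ss_res / ss_tot


# Toy setup: 3-dim embedding, 3 features, identity weights, zero biases.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zeros = [0.0, 0.0, 0.0]
x = [3.0, 1.0, 2.0]

z = sae_encode(x, identity, zeros, k=2)      # two active features
x_hat = sae_decode(z, identity, zeros)
print(z, r2_score(x, x_hat))
```

Dropping features lowers $R^2$ in this toy case; the abstract's finding is that, with learned features, a small number of active features can still retain most downstream performance.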
[39] See, Remember, Explore: A Benchmark and Baselines for Streaming Spatial Reasoning cs.CVPDF
Yuxi Wei, Wei Huang, Qirui Chen, Lu Hou, Xiaojuan Qi
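TL;DR: This paper presents S3-Bench, a benchmark suite for evaluating embodied agents on streaming spatial reasoning with active exploration, and AMF-VLM, a model that supports long-horizon streaming inference. The benchmark combines a controllable simulator with real-world streaming videos and contains more than 10K scenes and 26K trajectories.
Details
Motivation: Most existing spatial vision-language models and benchmarks are evaluated offline, overlooking two requirements that matter in deployment: long-horizon streaming inference, and active perception when the current view is insufficient.
Result: On the simulated and real splits of S3-Bench, AMF-VLM improves over baselines trained on identical data by 8.8% and 13.3%, respectively, while maintaining competitive transferability to standard spatial benchmarks.
Insight: The contributions are S3-Bench, described as the first benchmark focused on streaming spatial question answering with active exploration, and two core techniques in AMF-VLM: "memory folding", which compresses long-horizon observations, and "active exploration", an action-output module that acquires missing evidence.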
Abstract: Spatial understanding is fundamental for embodied agents, yet most spatial VLMs and benchmarks remain offline, evaluating post-hoc QA over pre-recorded inputs and overlooking two deployment-critical requirements: long-horizon streaming inference and active perception when the current view is insufficient. To address this gap, we introduce S3-Bench, a benchmark suite for streaming spatial question answering with active exploration, where queries are temporally grounded to specific timestamps and must be answered using only observations available up to that moment. S3-Bench adopts a dual-domain design, combining a scalable simulator with controllable trajectories and exploration actions, and real-world streaming videos that capture practical sensing artifacts for rigorous generalization evaluation. Overall, it spans 10K+ scenes and 26K+ trajectories, with dedicated training (S3-Train) and evaluation (S3-Eval) splits. We further propose AMF-VLM, which supports streaming spatial reasoning under bounded computing via (i) memory folding, which compresses long-horizon observations into compact structured memory, and (ii) active exploration, which outputs explicit actions (e.g., move/rotate/scan) to acquire missing evidence before answering. Extensive experiments demonstrate that, compared to models using identical training data, our approach yields improvements of 8.8% and 13.3% on the simulated and real splits of S3-Eval, respectively, while maintaining competitive transferability to standard spatial benchmarks.
[40] MLE-UVAD: Minimal Latent Entropy Autoencoder for Fully Unsupervised Video Anomaly Detection cs.CVPDF
Yuang Geng, Junkai Zhou, Kang Yang, Pan He, Zhuoyang Zhou
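TL;DR: This paper proposes MLE-UVAD (Minimal Latent Entropy autoencoder) for fully unsupervised video anomaly detection. The method trains and tests directly on raw videos that contain both normal and abnormal events, without any labels. Its core is an autoencoder trained with a standard reconstruction loss plus a novel Minimal Latent Entropy loss, so that normal frames reconstruct well and anomalous frames reconstruct poorly, yielding a clear gap in reconstruction error for detection.
Details
Motivation: Existing video anomaly detection methods either need extensive annotation (fully or weakly supervised) or use normal-only videos (one-class classification), and they are vulnerable to distribution shift and contamination. The paper aims for a more robust, fully unsupervised single-scene VAD method that works directly on raw unlabeled video.
Result: Extensive experiments on two widely used benchmarks and a challenging self-collected driving dataset show robust and superior performance over baselines.
Insight: The main novelty is adding the Minimal Latent Entropy loss to the autoencoder framework. Minimizing the entropy of the latent embeddings encourages them to concentrate in high-density regions (normal patterns), which pulls the sparse anomalous embeddings into the normal cluster and makes the decoder reconstruct anomalies poorly. This dual-loss design separates the reconstruction quality of normal and abnormal frames and offers a new approach to fully unsupervised anomaly detection.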
Abstract: In this paper, we address the challenging problem of single-scene, fully unsupervised video anomaly detection (VAD), where raw videos containing both normal and abnormal events are used directly for training and testing without any labels. This differs sharply from prior work that either requires extensive labeling (fully or weakly supervised) or depends on normal-only videos (one-class classification), which are vulnerable to distribution shifts and contamination. We propose an entropy-guided autoencoder that detects anomalies through reconstruction error by reconstructing normal frames well while making anomalies reconstruct poorly. The key idea is to combine the standard reconstruction loss with a novel Minimal Latent Entropy (MLE) loss in the autoencoder. Reconstruction loss alone maps normal and abnormal inputs to distinct latent clusters due to their inherent differences, but also risks reconstructing anomalies too well to detect. The MLE loss addresses this by minimizing the entropy of latent embeddings, encouraging them to concentrate around high-density regions. Since normal frames dominate the raw video, sparse anomalous embeddings are pulled into the normal cluster, so the decoder emphasizes normal patterns and produces poor reconstructions for anomalies. This dual-loss design produces a clear reconstruction gap that enables effective anomaly detection. Extensive experiments on two widely used benchmarks and a challenging self-collected driving dataset demonstrate that our method achieves robust and superior performance over baselines.
[41] BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment cs.CVPDF
Risa Shinoda, Kaede Shiohara, Nakamasa Inoue, Kuniaki Saito, Hiroaki Santo
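TL;DR: BioVITA is a visual-textual-acoustic multimodal alignment framework for biology, comprising a large-scale training dataset, a representation model, and a retrieval benchmark. It targets the difficulty that existing biological models (such as BioCLIP) have with integrating the audio modality: a two-stage training scheme aligns audio representations with visual and textual ones, and a benchmark covering retrieval in every direction across the three modalities validates that the model learns a unified representation space.
Details
Motivation: Existing biological models such as BioCLIP align images with textual taxonomic information well, but integrating audio for species identification remains an open problem. The paper aims to advance multimodal biodiversity understanding by building a visual-textual-acoustic alignment framework.
Result: Experiments show that the model learns a unified representation space that captures species-level semantics beyond taxonomy. It is evaluated extensively on a cross-modal retrieval benchmark covering all directions among the three modalities (image, audio, and text; for example, image-to-audio and audio-to-text) at three taxonomic levels: Family, Genus, and Species.
Insight: The contributions are: 1) a large-scale multimodal (visual, textual, acoustic) biological dataset with 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels; 2) a two-stage training framework built on BioCLIP2 for aligning audio representations with visual and textual representations; 3) a comprehensive cross-modal retrieval benchmark that systematically evaluates retrieval in all directions among the three modalities. Viewed objectively, the work systematically addresses the audio-integration gap in biological multimodal alignment, and its dataset and benchmark are valuable resources for follow-up research.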
Abstract: Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology. While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA involves (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset comprising 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels. Second, building upon BioCLIP2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all possible directional retrieval across the three modalities (i.e., image-to-audio, audio-to-text, text-to-image, and their reverse directions), with three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding. The project page is available at: https://dahlian00.github.io/BioVITA_Page/
[42] Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training cs.CVPDF
Gengluo Li, Chengquan Zhang, Yupu Liang, Huawen Shen, Yaping Zhang
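TL;DR: This paper proposes a data-training co-design framework to improve the robustness of end-to-end document parsing. A realistic scene synthesis strategy builds large-scale, structurally diverse full-page end-to-end supervision, and a document-aware training recipe (progressive learning and structure-token optimization) strengthens structural fidelity and decoding stability.
Details
Motivation: End-to-end document parsing with multimodal large language models produces repetitive, hallucinated, and structurally inconsistent predictions because large-scale, high-quality full-page parsing data and structure-aware training strategies are lacking.
Result: Integrated into a 1B-parameter multimodal large language model, the method achieves superior accuracy and robustness in both scanned/digital and real-world captured scenarios, and it is evaluated on the newly built real-world benchmark Wild-OmniDocBench.
Insight: The novelty is the co-design of data synthesis and training: realistic data is synthesized by composing layout templates with rich document elements to address data scarcity, and progressive learning with structure-token optimization improves the model's understanding of document structure and its generation stability.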
Abstract: Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions, primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.
[43] MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation cs.CVPDF
Gengluo Li, Chengquan Zhang, Yupu Liang, Huawen Shen, Yaping Zhang
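TL;DR: This paper presents MMTIT-Bench, a multilingual, multi-scenario benchmark for evaluating end-to-end text-image machine translation (TIMT), with 1,400 images in 14 languages other than Chinese and English. It also proposes CPR-Trans, a data paradigm that integrates scene cognition, text perception, and translation reasoning to improve VLLM performance on TIMT.
Details
Motivation: Evaluation resources for end-to-end TIMT across diverse visual scenes and low-resource languages are limited, and robustness in these settings is underexplored. The paper also investigates how reasoning-oriented data design can improve translation.
Result: Experiments on 3B and 7B vision-language large models show that the CPR-Trans data paradigm yields consistent gains in translation accuracy and interpretability.
Insight: The novelty lies in building MMTIT-Bench, a comprehensive human-verified benchmark, and in proposing CPR-Trans, a data generation paradigm that unifies visual cognition with translation reasoning. It goes beyond existing sequential cascades and language-only reasoning, giving VLLMs structured, interpretable supervision for TIMT.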
Abstract: End-to-end text-image machine translation (TIMT), which directly translates textual content in images across languages, is crucial for real-world multilingual scene understanding. Despite advances in vision-language large models (VLLMs), robustness across diverse visual scenes and low-resource languages remains underexplored due to limited evaluation resources. We present MMTIT-Bench, a human-verified multilingual and multi-scenario benchmark with 1,400 images spanning fourteen non-English and non-Chinese languages and diverse settings such as documents, scenes, and web images, enabling rigorous assessment of end-to-end TIMT. Beyond benchmarking, we study how reasoning-oriented data design improves translation. Although recent VLLMs have begun to incorporate long Chain-of-Thought (CoT) reasoning, effective thinking paradigms for TIMT are still immature: existing designs either cascade parsing and translation in a sequential manner or focus on language-only reasoning, overlooking the visual cognition central to VLLMs. We propose Cognition-Perception-Reasoning for Translation (CPR-Trans), a data paradigm that integrates scene cognition, text perception, and translation reasoning within a unified reasoning process. Using a VLLM-driven data generation pipeline, CPR-Trans provides structured, interpretable supervision that aligns perception with reasoning. Experiments on 3B and 7B models show consistent gains in accuracy and interpretability. We will release MMTIT-Bench upon acceptance to promote research on multilingual and multi-scenario TIMT.
[44] Knowledge-Refined Dual Context-Aware Network for Partially Relevant Video Retrieval cs.CV | cs.AIPDF
Junkai Yang, Qirui Wang, Yaoqing Jin, Shuai Ma, Minghan Xu
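TL;DR: This paper proposes KDC-Net, a Knowledge-Refined Dual Context-Aware Network, for two challenges in partially relevant video retrieval: the mismatch in information density between text and video segments, and attention mechanisms that overlook semantic focus and event correlations. Working from both the textual and visual sides, it enriches query semantics with a Hierarchical Semantic Aggregation module, highlights key events with a Dynamic Temporal Attention mechanism, and adds a dynamic CLIP-based distillation strategy to improve knowledge transfer.
Details
Motivation: Retrieving partially relevant segments from untrimmed videos is hard. The core obstacles are the information-density mismatch between text and video segments and the limited attention that existing mechanisms pay to semantic focus and event correlations.
Result: On PRVR benchmarks, KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.
Insight: The contributions are: a context-aware network built from both textual and visual perspectives; Hierarchical Semantic Aggregation and Dynamic Temporal Attention, which respectively enrich query semantics and highlight key events; and a dynamic CLIP distillation strategy with temporal-continuity-aware refinement for segment-aware, objective-aligned knowledge transfer. Viewed objectively, the multi-scale fusion and adaptive temporal window designs are useful references for handling video-text misalignment.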
Abstract: Retrieving partially relevant segments from untrimmed videos remains difficult due to two persistent challenges: the mismatch in information density between text and video segments, and limited attention mechanisms that overlook semantic focus and event correlations. We present KDC-Net, a Knowledge-Refined Dual Context-Aware Network that tackles these issues from both textual and visual perspectives. On the text side, a Hierarchical Semantic Aggregation module captures and adaptively fuses multi-scale phrase cues to enrich query semantics. On the video side, a Dynamic Temporal Attention mechanism employs relative positional encoding and adaptive temporal windows to highlight key events with local temporal coherence. Additionally, a dynamic CLIP-based distillation strategy, enhanced with temporal-continuity-aware refinement, ensures segment-aware and objective-aligned knowledge transfer. Experiments on PRVR benchmarks show that KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.
[45] GenMask: Adapting DiT for Segmentation via Direct Mask cs.CVPDF
Yuhuan Yang, Xianwei Zhuang, Yuxuan Cai, Chaofan Ma, Shuai Bai
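TL;DR: This paper proposes GenMask, which folds segmentation into a generative framework by directly generating black-and-white segmentation masks alongside color images. Built on the DiT architecture, it needs no extra feature extraction pipeline, and a timestep sampling strategy designed for binary masks handles the distribution gap between mask latents and natural-image latents.
Details
Motivation: Existing segmentation methods typically use pretrained generative models as feature extractors and adapt them downstream through indirect feature retrieval, which suffers from misaligned representations and complex pipelines. The paper argues that segmentation should be trained directly in a generative manner, but the difference between the latents of binary masks and those of natural images stands in the way.
Result: GenMask attains state-of-the-art performance on referring and reasoning segmentation benchmarks, and ablations quantify the contribution of each component.
Insight: The novelty is a generative segmentation paradigm that generates masks directly. A timestep sampling strategy reconciles the different noise levels suited to mask and image generation, so the DiT architecture can be used as-is under a unified generative objective, simplifying the segmentation pipeline and improving performance.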
Abstract: Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow and limit adaptation. In this paper, we argue that instead of indirect adaptation, segmentation tasks should be trained directly in a generative manner. We identify a key obstacle to this unified formulation: VAE latents of binary masks are sharply distributed, noise robust, and linearly separable, distinct from natural image latents. To bridge this gap, we introduce a timestep sampling strategy for binary masks that emphasizes extreme noise levels for segmentation and moderate noise for image generation, enabling harmonious joint training. We present GenMask, a DiT trained to generate black-and-white segmentation masks as well as colorful images in RGB space under the original generative objective. GenMask preserves the original DiT architecture while removing the need for feature extraction pipelines tailored for segmentation tasks. Empirically, GenMask attains state-of-the-art performance on referring and reasoning segmentation benchmarks, and ablations quantify the contribution of each component.
[46] Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding cs.CV | cs.LGPDF
Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Selim Furkan Tekin, Tiansheng Huang
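TL;DR: This paper proposes AttentionPack, an adaptive, attention-aware optimization framework for large vision-language models (VLMs) that targets the memory overhead of decoding with long sequences of visual and text tokens. Using multi-head attention compaction and token-specific attention-aware decompression, it substantially improves memory efficiency, supporting larger batch sizes and faster inference while preserving output quality.
Details
Motivation: Large vision-language models have been very successful at multimodal reasoning, but their inference-time efficiency is limited by memory overhead during decoding, especially with long sequences of visual and text tokens in long-context tasks that involve multiple high-resolution images or videos.
Result: Across multiple benchmarks, AttentionPack improves memory efficiency by up to 8x, enabling higher batch sizes and faster batch inference while preserving model output quality, or allowing longer context lengths for better retrieval performance. Combined with eviction, quantization, and kernel fusion, it yields further efficiency gains in resource-limited environments.
Insight: The contributions are: (i) a multi-head attention compaction method that exploits implicit low-rank structure to store key and value matrices economically; and (ii) a token-specific attention-aware decompression mechanism that reduces latency overhead. Viewed objectively, optimizing around the attention mechanism gives an efficient memory-management scheme for long multimodal sequences that could transfer to other models processing large-scale sequence data.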
Abstract: Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference time efficiency remains a significant challenge due to the memory overhead during decoding, especially when the query and answer of VLMs consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive and attention-aware optimization framework tailored to large vision-language models that improves memory efficiency during decoding, addressing the challenges posed by the large number of visual inputs and interactions, particularly in long-context tasks with multiple high-resolution images or videos. AttentionPack is novel in two aspects: (i) We introduce a multi-head attention compaction method for economically storing key and value matrices by exploiting the implicit low-rank structure, and (ii) we develop a token-specific attention-aware decompression mechanism to reduce latency overhead. Experimental results on multiple benchmarks demonstrate that AttentionPack improves memory efficiency by up to 8x, enabling higher batch sizes and faster batch inference while preserving the model output quality, or enabling longer context lengths for superior retrieval performance. We also report the effectiveness of AttentionPack combined with eviction, quantization and kernel fusion, showing further efficiency gains for resource-limited environments.
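A back-of-envelope calculation shows why low-rank structure can save memory on cached key and value matrices. This sketch assumes a plain rank-r factorization of an n-by-d matrix into an n-by-r and an r-by-d factor; the abstract does not describe AttentionPack's actual storage layout, and the token count, head dimension, and function names here are hypothetical.

```python
def low_rank_compression_ratio(n_tokens, head_dim, rank):
    """Dense storage (n*d values) divided by factored storage (n*r + r*d)."""
    dense = n_tokens * head_dim
    factored = rank * (n_tokens + head_dim)
    return dense / factored


def max_rank_for_ratio(n_tokens, head_dim, target_ratio):
    """Largest rank whose factored form is at least target_ratio times smaller."""
    return int(n_tokens * head_dim / (target_ratio * (n_tokens + head_dim)))


# Illustrative numbers only: 4096 cached tokens, head dimension 128.
n_tokens, head_dim = 4096, 128
rank = max_rank_for_ratio(n_tokens, head_dim, target_ratio=8)
print(rank, low_rank_compression_ratio(n_tokens, head_dim, rank))
```

With these illustrative numbers the rank has to stay well below the head dimension to reach an 8x saving, so any such scheme depends on the cached matrices actually being close to low rank.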
[47] DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning cs.CV | cs.AIPDF
Jiajian Huang, Dongliang Zhu, Zitong YU, Hui Ma, Jiayu Zhang
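TL;DR: This paper proposes DecepGPT, a framework for multimodal deception detection. It builds datasets with structured cue descriptions and reasoning chains, releases the multicultural dataset T4-Deception, and designs two modules, SICS and DMC, to improve robustness and interpretability under small-data conditions, reaching state-of-the-art performance on several benchmarks.
Details
Motivation: Existing multimodal deception detection lacks verifiable intermediate reasoning cues, relies on small datasets with limited scenario coverage that encourage shortcut learning, and generalizes poorly across domains and cultural contexts.
Result: On three established benchmarks and the newly proposed T4-Deception dataset, the method achieves state-of-the-art performance in both in-domain and cross-domain scenarios and shows superior transferability across cultural contexts.
Insight: The contributions are: 1) augmenting datasets with structured cue descriptions and reasoning chains so that models can output auditable reports; 2) releasing T4-Deception, a large non-laboratory deception detection dataset based on a television format produced in several countries; 3) the SICS module, which refines multimodal representations by combining learnable global priors with sample-adaptive residuals, followed by a polarity-aware adjustment; 4) the DMC module, which uses knowledge distillation to align unimodal predictions with multimodal predictions and prevent unimodal shortcut learning.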
Abstract: Multimodal deception detection aims to identify deceptive behavior by analyzing audiovisual cues for forensics and security. In these high-stakes settings, investigators need verifiable evidence connecting audiovisual cues to final decisions, along with reliable generalization across domains and cultural contexts. However, existing benchmarks provide only binary labels without intermediate reasoning cues. Datasets are also small with limited scenario coverage, leading to shortcut learning. We address these issues through three contributions. First, we construct reasoning datasets by augmenting existing benchmarks with structured cue-level descriptions and reasoning chains, enabling models to output auditable reports. Second, we release T4-Deception, a multicultural dataset based on the unified "To Tell The Truth" television format implemented across four countries. With 1695 samples, it is the largest non-laboratory deception detection dataset. Third, we propose two modules for robust learning under small-data conditions. Stabilized Individuality-Commonality Synergy (SICS) refines multimodal representations by synergizing learnable global priors with sample-adaptive residuals, followed by a polarity-aware adjustment that bi-directionally recalibrates representations. Distilled Modality Consistency (DMC) aligns modality-specific predictions with the fused multimodal predictions via knowledge distillation to prevent unimodal shortcut learning. Experiments on three established benchmarks and our novel dataset demonstrate that our method achieves state-of-the-art performance in both in-domain and cross-domain scenarios, while exhibiting superior transferability across diverse cultural contexts. The datasets and code will be released.
[48] Uncertainty-Aware Vision-based Risk Object Identification via Conformal Risk Tube Prediction cs.CVPDF
Kai-Yu Fu, Yi-Ting Chen
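TL;DR: This paper proposes a vision-based risk object identification method that models spatiotemporal uncertainty through a Conformal Risk Tube Prediction framework, aiming to make hazard detection in intelligent driving systems more robust. By providing coverage guarantees and calibrated risk scores, it addresses the premature or delayed detections that deterministic methods can produce in ambiguous scenarios.
Details
Motivation: Existing vision-based risk object identification methods make deterministic decisions and ignore uncertainty, which can lead to safety-critical failures. In complex scenes with multiple interacting risks in particular, fixed decision thresholds cause temporally unstable predictions.
Result: In a systematic evaluation on the proposed new dataset and metrics, the method improves substantially over prior approaches, enhancing the robustness of vision-based risk object identification and downstream performance, for example by reducing nuisance braking alerts.
Insight: The novelty is a unified Conformal Risk Tube Prediction framework that jointly models risk uncertainty across space and time and provides coverage guarantees. The study also systematically analyzes factors that affect uncertainty estimation, such as scenario variations and perception error propagation, offering a new perspective on uncertainty modeling.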
Abstract: We study object importance-based vision risk object identification (Vision-ROI), a key capability for hazard detection in intelligent driving systems. Existing approaches make deterministic decisions and ignore uncertainty, which could lead to safety-critical failures. Specifically, in ambiguous scenarios, fixed decision thresholds may cause premature or delayed risk detection and temporally unstable predictions, especially in complex scenes with multiple interacting risks. Despite these challenges, current methods lack a principled framework to model risk uncertainty jointly across space and time. We propose Conformal Risk Tube Prediction, a unified formulation that captures spatiotemporal risk uncertainty, provides coverage guarantees for true risks, and produces calibrated risk scores with uncertainty estimates. To conduct a systematic evaluation, we present a new dataset and metrics probing diverse scenario configurations with multi-risk coupling effects, which are not supported by existing datasets. We systematically analyze factors affecting uncertainty estimation, including scenario variations, per-risk category behavior, and perception error propagation. Our method delivers substantial improvements over prior approaches, enhancing vision-ROI robustness and downstream performance, such as reducing nuisance braking alerts. For more qualitative results, please visit our project webpage: https://hcis-lab.github.io/CRTP/
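For readers unfamiliar with conformal methods, the coverage guarantee mentioned above can be illustrated with standard split conformal prediction. This is a generic textbook sketch, not the paper's Conformal Risk Tube formulation: the nonconformity score (one minus the predicted risk probability), the calibration values, and the object names are all assumptions made for the example.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Threshold q such that a fresh score falls at or below q with
    probability at least 1 - alpha (split conformal, exchangeable data)."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]


def risk_set(object_probs, q):
    """Flag every object whose nonconformity score 1 - p is within q."""
    return [name for name, p in object_probs.items() if 1.0 - p <= q]


# Calibration: nonconformity scores of the true risk objects in held-out scenes.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
q = conformal_quantile(cal_scores, alpha=0.5)

# Test scene: hypothetical predicted risk probabilities per tracked object.
scene = {"car_12": 0.9, "pedestrian_3": 0.3}
print(q, risk_set(scene, q))
```

The threshold comes from calibration data rather than being fixed by hand, which is the property that distinguishes conformal approaches from the fixed decision thresholds the abstract criticizes.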
[49] DP^2-VL: Private Photo Dataset Protection by Data Poisoning for Vision-Language Models cs.CVPDF
Hongyi Miao, Jun Jia, Xincheng Wang, Qianli Ma, Wei Sun
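TL;DR: This paper proposes a new privacy threat model for vision-language models, identity-affiliation learning: with only a few private photos of a target individual, an attacker fine-tunes a VLM so that associations between the target's facial identity and their private property and social relationships are embedded in the model's internal representations, and the model then exposes that private information when given the target's photos through a public API. To evaluate how vulnerable VLMs are to this leakage, the authors build the first identity-affiliation dataset, covering seven typical scenarios in private photos. Experiments show that mainstream VLMs (such as LLaVA, Qwen-VL, and MiniGPT-v2) learn identity recognition and affiliation inference when fine-tuned on small private photo sets, and even on synthetic datasets. To mitigate the risk, the authors propose DP2-VL, the first data-poisoning-based protection framework for private photo datasets. It optimizes imperceptible perturbations that push the original representations toward an antithetical region, inducing a dataset-level shift in the embedding space of the VLM encoders. This separates protected images from clean inference images, so fine-tuning on the protected set overfits.
Details
Motivation: Improved fine-grained image understanding in vision-language models brings a new privacy risk: an attacker can fine-tune a model on a few private photos to embed associations between a target's identity and their private attributes, then leak them through a public API. The paper aims to quantify this risk and build a protection framework.
Result: Experiments on the constructed identity-affiliation dataset show that mainstream VLMs, after fine-tuning on small sets of private photos, can recognize facial identities and infer identity-affiliation relationships. The proposed DP2-VL framework performs well in cross-model generalization, robustness to post-processing operations, and effectiveness across different protection ratios.
Insight: The contributions are: 1) the first formulation of identity-affiliation learning as a privacy threat model, with a corresponding dataset; 2) DP2-VL, a data-poisoning-based privacy protection framework for VLMs that achieves dataset-level protection through perturbations in embedding space; 3) evidence that synthetic data can also cause privacy leakage, which broadens the threat scenario. Viewed objectively, the study applies data poisoning creatively to VLM privacy protection and offers a new perspective on adversarial defenses.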
Abstract: Recent advances in visual-language alignment have endowed vision-language models (VLMs) with fine-grained image understanding capabilities. However, this progress also introduces new privacy risks. This paper first proposes a novel privacy threat model named identity-affiliation learning: an attacker fine-tunes a VLM using only a few private photos of a target individual, thereby embedding associations between the target facial identity and their private property and social relationships into the model’s internal representations. Once deployed via public APIs, this model enables unauthorized exposure of the target user’s private information upon input of their photos. To benchmark VLMs’ susceptibility to such identity-affiliation leakage, we introduce the first identity-affiliation dataset comprising seven typical scenarios appearing in private photos. Each scenario is instantiated with multiple identity-centered photo-description pairs. Experimental results demonstrate that mainstream VLMs like LLaVA, Qwen-VL, and MiniGPT-v2 can recognize facial identities and infer identity-affiliation relationships by fine-tuning on small-scale private photographic datasets, and even on synthetically generated datasets. To mitigate this privacy risk, we propose DP2-VL, the first Dataset Protection framework for private photos that leverages Data Poisoning. Through optimizing imperceptible perturbations by pushing the original representations toward an antithetical region, DP2-VL induces a dataset-level shift in the embedding space of VLMs’ encoders. This shift separates protected images from clean inference images, causing fine-tuning on the protected set to overfit. Extensive experiments demonstrate that DP2-VL achieves strong generalization across models, robustness to diverse post-processing operations, and consistent effectiveness across varying protection ratios.
[50] Revealing Multi-View Hallucination in Large Vision-Language Models cs.CV | cs.AI
Wooje Park, Insu Lee, Soohyun Kim, Jaeyun Jang, Minyoung Noh
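TL;DR: This paper reveals “multi-view hallucination” in large vision-language models that process multi-view image inputs: the models confuse or mismatch visual information originating from different instances or viewpoints. The authors build the MVH-Bench benchmark and propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique, to mitigate the problem; experiments show that it improves model performance substantially.
Details
Motivation: Current large vision-language models often confuse visual information from different instances or viewpoints when processing multi-view images, i.e., they exhibit multi-view hallucination, which limits their reliability in multi-view settings.
Result: On MVH-Bench, RSCD improves over existing hallucination mitigation methods by up to 21.1 points with Qwen2.5-VL and up to 34.6 points with LLaVA-OneVision.
Insight: The paper is the first to systematically define and quantify multi-view hallucination, and it proposes a decoding strategy that needs no additional training and suppresses visual interference by generating negative logits through attention masking, offering a new route to more robust LVLMs in multi-view settings.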
Abstract: Large vision-language models (LVLMs) are increasingly being applied to multi-view image inputs captured from diverse viewpoints. However, despite this growing use, current LVLMs often confuse or mismatch visual information originating from different instances or viewpoints, a phenomenon we term multi-view hallucination. To systematically analyze this problem, we construct MVH-Bench, a benchmark comprising 4.8k question-answer pairs targeting two types of hallucination: cross-instance and cross-view. Empirical results show that recent LVLMs struggle to correctly associate visual evidence with its corresponding instance or viewpoint. To overcome this limitation, we propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that suppresses visual interference by generating negative logits through attention masking. Experiments on MVH-Bench with Qwen2.5-VL and LLaVA-OneVision demonstrate that RSCD consistently improves performance by up to 21.1 and 34.6 points over existing hallucination mitigation methods, highlighting the effectiveness of our approach.
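A minimal sketch of the general contrastive-decoding idea that RSCD builds on, assuming the common formulation `(1 + alpha) * positive - alpha * negative`. The abstract does not give RSCD's exact formula, so `alpha`, the function names, and the toy logits below are hypothetical; in RSCD the negative logits would come from a forward pass with attention masking.

```python
import math

def contrastive_logits(pos_logits, neg_logits, alpha=1.0):
    """Amplify what the standard pass supports and subtract what the negative pass supports."""
    return [(1.0 + alpha) * p - alpha * n for p, n in zip(pos_logits, neg_logits)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary (hypothetical): token 0 is grounded in the queried view,
# token 1 leaks in from a different view, token 2 is unrelated.
pos = [2.0, 2.1, 0.0]  # standard pass: interference makes token 1 slightly preferred
neg = [0.5, 2.0, 0.0]  # assumed negative pass: dominated by the interfering evidence

adjusted = contrastive_logits(pos, neg, alpha=1.0)
probs = softmax(adjusted)
best = max(range(len(probs)), key=probs.__getitem__)
print(adjusted, best)
```

In this toy case the interfering token wins under the standard logits but loses after the contrast, because the negative pass assigns it a high score that is subtracted out.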
[51] VOLMO: Versatile and Open Large Models for Ophthalmology cs.CV | cs.ET
Zhenyue Qin, Younjoon Chung, Elijah Lee, Wanyue Feng, Xuguang Ai
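TL;DR: This paper presents VOLMO (Versatile and Open Large Models for Ophthalmology), a model-agnostic, data-open framework for developing ophthalmology-specific multimodal large language models. The framework has three stages: ophthalmology knowledge pretraining on 86,965 image-text pairs; domain task fine-tuning on 26,929 annotated instances covering screening and severity classification for 12 eye conditions; and multi-step clinical reasoning on 913 patient case reports. The authors train a compact 2B-parameter model with this framework and evaluate it on image description generation, disease screening and staging classification, and assessment-and-management generation; VOLMO-2B outperforms several baselines across these tasks and in external validation.
Details
Motivation: Vision impairment affects millions of people worldwide, and early detection is critical to preventing irreversible vision loss. Ophthalmology workflows require integrating medical images, structured clinical data, and free-text notes, which is time-consuming and burdensome. Existing general and medical multimodal large language models perform poorly in ophthalmology, and few ophthalmology-specific models are openly available.
Result: On image description generation, disease screening and staging classification for 12 eye conditions (average F1 of 87.4%), and assessment-and-management generation, VOLMO-2B consistently outperforms baselines including InternVL-2B, LLaVA-Med-7B, MedGemma-4B/27B, and RETFound, and it achieves higher scores in external validation on three independent cohorts for age-related macular degeneration and diabetic retinopathy.
Insight: The contribution is a model-agnostic, data-open, three-stage framework (knowledge pretraining, domain task fine-tuning, clinical reasoning) built specifically for ophthalmology multimodal models. What transfers to other work is the systematic, staged development paradigm for a vertical domain such as ophthalmology, and the demonstration that a compact 2B-parameter model can outperform larger general or medical models on specialist tasks.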
Abstract: Vision impairment affects millions globally, and early detection is critical to preventing irreversible vision loss. Ophthalmology workflows require clinicians to integrate medical images, structured clinical data, and free-text notes to determine disease severity and management, which is time-consuming and burdensome. Recent multimodal large language models (MLLMs) show promise, but existing general and medical MLLMs perform poorly in ophthalmology, and few ophthalmology-specific MLLMs are openly available. We present VOLMO (Versatile and Open Large Models for Ophthalmology), a model-agnostic, data-open framework for developing ophthalmology-specific MLLMs. VOLMO includes three stages: ophthalmology knowledge pretraining on 86,965 image-text pairs from 26,569 articles across 82 journals; domain task fine-tuning on 26,929 annotated instances spanning 12 eye conditions for disease screening and severity classification; and multi-step clinical reasoning on 913 patient case reports for assessment, planning, and follow-up care. Using this framework, we trained a compact 2B-parameter MLLM and compared it with strong baselines, including InternVL-2B, LLaVA-Med-7B, MedGemma-4B, MedGemma-27B, and RETFound. We evaluated these models on image description generation, disease screening and staging classification, and assessment-and-management generation, with additional manual review by two healthcare professionals and external validation on three independent cohorts for age-related macular degeneration and diabetic retinopathy. Across settings, VOLMO-2B consistently outperformed baselines, achieving stronger image description performance, an average F1 of 87.4% across 12 eye conditions, and higher scores in external validation.
[52] SynMVCrowd: A Large Synthetic Benchmark for Multi-view Crowd Counting and Localization cs.CV
Qi Zhang, Daijie Chen, Yunfei Gong, Hui Huang
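TL;DR: This paper proposes SynMVCrowd, a large synthetic benchmark for evaluating multi-view crowd counting and localization. It contains 50 synthetic scenes with many multi-view frames and camera views and large crowds of up to 1000 people, and is meant to address the overfitting and impractical evaluation that come with existing small datasets. The paper also proposes strong baselines that outperform all compared methods on the benchmark and shows that the dataset improves domain transfer to novel real scenes.
Details
Motivation: Existing multi-view crowd counting and localization methods are usually evaluated on datasets with small scenes, limited crowd sizes, and few camera views and frames, so methods overfit easily and evaluation and comparison become impractical. The paper aims to build a large synthetic benchmark that better matches practical needs.
Result: The proposed multi-view crowd localization and counting baselines outperform all compared methods on SynMVCrowd. Experiments also show that, with the aid of the benchmark, models achieve better domain transfer on novel real scenes, improving both multi-view and single-image crowd counting.
Insight: The main contribution is SynMVCrowd, a larger synthetic benchmark closer to practical application scenarios, intended to move multi-view and single-image crowd counting and localization research in a more practical direction. Viewed objectively, synthesis sidesteps the scale and diversity limits of real data collection and gives model training and evaluation a more reliable platform.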
Abstract: Existing multi-view crowd counting and localization methods are evaluated under relatively small scenes with limited crowd numbers, camera views, and frames. This makes the evaluation and comparison of existing methods impractical, as small datasets are easily overfit by these methods. To avoid these issues, 3DROM proposes a data augmentation method. Instead, in this paper, we propose a large synthetic benchmark, SynMVCrowd, for more practical evaluation and comparison of multi-view crowd counting and localization tasks. The SynMVCrowd benchmark consists of 50 synthetic scenes with a large number of multi-view frames and camera views and a much larger crowd number (up to 1000), which is more suitable for large-scene multi-view crowd vision tasks. Besides, we propose strong multi-view crowd localization and counting baselines that outperform all comparison methods on the new SynMVCrowd benchmark. Moreover, we demonstrate that, with the aid of the benchmark, better domain-transfer performance for multi-view and single-image counting can be achieved on novel real scenes. As a result, the proposed benchmark could advance the research for multi-view and single-image crowd counting and localization to more practical applications. The code and datasets are available at https://github.com/zqyq/SynMVCrowd.
[53] PointRFT: Explicit Reinforcement Fine-tuning for Point Cloud Few-shot Learning cs.CV
Yankai Wang, Yiding Sun, Qirui Wang, Pengbo Li, Chaoyi Lu
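TL;DR: This paper proposes PointRFT, the first reinforcement fine-tuning paradigm designed for point cloud representation learning. With purpose-built accuracy and dispersion reward functions, it trains stably on few-shot classification and outperforms conventional supervised fine-tuning; combined with pretraining and supervised fine-tuning, it reaches SOTA performance, particularly in data-scarce scenarios.
Details
Motivation: To explore the potential of reinforcement learning methods such as GRPO in 3D point cloud perception and to answer the key question of whether RL can effectively empower 3D point cloud fine-tuning.
Result: PointRFT consistently outperforms conventional supervised fine-tuning (SFT) across multiple few-shot classification benchmarks; when integrated into a hybrid Pretraining-SFT-RFT paradigm, it achieves state-of-the-art (SOTA) performance in data-scarce scenarios.
Insight: The novelty lies in bringing reinforcement learning to point cloud fine-tuning and designing point-cloud-specific reward functions (accuracy and dispersion rewards) to stabilize training. Viewed objectively, the hybrid paradigm effectively unlocks the representational capacity of foundation models and offers a new approach to data-efficient 3D learning.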
Abstract: Understanding spatial dynamics and semantics in point clouds is fundamental to comprehensive 3D understanding. While reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) have recently achieved remarkable breakthroughs in large language models by incentivizing reasoning capabilities through strategic reward design, their potential remains largely unexplored in the 3D perception domain. This naturally raises a pivotal question: Can RL-based methods effectively empower 3D point cloud fine-tuning? In this paper, we propose PointRFT, the first reinforcement fine-tuning paradigm tailored specifically for point cloud representation learning. We select three prevalent 3D foundation models and devise specialized accuracy reward and dispersion reward functions to stabilize training and mitigate distribution shifts. Through comprehensive few-shot classification experiments comparing distinct training paradigms, we demonstrate that PointRFT consistently outperforms vanilla supervised fine-tuning (SFT) across diverse benchmarks. Furthermore, when organically integrated into a hybrid Pretraining-SFT-RFT paradigm, the representational capacity of point cloud foundation models is substantially unleashed, achieving state-of-the-art performance particularly under data-scarce scenarios.
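A minimal sketch of the group-relative advantage step that gives GRPO its name, assuming the usual formulation in which each sampled output's reward is normalized by the mean and standard deviation of its group. The reward values are hypothetical stand-ins: the abstract names accuracy and dispersion rewards but does not define them, and the use of the population standard deviation and of `eps` is an assumption. The full GRPO objective (clipped policy ratio, KL term) is not shown.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled predictions on one point cloud:
# 1.0 for a correct class, 0.0 for an incorrect one.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
print(flat)
```

Because advantages are computed relative to the group, no separate value network is needed, which is one reason the approach is attractive for fine-tuning with simple verifiable rewards.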
[54] Leave No Stone Unturned: Uncovering Holistic Audio-Visual Intrinsic Coherence for Deepfake Detection cs.CV
Jielun Peng, Yabin Wang, Yaqi Li, Long Kong, Xiaopeng Hong
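TL;DR: This paper proposes HAVIC, a deepfake detection method that pre-trains on authentic videos to learn priors of intra-modal structural coherence and inter-modal micro- and macro-coherence, then performs holistic adaptive aggregation on top of these priors to dynamically fuse audio-visual features for detection. The authors also build HiFi-AVDF, a high-fidelity audio-visual deepfake dataset.
Details
Motivation: Most existing deepfake detectors rely on either uni-modal artifacts or audio-visual discrepancies and fail to exploit both sources jointly, and detectors that depend on generator-specific artifacts generalize worse to unseen forgeries. The authors argue that robust, generalizable detection should be grounded in intrinsic audio-visual coherence within and across modalities.
Result: Extensive experiments on several benchmarks show that HAVIC significantly outperforms existing state-of-the-art methods, improving average precision (AP) by 9.39% and area under the curve (AUC) by 9.37% in the most challenging cross-dataset scenario.
Insight: The contribution is HAVIC, a detection framework based on holistic audio-visual intrinsic coherence (intra-modal structural coherence plus inter-modal micro- and macro-coherence), together with a dynamic feature fusion mechanism. The high-quality HiFi-AVDF dataset should also help advance research in the area. Viewed objectively, combining coherence-prior learning with adaptive feature aggregation offers a new way to improve detector generalization.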
Abstract: The rapid progress of generative AI has enabled hyper-realistic audio-visual deepfakes, intensifying threats to personal security and social trust. Most existing deepfake detectors rely either on uni-modal artifacts or audio-visual discrepancies, failing to jointly leverage both sources of information. Moreover, detectors that rely on generator-specific artifacts tend to exhibit degraded generalization when confronted with unseen forgeries. We argue that robust and generalizable detection should be grounded in intrinsic audio-visual coherence within and across modalities. Accordingly, we propose HAVIC, a Holistic Audio-Visual Intrinsic Coherence-based deepfake detector. HAVIC first learns priors of modality-specific structural coherence and inter-modal micro- and macro-coherence by pre-training on authentic videos. Based on the learned priors, HAVIC further performs holistic adaptive aggregation to dynamically fuse audio-visual features for deepfake detection. Additionally, we introduce HiFi-AVDF, a high-fidelity audio-visual deepfake dataset featuring both text-to-video and image-to-video forgeries from state-of-the-art commercial generators. Extensive experiments across several benchmarks demonstrate that HAVIC significantly outperforms existing state-of-the-art methods, achieving improvements of 9.39% AP and 9.37% AUC on the most challenging cross-dataset scenario. Our code and dataset are available at https://github.com/tuffy-studio/HAVIC.
[55] SLAT-Phys: Fast Material Property Field Prediction from Structured 3D Latents cs.CV | cs.GR | cs.RO
Rocktim Jyoti Das, Dinesh Manocha
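TL;DR: SLAT-Phys is an end-to-end method that predicts spatially varying material property fields of 3D assets (Young’s modulus, density, and Poisson’s ratio) directly from a single RGB image, without explicit 3D reconstruction. It uses spatially organized latent features from a pretrained 3D asset generation model together with a lightweight neural decoder, maintaining competitive prediction accuracy while greatly reducing computation time.
Details
Motivation: Existing vision-based methods are either computationally expensive and slow or dependent on 3D information, so a method is needed that predicts the material property field of a 3D asset quickly and accurately from a single RGB image, in support of physics simulation, robotics, and digital twin applications.
Result: Experiments show that SLAT-Phys predicts continuous material parameters with accuracy competitive with prior methods, while requiring only 9.9 seconds per object on an NVIDIA RTX A5000 GPU and avoiding reconstruction and voxelization preprocessing, for a 120x speedup.
Insight: The novelty is using the structured latent features of a pretrained 3D generative model, which encode rich geometric and semantic priors, as input and thereby avoiding an expensive 3D reconstruction step. Viewed objectively, reusing a generative model’s latent space for downstream physical property prediction is an efficient form of cross-task transfer learning.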
Abstract: Estimating the material property field of 3D assets is critical for physics-based simulation, robotics, and digital twin generation. Existing vision-based approaches are either too expensive and slow or rely on 3D information. We present SLAT-Phys, an end-to-end method that predicts spatially varying material property fields of 3D assets directly from a single RGB image without explicit 3D reconstruction. Our approach leverages spatially organised latent features from a pretrained 3D asset generation model that encodes rich geometric and semantic priors, and trains a lightweight neural decoder to estimate Young’s modulus, density, and Poisson’s ratio. The coarse volumetric layout and the semantic cues that the latent representation carries about object geometry and appearance enable accurate material estimation. Our experiments demonstrate that our method provides competitive accuracy in predicting continuous material parameters when compared against prior approaches, while significantly reducing computation time. In particular, SLAT-Phys requires only 9.9 seconds per object on an NVIDIA RTX A5000 GPU and avoids reconstruction and voxelization preprocessing. This results in a 120x speedup compared to prior methods and enables faster material property estimation from a single image.
[56] HGGT: Robust and Flexible 3D Hand Mesh Reconstruction from Uncalibrated Images cs.CV
Yumeng Liu, Xiao-Xiao Long, Marc Habermann, Xuanze Yang, Cheng Lin
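TL;DR: This paper proposes HGGT, a new method for reconstructing high-fidelity 3D hand meshes from uncalibrated images. Using a feed-forward architecture, it is the first to jointly infer 3D hand meshes and camera poses, addressing the depth ambiguity and occlusion of single-view methods while avoiding the fixed, calibrated setups that multi-view systems require, and so balancing accuracy against deployment flexibility.
Details
Motivation: Recovering high-fidelity 3D hand geometry from images is important for robotics, animation, VR/AR, and related fields. Existing methods face a dilemma: single-view methods are easy to deploy but suffer from depth ambiguity and occlusion, while multi-view systems resolve these uncertainties but usually need fixed, calibrated setups that limit practical use. Drawing inspiration from 3D foundation models that learn explicit geometry directly from visual data, the paper aims to close this gap with robust, flexible hand reconstruction from arbitrary uncalibrated views.
Result: Extensive evaluations show that the method outperforms state-of-the-art (SOTA) methods on standard benchmarks and generalizes strongly to uncalibrated, in-the-wild scenarios.
Insight: The core contribution is reformulating hand reconstruction from arbitrary views as a visual-geometry grounded task and proposing the first feed-forward architecture that jointly infers 3D hand meshes and camera poses. This gives a novel, practical way to handle uncalibrated, unstructured image data and increases deployment flexibility.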
Abstract: Recovering high-fidelity 3D hand geometry from images is a critical task in computer vision, holding significant value for domains such as robotics, animation and VR/AR. Crucially, scalable applications demand both accuracy and deployment flexibility, requiring the ability to leverage massive amounts of unstructured image data from the internet or enable deployment on consumer-grade RGB cameras without complex calibration. However, current methods face a dilemma. While single-view approaches are easy to deploy, they suffer from depth ambiguity and occlusion. Conversely, multi-view systems resolve these uncertainties but typically demand fixed, calibrated setups, limiting their real-world utility. To bridge this gap, we draw inspiration from 3D foundation models that learn explicit geometry directly from visual data. By reformulating hand reconstruction from arbitrary views as a visual-geometry grounded task, we propose a feed-forward architecture that, for the first time in the literature, jointly infers 3D hand meshes and camera poses from uncalibrated views. Extensive evaluations show that our approach outperforms state-of-the-art methods on standard benchmarks and demonstrates strong generalization to uncalibrated, in-the-wild scenarios. Project page: https://lym29.github.io/HGGT/.
[57] UW-VOS: A Large-Scale Dataset for Underwater Video Object Segmentation cs.CV
Hongshen Zhao, Jingkang Tai, Yuhang Wu, Wenkang Zhang, Xi Lan
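TL;DR: This paper introduces UW-VOS, the first large-scale underwater video object segmentation dataset, with 1,431 video sequences, 409 categories, and 309,295 mask annotations. Building on it, the paper proposes SAM-U, a parameter-efficient framework that adapts SAM2 to the underwater domain by inserting lightweight adapters, effectively narrowing the domain gap.
Details
Motivation: Underwater video object segmentation is essential for marine exploration, but existing methods degrade significantly under underwater color distortion, low contrast, and prevalent camouflage; the main obstacle is the lack of high-quality training data.
Result: SAM-U achieves state-of-the-art performance on UW-VOS with only about 2% trainable parameters. Experiments show that existing methods lose 13 points of J&F on average on UW-VOS, while SAM-U effectively bridges this domain gap.
Insight: The contributions are the first large-scale, high-quality underwater VOS benchmark and the parameter-efficient adapter framework SAM-U. Viewed objectively, the data-engine construction method and the lightweight domain adaptation strategy are worth borrowing, and the attribute-based analysis points the way for future research on robust underwater perception.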
Abstract: Underwater Video Object Segmentation (VOS) is essential for marine exploration, yet open-air methods suffer significant degradation due to color distortion, low contrast, and prevalent camouflage. A primary hurdle is the lack of high-quality training data. To bridge this gap, we introduce $\textbf{UW-VOS}$, the first large-scale underwater VOS benchmark comprising 1,431 video sequences across 409 categories with 309,295 mask annotations, constructed via a semi-automatic data engine with rigorous human verification. We further propose $\textbf{SAM-U}$, a parameter-efficient framework that adapts SAM2 to the underwater domain. By inserting lightweight adapters into the image encoder, SAM-U achieves state-of-the-art performance with only $\sim 2\%$ trainable parameters. Extensive experiments reveal that existing methods experience an average 13-point $\mathcal{J}\&\mathcal{F}$ drop on UW-VOS, while SAM-U effectively bridges this domain gap. Detailed attribute-based analysis further identifies small targets, camouflage, and exit-re-entry as critical bottlenecks, providing a roadmap for future research in robust underwater perception.
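A minimal sketch of how the J&F score is typically formed, assuming UW-VOS follows the DAVIS convention in which J is region similarity (mask intersection over union), F is a boundary F-measure, and J&F is their mean. Only J is computed here; the F value, the toy masks, and the choice to score two empty masks as 1.0 are hypothetical.

```python
def region_similarity(pred, gt):
    """Jaccard index J: intersection over union of two binary masks (lists of rows)."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union

def j_and_f(j, f):
    """Mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0

pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
gt = [[0, 1, 1],
      [0, 1, 0],
      [0, 0, 0]]

j = region_similarity(pred, gt)  # 3 shared pixels over 4 in the union
score = j_and_f(j, 0.65)         # 0.65 is a placeholder contour accuracy
print(j, score)
```

On this scale, the 13-point average drop reported in the abstract corresponds to 0.13 in a score that ranges from 0 to 1.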
[58] COVTrack++: Learning Open-Vocabulary Multi-Object Tracking from Continuous Videos via a Synergistic Paradigm cs.CV | cs.LG
Zekun Qian, Wei Feng, Ruize Han, Junhui Hou
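TL;DR: This paper proposes COVTrack++, a synergistic framework for open-vocabulary multi-object tracking (OVMOT) that addresses the restriction of conventional multi-object tracking to specific categories. It eases the data bottleneck by constructing C-TAO, the first continuously annotated training set, and designs three modules (MCF, MGA, TCP) that handle detection and association jointly, improving tracking of arbitrary categories, including novel objects unseen during training.
Details
Motivation: Conventional multi-object tracking (MOT) is usually limited to a few specific categories, which restricts its applicability to real-world scenes with diverse objects. Open-vocabulary multi-object tracking (OVMOT) aims to track arbitrary categories, including novel objects unseen during training, but progress is held back by two challenges: the lack of continuously annotated video data for training, and the lack of a customized OVMOT framework that handles detection and association synergistically.
Result: On the TAO benchmark, COVTrack++ achieves state-of-the-art performance, with novel TETA reaching 35.4% on the validation set and 30.5% on the test set, and with novel AssocA improved by 4.8% and novel LocA by 5.8% over previous methods. It also shows strong zero-shot generalization on BDD100K.
Insight: The contributions are C-TAO, the first continuously annotated OVMOT training set, which substantially increases annotation density and captures smooth motion dynamics, and COVTrack++, a synergistic framework that realizes a bidirectional reciprocal mechanism between detection and association through three modules. Multi-Cue Adaptive Fusion (MCF) dynamically balances appearance, motion, and semantic cues; Multi-Granularity Hierarchical Aggregation (MGA) exploits hierarchical spatial relationships to enhance association features; and Temporal Confidence Propagation (TCP) recovers flickering detections using high-confidence tracked objects, stabilizing trajectories.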
Abstract: Multi-Object Tracking (MOT) has traditionally focused on a few specific categories, restricting its applicability to real-world scenarios involving diverse objects. Open-Vocabulary Multi-Object Tracking (OVMOT) addresses this by enabling tracking of arbitrary categories, including novel objects unseen during training. However, current progress is constrained by two challenges: the lack of continuously annotated video data for training, and the lack of a customized OVMOT framework to synergistically handle detection and association. We address the data bottleneck by constructing C-TAO, the first continuously annotated training set for OVMOT, which increases annotation density by 26x over the original TAO and captures smooth motion dynamics and intermediate object states. For the framework bottleneck, we propose COVTrack++, a synergistic framework that achieves a bidirectional reciprocal mechanism between detection and association through three modules: (1) Multi-Cue Adaptive Fusion (MCF) dynamically balances appearance, motion, and semantic cues for association feature learning; (2) Multi-Granularity Hierarchical Aggregation (MGA) exploits hierarchical spatial relationships in dense detections, where visible child nodes (e.g., object parts) assist occluded parent objects (e.g., whole body) for association feature enhancement; (3) Temporal Confidence Propagation (TCP) recovers flickering detections through high-confidence tracked objects boosting low-confidence candidates across frames, stabilizing trajectories. Extensive experiments on TAO demonstrate state-of-the-art performance, with novel TETA reaching 35.4% and 30.5% on validation and test sets, improving novel AssocA by 4.8% and novel LocA by 5.8% over previous methods, and show strong zero-shot generalization on BDD100K. The code and dataset will be publicly available.
[59] Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection cs.CV | cs.MM
Sa Zhu, Wanqian Zhang, Lin Wang, Xiaohua Chen, Chenxu Cui
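TL;DR: This paper proposes PDA (Phase-wise Decomposition and Alignment), a framework for generalizing to unseen action categories in open-vocabulary temporal action detection (OV-TAD). A CoT-prompting semantic decomposition module uses the chain-of-thought reasoning ability of large language models to decompose action labels automatically into coherent phase-level descriptions; a text-infused foreground filtering module and an adaptive phase-wise alignment module then perform fine-grained visual-textual alignment, promoting the learning of transferable action patterns.
Details
Motivation: Existing OV-TAD methods rely solely on global alignment between label-level semantics and visual features, which is insufficient to transfer temporally consistent visual knowledge from seen to unseen classes, so a finer-grained mechanism for learning action patterns is needed.
Result: Extensive experiments on two OV-TAD benchmarks demonstrate the superiority of the method and indicate that it significantly improves generalization to unseen actions.
Insight: The core contribution is using the chain-of-thought reasoning of large language models to decompose actions semantically at the phase level, with a matching phase-level visual-textual alignment framework. This shifts the paradigm from coarse global alignment to fine-grained phase-wise alignment and helps learn more transferable action representations.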
Abstract: Open-Vocabulary Temporal Action Detection (OV-TAD) aims to classify and localize action segments in untrimmed videos for unseen categories. Previous methods rely solely on global alignment between label-level semantics and visual features, which is insufficient to transfer temporally consistent visual knowledge from seen to unseen classes. To address this, we propose a Phase-wise Decomposition and Alignment (PDA) framework, which enables fine-grained action pattern learning for effective prior knowledge transfer. Specifically, we first introduce the CoT-Prompting Semantic Decomposition (CSD) module, which leverages the chain-of-thought (CoT) reasoning ability of large language models to automatically decompose action labels into coherent phase-level descriptions, emulating human cognitive processes. Then, a Text-infused Foreground Filtering (TIF) module is introduced to adaptively filter action-relevant segments for each phase by leveraging phase-wise semantic cues, producing semantically aligned visual representations. Furthermore, we propose the Adaptive Phase-wise Alignment (APA) module to perform phase-level visual-textual matching and adaptively aggregate alignment results across phases for the final prediction. This adaptive phase-wise alignment facilitates the capture of transferable action patterns and significantly enhances generalization to unseen actions. Extensive experiments on two OV-TAD benchmarks demonstrate the superiority of the proposed method.
[60] SpectralSplats: Robust Differentiable Tracking via Spectral Moment Supervision cs.CV
Avigail Cohen Rimon, Amir Mann, Mirela Ben Chen, Or Litany
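TL;DR: This paper proposes SpectralSplats, a robust framework for video tracking based on 3D Gaussian Splatting (3DGS) models. By moving the optimization objective from the spatial domain to the frequency domain and supervising with spectral moments, it resolves the vanishing gradients that standard photometric objectives suffer under severe camera misalignment, enabling robust tracking that recovers complex deformations from severely misaligned initializations.
Details
Motivation: Although 3DGS enables real-time, photorealistic novel view synthesis, using its differentiable renderer for tracking in real-world conditions is very fragile. The fundamental bottleneck is the compact, local support of the Gaussian primitives: when severe camera misalignment places the rendered object outside the target’s local footprint, the gradients of standard photometric objectives vanish entirely and the optimizer is left stranded.
Result: The paper shows that SpectralSplats works as a seamless drop-in replacement for spatial losses across deformation parameterizations ranging from MLPs to sparse control points, and recovers complex deformations even from severely misaligned initializations where standard appearance-based tracking fails catastrophically.
Insight: The core contribution is moving supervision from the spatial domain to the frequency domain: supervising global complex sinusoidal features (spectral moments) builds a global basin of attraction, so a valid directional gradient toward the target exists across the entire image domain even when there is no pixel overlap at all. The paper also derives a frequency annealing schedule from first principles that moves the optimizer gracefully from global convexity to precise spatial alignment and avoids the periodic local minima introduced by high frequencies.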
Abstract: 3D Gaussian Splatting (3DGS) enables real-time, photorealistic novel view synthesis, making it a highly attractive representation for model-based video tracking. However, leveraging the differentiability of the 3DGS renderer “in the wild” remains notoriously fragile. A fundamental bottleneck lies in the compact, local support of the Gaussian primitives. Standard photometric objectives implicitly rely on spatial overlap; if severe camera misalignment places the rendered object outside the target’s local footprint, gradients strictly vanish, leaving the optimizer stranded. We introduce SpectralSplats, a robust tracking framework that resolves this “vanishing gradient” problem by shifting the optimization objective from the spatial to the frequency domain. By supervising the rendered image via a set of global complex sinusoidal features (Spectral Moments), we construct a global basin of attraction, ensuring that a valid, directional gradient toward the target exists across the entire image domain, even when pixel overlap is completely nonexistent. To harness this global basin without introducing periodic local minima associated with high frequencies, we derive a principled Frequency Annealing schedule from first principles, gracefully transitioning the optimizer from global convexity to precise spatial alignment. We demonstrate that SpectralSplats acts as a seamless, drop-in replacement for spatial losses across diverse deformation parameterizations (from MLPs to sparse control points), successfully recovering complex deformations even from severely misaligned initializations where standard appearance-based tracking catastrophically fails.
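A toy 1D sketch of why a frequency-domain loss can still guide an optimizer when a pixel loss cannot. It assumes a spectral moment is a low-frequency discrete Fourier coefficient of the image; the paper's exact definition is not given in the abstract, and `bump`, `N`, and the chosen positions are hypothetical. Note that matching all frequencies would be equivalent to the pixel loss by Parseval's theorem, so the benefit comes from restricting supervision to low frequencies first, which is consistent with the annealing schedule the abstract describes.

```python
import cmath
import math

N = 64  # number of pixels in the toy 1D image

def bump(center, half_width=2):
    """A box-shaped 'object' with compact support around `center`."""
    return [1.0 if abs(x - center) <= half_width else 0.0 for x in range(N)]

def pixel_loss(a, b):
    """Standard photometric (sum of squared differences) loss."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def spectral_moment(img, k):
    """Projection of the image onto a global complex sinusoid of frequency k."""
    return sum(v * cmath.exp(-2j * math.pi * k * x / N) for x, v in enumerate(img))

def spectral_loss(a, b, freqs=(1,)):
    """Squared distance between the spectral moments of two images."""
    return sum(abs(spectral_moment(a, k) - spectral_moment(b, k)) ** 2 for k in freqs)

target = bump(40)
far, near = bump(10), bump(20)  # neither overlaps the target

# The pixel loss is identical for both, so it gives no hint which way to move.
print(pixel_loss(far, target), pixel_loss(near, target))
# The low-frequency spectral loss is smaller for the closer bump.
print(spectral_loss(far, target), spectral_loss(near, target))
```

The signal is periodic, so with frequency 1 the loss only decreases monotonically for shifts below half the image width; higher frequencies shorten that range, which is where periodic local minima come from.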
[61] A^3: Towards Advertising Aesthetic Assessment cs.CV
Kaiyuan Ji, Yixuan Gao, Lu Sun, Yushuo Zheng, Zijian Chen
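TL;DR: This paper presents A^3 (Advertising Aesthetic Assessment), a framework that addresses the subjectivity and the lack of scalability and standardization in advertising image evaluation. It comprises a theory-driven paradigm (A^3-Law), a dataset (A^3-Dataset), a multimodal large language model (A^3-Align), and a benchmark (A^3-Bench). The core model, A^3-Align, performs strongly on A^3-Bench, assessing advertising aesthetics effectively and generalizing to advertisement selection and critique tasks.
Details
Motivation: Advertising image evaluation currently depends mainly on subjective judgment and lacks scalability, standardized criteria, and interpretability, so a systematic, objective evaluation method is needed.
Result: Extensive experiments on A^3-Bench show that A^3-Align aligns better with the A^3-Law paradigm than existing models and generalizes well to quality advertisement selection and prescriptive advertisement critique.
Insight: The contributions are a hierarchical, theory-driven evaluation paradigm (A^3-Law); a large-scale dataset with multi-dimensional annotations (A^3-Dataset); and a multimodal large language model guided by Chain-of-Thought (CoT) learning (A^3-Align) that yields interpretable assessments of advertising aesthetics, offering a new approach to automated evaluation in advertising.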
Abstract: Advertising images significantly impact commercial conversion rates and brand equity, yet current evaluation methods rely on subjective judgments, lacking scalability, standardized criteria, and interpretability. To address these challenges, we present A^3 (Advertising Aesthetic Assessment), a comprehensive framework encompassing four components: a paradigm (A^3-Law), a dataset (A^3-Dataset), a multimodal large language model (A^3-Align), and a benchmark (A^3-Bench). Central to A^3 is a theory-driven paradigm, A^3-Law, comprising three hierarchical stages: (1) Perceptual Attention, evaluating perceptual image signals for their ability to attract attention; (2) Formal Interest, assessing formal composition of image color and spatial layout in evoking interest; and (3) Desire Impact, measuring desire evocation from images and their persuasive impact. Building on A^3-Law, we construct A^3-Dataset with 120K instruction-response pairs from 30K advertising images, each richly annotated with multi-dimensional labels and Chain-of-Thought (CoT) rationales. We further develop A^3-Align, trained under A^3-Law with CoT-guided learning on A^3-Dataset. Extensive experiments on A^3-Bench demonstrate that A^3-Align achieves superior alignment with A^3-Law compared to existing models, and this alignment generalizes well to quality advertisement selection and prescriptive advertisement critique, indicating its potential for broader deployment. Dataset, code, and models can be found at: https://github.com/euleryuan/A3-Align.
[62] Beyond Semantic Priors: Mitigating Optimization Collapse for Generalizable Visual Forensics cs.CV
Jipeng Liu, Haichao Shi, Siyu Xing, Rong Yin, Xiao-Yu Zhang
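TL;DR: Addressing the representational disconnect in generalizable deepfake detection built on vision-language models such as CLIP, this paper identifies a failure mode called Optimization Collapse and analyzes it theoretically by proposing the Critical Optimization Radius (COR) and leveraging the Gradient Signal-to-Noise Ratio (GSNR). On this basis the authors design the Contrastive Regional Injection Transformer (CoRIT), which mitigates optimization collapse by introducing a Contrastive Gradient Proxy (CGP) and three training-free strategies, achieving state-of-the-art generalization on cross-domain and universal forgery benchmarks.
Details
Motivation: Existing generalizable deepfake detectors built on vision-language models such as CLIP inherit a semantics-centric pretraining paradigm that struggles to capture the non-semantic artifacts inherent to hyper-realistic synthesis, which limits generalization. The paper aims to expose and resolve the resulting optimization collapse.
Result: Extensive experiments on cross-domain and universal forgery detection benchmarks show that CoRIT effectively mitigates optimization collapse and achieves state-of-the-art (SOTA) generalization.
Insight: The core contributions are: (1) on the theory side, the first formal definition of optimization collapse, with the Critical Optimization Radius (COR) and the Gradient Signal-to-Noise Ratio (GSNR) as quantitative measures, showing that layer-wise GSNR attenuation is the root cause of optimization collapse in non-semantic forgery detection; and (2) on the method side, the CoRIT framework, which combines a computationally efficient Contrastive Gradient Proxy (CGP) with three training-free strategies (Region Refinement Mask, Regional Signal Injection, and Hierarchical Representation Integration) to improve gradient fidelity and thereby raise the model’s intrinsic generalization potential, rather than only relieving the surface symptom of unstable optimization.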
Abstract: While Vision-Language Models (VLMs) like CLIP have emerged as a dominant paradigm for generalizable deepfake detection, a representational disconnect remains: their semantic-centric pre-training is ill-suited for capturing non-semantic artifacts inherent to hyper-realistic synthesis. In this work, we identify a failure mode termed Optimization Collapse, where detectors trained with Sharpness-Aware Minimization (SAM) degenerate to random guessing on non-semantic forgeries once the perturbation radius exceeds a narrow threshold. To theoretically formalize this collapse, we propose the Critical Optimization Radius (COR) to quantify the geometric stability of the optimization landscape, and leverage the Gradient Signal-to-Noise Ratio (GSNR) to measure generalization potential. We establish a theorem proving that COR increases monotonically with GSNR, thereby revealing that the geometric instability of SAM optimization originates from degraded intrinsic generalization potential. This result identifies the layer-wise attenuation of GSNR as the root cause of Optimization Collapse in detecting non-semantic forgeries. Although naively reducing perturbation radius yields stable convergence under SAM, it merely treats the symptom without mitigating the intrinsic generalization degradation, necessitating enhanced gradient fidelity. Building on this insight, we propose the Contrastive Regional Injection Transformer (CoRIT), which integrates a computationally efficient Contrastive Gradient Proxy (CGP) with three training-free strategies: Region Refinement Mask to suppress CGP variance, Regional Signal Injection to preserve CGP magnitude, and Hierarchical Representation Integration to attain more generalizable representations. Extensive experiments demonstrate that CoRIT mitigates optimization collapse and achieves state-of-the-art generalization across cross-domain and universal forgery benchmarks.
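A minimal sketch of one Sharpness-Aware Minimization (SAM) update, to make the “perturbation radius” in the abstract concrete. It assumes the standard two-step SAM rule (ascend by `rho` along the normalized gradient, then descend using the gradient taken at the perturbed point) on a hypothetical toy quadratic; it does not reproduce the paper's detectors, COR, GSNR, or CoRIT.

```python
import math

def loss(w):
    """Toy quadratic loss with one sharp and one flat direction (hypothetical)."""
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def grad(w):
    """Analytic gradient of the toy loss."""
    return [10.0 * w[0], w[1]]

def sam_step(w, rho=0.05, lr=0.01, eps=1e-12):
    """One SAM update: perturb within radius rho, then descend with that gradient."""
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + eps
    # Step 1: move to the approximate worst-case point inside the rho-ball.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Step 2: apply the gradient evaluated at the perturbed point to the original weights.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w_start = [1.0, 1.0]
w = list(w_start)
for _ in range(200):
    w = sam_step(w)
print(loss(w_start), loss(w))
```

Here `rho` is the perturbation radius: larger values probe a wider neighborhood of the loss surface, and the abstract reports that beyond a narrow threshold the detectors it studies degenerate to random guessing on non-semantic forgeries.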
[63] Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification cs.CV | cs.AI
Han Sun, Qin Li, Peixin Wang, Min Zhang
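TL;DR: Targeting object hallucination in large vision-language models (LVLMs), this paper proposes Attention Imbalance Rectification (AIR), a decoding-time intervention. Through a systematic empirical study, the authors find that imbalanced attention allocation, both across and within modalities, shows a strong causal correlation with object hallucination. AIR rectifies the imbalance by reallocating attention weights and adjusting attention distributions, which effectively reduces hallucination.
Details
Motivation: Object hallucination severely undermines the reliability of LVLMs in high-stakes scenarios such as autonomous driving and medical image analysis and is a critical barrier to deployment. The paper aims to address this problem and make the models more reliable.
Result: Across four mainstream LVLMs and three benchmarks (CHAIR, POPE, MM-Vet), and compared with seven baselines, AIR consistently lowers object hallucination rates, by up to 35.1%, while improving the general capability of LVLMs on diverse vision-language tasks by up to 15.9%.
Insight: The novelty is the concept of “attention imbalance”, which quantifies and visualizes the attention patterns that drive object hallucination (such as over-attention to irrelevant language tokens or under-attention to discriminative visual features), and the lightweight decoding-time rectification method AIR built on it. Viewed objectively, intervening in the attention mechanism targets the source of hallucination directly and is an efficient, generalizable strategy.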
Abstract: Object hallucination in Large Vision-Language Models (LVLMs) severely compromises their reliability in real-world applications, posing a critical barrier to their deployment in high-stakes scenarios such as autonomous driving and medical image analysis. Through systematic empirical investigation, we identify that the imbalanced attention allocation, both across modalities (i.e., vision and language) and within modalities (among individual tokens), exhibits a strong causal correlation with the occurrence of object hallucination. Leveraging this insight, we introduce a novel concept termed attention imbalance, which not only quantifies the degree of attention disparity but also visually delineates the underlying patterns (e.g., over-attentiveness to irrelevant language tokens or under-attentiveness to discriminative visual features) that drive object hallucination. To mitigate object hallucination, we further propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention method that reallocates attention weights and adjusts attention distributions to rectify modality-wise and token-wise imbalances. Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, and MM-Vet) with seven baselines demonstrate that AIR consistently reduces object hallucination rates, achieving up to a 35.1% reduction compared to the baselines, while improving LVLMs’ general capability by up to 15.9% across diverse vision-language tasks.
[64] AD-Reasoning: Multimodal Guideline-Guided Reasoning for Alzheimer’s Disease Diagnosis cs.CV
Qiuhui Chen, Yushan Deng, Xuancheng Yao, Yi Hong
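TL;DR: This paper presents AD-Reasoning, a multimodal reasoning framework for Alzheimer’s disease diagnosis. It combines structural MRI with six clinical modalities and adds a rule-based verifier to produce structured diagnoses consistent with the NIA-AA diagnostic criteria. The method uses modality-specific encoders, bidirectional cross-attention fusion, and reinforcement fine-tuning, and the authors release AD-MultiSense, a multimodal QA dataset covering 10,378 visits.
Details
Motivation: Alzheimer’s disease diagnosis requires integrating neuroimaging with heterogeneous clinical evidence and reasoning under established criteria, yet most existing multimodal models are opaque and only weakly aligned with clinical guidelines.
Result: On AD-MultiSense, AD-Reasoning achieves state-of-the-art diagnostic accuracy and produces structured rationales, improving transparency over recent baselines.
Insight: The novelty is combining a rule-based verifier with multimodal fusion and using reinforcement learning rewards to enforce output format, guideline evidence coverage, and reasoning-decision consistency, which improves the transparency of diagnoses and their alignment with clinical guidelines.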
Abstract: Alzheimer’s disease (AD) diagnosis requires integrating neuroimaging with heterogeneous clinical evidence and reasoning under established criteria, yet most multimodal models remain opaque and weakly guideline-aligned. We present AD-Reasoning, a multimodal framework that couples structural MRI with six clinical modalities and a rule-based verifier to generate structured, NIA-AA-consistent diagnoses. AD-Reasoning combines modality-specific encoders, bidirectional cross-attention fusion, and reinforcement fine-tuning with verifiable rewards that enforce output format, guideline evidence coverage, and reasoning–decision consistency. We also release AD-MultiSense, a 10,378-visit multimodal QA dataset with guideline-validated rationales built from ADNI/AIBL. On AD-MultiSense, AD-Reasoning achieves state-of-the-art diagnostic accuracy and produces structured rationales that improve transparency over recent baselines.
[65] PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation cs.CV
Yuheng Feng, Wen Zhang, Haodong Duan, Xingxing Zou
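TL;DR: This paper presents PosterIQ, a design-perspective benchmark for poster understanding and generation, with 7,765 image-annotation instances and 822 generation prompts spanning real, professional, and synthetic cases. It defines tasks for layout parsing, text-image correspondence, typography/readability and font perception, design quality assessment, and controllable, composition-aware generation with metaphor, aiming to bridge visual design cognition and generative modeling. Evaluating state-of-the-art MLLMs and diffusion-based generators reveals persistent gaps in visual hierarchy, typographic semantics, saliency control, and intention communication.
Details
Motivation: To bridge the gap between visual design cognition and generative modeling by providing a diagnostic benchmark for poster understanding and generation centered on design principles, and so encourage generative vision-language systems to incorporate human-centred design.
Result: The paper evaluates state-of-the-art multimodal large language models (MLLMs) and diffusion-based generators. Commercial models lead on high-level reasoning tasks but are insensitive as raters; generators render text well but struggle with composition-aware synthesis. Overall, the models show significant gaps relative to professional design in visual hierarchy, typographic semantics, saliency control, and intention communication.
Insight: The novelty is a multi-task, diagnostic benchmark built systematically from a design perspective (composition structure, typographic hierarchy, semantic intent) that serves both for quantitative evaluation and as a diagnostic tool for analyzing models’ design reasoning. It offers a concrete path, with reproducible task-specific metrics, for building human-centred design principles into generative AI systems.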
Abstract: We present PosterIQ, a design-driven benchmark for poster understanding and generation, annotated across composition structure, typographic hierarchy, and semantic intent. It includes 7,765 image-annotation instances and 822 generation prompts spanning real, professional, and synthetic cases. To bridge visual design cognition and generative modeling, we define tasks for layout parsing, text-image correspondence, typography/readability and font perception, design quality assessment, and controllable, composition-aware generation with metaphor. We evaluate state-of-the-art MLLMs and diffusion-based generators, finding persistent gaps in visual hierarchy, typographic semantics, saliency control, and intention communication; commercial models lead on high-level reasoning but act as insensitive automatic raters, while generators render text well yet struggle with composition-aware synthesis. Extensive analyses show PosterIQ is both a quantitative benchmark and a diagnostic tool for design reasoning, offering reproducible, task-specific metrics. We aim to catalyze models’ creativity and integrate human-centred design principles into generative vision-language systems.
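[66] When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm cs.CV | cs.AI | cs.CR [PDF]
Ye Leng, Junjie Chu, Mingjie Li, Chenhao Lin, Chao Shen
TL;DR: This paper systematically analyzes the new safety risks that multimodal large language models (MLLMs) introduce in image generation because of their stronger semantic understanding, covering unsafe content generation and fake image synthesis, and compares them with diffusion models. MLLMs tend to generate more unsafe images on multiple unsafe-generation benchmark datasets, and their fake images are harder for existing detectors to identify. Even retrained detectors can be bypassed by giving the MLLM longer, more detailed inputs, which indicates that the safety risks of MLLMs are not yet sufficiently recognized and pose new challenges to real-world safety.
Details
Motivation: As an emerging unified paradigm for language and image generation, MLLMs have stronger semantic understanding than diffusion models, but the authors are concerned that this enhanced ability may introduce new and greater safety risks. They therefore systematically analyze and compare the safety risks of MLLMs and diffusion models.
Result: Across multiple unsafe-generation benchmark datasets, MLLMs generate more unsafe images than diffusion models. MLLM-generated images are harder for current advanced fake image detectors to identify, and even detectors retrained on MLLM-specific data can be bypassed by providing longer, more detailed inputs.
Insight: The novelty lies in the first systematic comparison of MLLMs and diffusion models along two safety dimensions (unsafe content generation and fake image synthesis). It shows that the semantic understanding advantage of MLLMs creates distinctive safety challenges, so that understanding itself becomes a source of risk. This offers a new perspective for generative AI safety research and underlines the need to evaluate and mitigate risks while advancing model capabilities.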
Abstract: Recently, multimodal large language models (MLLMs) have emerged as a unified paradigm for language and image generation. Compared with diffusion models, MLLMs possess a much stronger capability for semantic understanding, enabling them to process more complex textual inputs and comprehend richer contextual meanings. However, this enhanced semantic ability may also introduce new and potentially greater safety risks. Taking diffusion models as a reference point, we systematically analyze and compare the safety risks of emerging MLLMs along two dimensions: unsafe content generation and fake image synthesis. Across multiple unsafe generation benchmark datasets, we observe that MLLMs tend to generate more unsafe images than diffusion models. This difference partly arises because diffusion models often fail to interpret abstract prompts, producing corrupted outputs, whereas MLLMs can comprehend these prompts and generate unsafe content. For current advanced fake image detectors, MLLM-generated images are also notably harder to identify. Even when detectors are retrained with MLLM-specific data, they can still be bypassed by simply providing MLLMs with longer and more descriptive inputs. Our measurements indicate that the emerging safety risks of the cutting-edge generative paradigm, MLLMs, have not been sufficiently recognized, posing new challenges to real-world safety.
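[67] Combi-CAM: A Novel Multi-Layer Approach for Explainable Image Geolocalization cs.CV [PDF]
David Faget, José Luis Lisani, Miguel Colom
TL;DR: This paper proposes Combi-CAM, a method that improves the explainability of CNN-based image geolocalization models by combining gradient-weighted class activation maps from several layers of the network instead of using only the features of the deepest layer.
Details
Motivation: To address the difficulty of interpreting the predictions of deep learning models (especially CNNs) in planet-scale image geolocalization, and to give a clearer basis for model decisions.
Result: The abstract reports no quantitative results or benchmarks, but claims that the method gives deeper and more detailed insight into model decisions than traditional approaches.
Insight: The novelty is using gradient information from multiple layers (not only the final one) to produce more comprehensive explanation heatmaps, which helps show how image features at different levels contribute to the geolocalization decision.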
Abstract: Planet-scale photo geolocalization involves the intricate task of estimating the geographic location depicted in an image purely based on its visual features. While deep learning models, particularly convolutional neural networks (CNNs), have significantly advanced this field, understanding the reasoning behind their predictions remains challenging. In this paper, we present Combi-CAM, a novel method that enhances the explainability of CNN-based geolocalization models by combining gradient-weighted class activation maps obtained from several layers of the network architecture, rather than using only information from the deepest layer as is typically done. This approach provides a more detailed understanding of how different image features contribute to the model’s decisions, offering deeper insights than the traditional approaches.
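The multi-layer combination described in the abstract can be illustrated with a minimal sketch. It assumes the per-layer Grad-CAM maps have already been computed and are given as plain 2D lists; the nearest-neighbour upsampling, min-max normalization, and uniform weighting are illustrative assumptions, not the combination rule used by the authors, and all function names are hypothetical.

```python
def upsample_nearest(cam, out_h, out_w):
    """Resize a 2D map (list of rows) to out_h x out_w by nearest-neighbour lookup."""
    in_h, in_w = len(cam), len(cam[0])
    return [[cam[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]


def normalize(cam):
    """Min-max scale a map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


def combine_cams(cams, out_h, out_w, weights=None):
    """Weighted average of per-layer maps after upsampling and normalization.

    Assumption: uniform weights unless given; the paper may weight layers differently.
    """
    if weights is None:
        weights = [1.0 / len(cams)] * len(cams)
    resized = [normalize(upsample_nearest(c, out_h, out_w)) for c in cams]
    return [[sum(w * r[i][j] for w, r in zip(weights, resized))
             for j in range(out_w)] for i in range(out_h)]


# Toy maps: a coarse 2x2 map from a deep layer and a finer 4x4 map from a shallow layer.
deep = [[0.0, 4.0], [0.0, 0.0]]
shallow = [[0.0, 0.0, 0.0, 2.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0]]
combined = combine_cams([deep, shallow], 4, 4)
```

In this toy case the deep map marks a coarse quadrant while the shallow map pinpoints one pixel inside it, so the combined map peaks where both agree.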
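[68] Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection cs.CV | cs.LG [PDF]
Zhanhe Lei, Zhongyuan Wang, Jikang Cheng, Baojin Huang, Yuhong Yang
TL;DR: This paper proposes a novel Tutor-Student Reinforcement Learning framework that dynamically optimizes the training curriculum for deepfake detection. Training is modeled as a Markov Decision Process in which a Tutor agent learns to guide the Student detector: it observes a rich state for each training sample and assigns a continuous weight to that sample’s loss, dynamically re-weighting the training batch so that the Student learns more robust and generalizable features.
Details
Motivation: Standard supervised training treats all samples equally, which can hinder the learning of robust and generalizable features. A method that dynamically optimizes the training curriculum is therefore needed to improve deepfake detectors.
Result: Experiments show that, compared with traditional training, the adaptive curriculum improves the Student detector’s generalization to unseen manipulation techniques.
Insight: The novelty lies in modeling training as an MDP and introducing a Tutor agent that uses a rich state representation, including visual features and historical learning dynamics, to dynamically re-weight training samples. This prioritizes high-value samples (such as hard-but-learnable examples) and constitutes a new curriculum learning strategy.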
Abstract: Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a “Tutor” agent learns to guide a “Student” (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample’s loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student’s immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student’s generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at https://github.com/wannac1/TSRL.
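The bookkeeping around the Tutor can be sketched as follows. This is a simplified illustration, not the released TSRL code: the PPO policy itself is omitted and the Tutor weights are passed in as plain numbers. The decay value, the reward of +1 per incorrect-to-correct flip, and all names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SampleState:
    """Per-sample learning dynamics that the Tutor could observe (assumed layout)."""
    ema_loss: float = 0.0
    forget_count: int = 0
    was_correct: bool = False


def update_state(state, loss, correct, decay=0.9):
    """Update the EMA loss and count a forgetting event (correct -> incorrect)."""
    state.ema_loss = decay * state.ema_loss + (1.0 - decay) * loss
    if state.was_correct and not correct:
        state.forget_count += 1
    state.was_correct = correct
    return state


def tutor_reward(before, after):
    """Assumed reward: +1 for every sample that flips from incorrect to correct."""
    return sum(1.0 for b, a in zip(before, after) if not b and a)


def weighted_batch_loss(losses, weights):
    """Mean of per-sample losses scaled by Tutor weights clipped to [0, 1]."""
    clipped = [min(1.0, max(0.0, w)) for w in weights]
    return sum(w * l for w, l in zip(clipped, losses)) / len(losses)


state = SampleState()
update_state(state, loss=1.0, correct=True)
update_state(state, loss=2.0, correct=False)   # one forgetting event
batch_loss = weighted_batch_loss([2.0, 4.0], [0.5, 1.5])
reward = tutor_reward([False, True, False], [True, True, False])
```

Clipping the weights mirrors the stated action range of 0 to 1; a real implementation would more likely bound the policy output directly.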
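[69] LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds cs.CV [PDF]
Jaehun Bang, Jinhyeok Kim, Minji Kim, Seungheon Jeong, Kyungdon Joo
TL;DR: LightSplat is a fast, memory-efficient, training-free framework for open-vocabulary 3D scene understanding. It injects compact 2-byte semantic indices into 3D representations built from multi-view images, assigns indices only to salient regions, and manages them with a lightweight index-feature mapping, avoiding costly feature optimization and storage overhead. Single-step clustering ensures semantic consistency and efficient inference, and the method achieves state-of-the-art performance on several benchmarks covering complex indoor and outdoor scenes.
Details
Motivation: Existing open-vocabulary 3D scene understanding methods are slow, memory-intensive, and overly complex because of iterative optimization and dense per-Gaussian feature assignment. This paper targets these efficiency bottlenecks.
Result: On complex indoor-outdoor benchmarks including LERF-OVS, ScanNet, and DL3DV-OVS, LightSplat achieves state-of-the-art performance with up to 50-400x speedup and 64x lower memory.
Insight: The main novelties are: 1) compact 2-byte semantic indices replace dense high-dimensional features, sharply reducing storage and compute; 2) indices are assigned only to salient regions and managed with a lightweight mapping, making the method training-free; 3) single-step clustering links geometrically and semantically related masks to ensure 3D semantic consistency. The method offers an efficient route to scalable language-driven 3D understanding.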
Abstract: Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain slow, memory-intensive, and overly complex due to iterative optimization and dense per-Gaussian feature assignments. To address this, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantic indices only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. We further ensure semantic consistency and efficient inference via single-step clustering that links geometrically and semantically related masks in 3D. We evaluate our method on LERF-OVS, ScanNet, and DL3DV-OVS across complex indoor-outdoor scenes. As a result, LightSplat achieves state-of-the-art performance with up to 50-400x speedup and 64x lower memory, enabling scalable language-driven 3D understanding. For more details, visit our project page https://vision3d-lab.github.io/lightsplat/.
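The idea of storing a small integer index per Gaussian and looking features up in a shared table can be sketched as below. This is a toy illustration under stated assumptions: the sentinel value for unlabeled Gaussians, the cosine-similarity threshold, and the function names are all hypothetical, and the features are 2-D for readability.

```python
import math
from array import array

NO_LABEL = 0xFFFF  # assumed sentinel for Gaussians outside every salient region


def build_index(num_gaussians, assignments):
    """Store one unsigned 16-bit semantic index per Gaussian."""
    index = array("H", [NO_LABEL] * num_gaussians)
    for gaussian_id, label in assignments.items():
        index[gaussian_id] = label
    return index


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def query(index, table, text_feature, threshold=0.5):
    """Return ids of Gaussians whose mapped feature matches the text feature."""
    # Similarity is computed once per table entry, not once per Gaussian.
    matching = {label for label, feat in table.items()
                if cosine(feat, text_feature) >= threshold}
    return [g for g, label in enumerate(index) if label in matching]


table = {0: [1.0, 0.0], 1: [0.0, 1.0]}        # index -> feature mapping (toy features)
index = build_index(5, {0: 0, 2: 1, 3: 0})    # Gaussians 1 and 4 stay unlabeled
hits = query(index, table, [1.0, 0.0])
```

The design point is that a text query is compared against the small table rather than against a dense feature stored on every Gaussian.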
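[70] CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare cs.CV [PDF]
Akash Ghosh, Tajamul Ashraf, Rishu Kumar Singh, Numan Saeed, Sriparna Saha
TL;DR: This paper proposes CarePilot, a multi-agent framework based on the actor-critic paradigm for automating complex, long-horizon computer tasks in healthcare. To address the weakness of existing multimodal agents on long healthcare software workflows, the authors first build CareFlow, a high-quality human-annotated benchmark covering complex workflows in medical annotation tools, DICOM viewers, EHR systems, and laboratory information systems. In CarePilot, the Actor combines tool grounding with a dual-memory mechanism to predict semantic actions, and the Critic evaluates actions and updates memory. Through iterative simulation, the Actor learns to make more robust, reasoning-aware predictions at inference time.
Details
Motivation: Prior work focuses on automating short-horizon or general-purpose applications (such as mobile or desktop interfaces), while long-horizon software workflow automation in specific domains, healthcare in particular, remains largely unexplored. Existing vision-language models perform poorly at long-horizon reasoning and multi-step interaction in medical contexts.
Result: CarePilot achieves state-of-the-art performance, outperforming strong closed-source and open-source multimodal baselines by approximately 15.26% and 3.38%, respectively, on the authors’ CareFlow benchmark and an out-of-distribution dataset.
Insight: The novelties are: 1) CareFlow, the first high-quality benchmark for long-horizon software workflows in healthcare; 2) CarePilot, an actor-critic multi-agent framework whose core is an Actor that integrates tool grounding with dual memory (long-term and short-term experience) and a Critic that provides an evaluation and feedback loop, with iterative simulation improving robustness at inference time. This offers a reusable agent architecture for long-horizon task automation in specific domains.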
Abstract: Multimodal agentic pipelines are transforming human-computer interaction by enabling efficient and accessible automation of complex, real-world tasks. However, recent efforts have focused on short-horizon or general-purpose applications (e.g., mobile or desktop interfaces), leaving long-horizon automation for domain-specific systems, particularly in healthcare, largely unexplored. To address this, we introduce CareFlow, a high-quality human-annotated benchmark comprising complex, long-horizon software workflows across medical annotation tools, DICOM viewers, EHR systems, and laboratory information systems. On this benchmark, existing vision-language models (VLMs) perform poorly, struggling with long-horizon reasoning and multi-step interactions in medical contexts. To overcome this, we propose CarePilot, a multi-agent framework based on the actor-critic paradigm. The Actor integrates tool grounding with dual-memory mechanisms (long-term and short-term experience) to predict the next semantic action from the visual interface and system state. The Critic evaluates each action, updates memory based on observed effects, and either executes or provides corrective feedback to refine the workflow. Through iterative agentic simulation, the Actor learns to perform more robust and reasoning-aware predictions during inference. Our experiments show that CarePilot achieves state-of-the-art performance, outperforming strong closed-source and open-source multimodal baselines by approximately 15.26% and 3.38%, respectively, on our benchmark and out-of-distribution dataset.
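[71] Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection cs.CV [PDF]
Xu Zhang, Zhe Chen, Jing Zhang, Dacheng Tao
TL;DR: This paper proposes HeROD, a lightweight, model-agnostic framework for referring object detection (ROD) under data scarcity. It injects heuristic-inspired spatial and semantic reasoning priors into three stages of a modern DETR-style detection pipeline (proposal ranking, prediction fusion, and Hungarian matching) to improve label efficiency and convergence performance.
Details
Motivation: Existing referring object detection models usually depend on large amounts of data, but real deployments such as robotics and augmented reality often face label scarcity, forcing models to learn spatial and semantic structure from scratch and wasting limited samples. The paper asks whether explicit reasoning priors can help models learn more efficiently when data is scarce.
Result: On the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, HeROD consistently outperforms strong baselines in low-data and few-shot settings.
Insight: The novelty lies in introducing the Data-efficient Referring Object Detection (De-ROD) task as an evaluation protocol and in biasing training and inference with interpretable, heuristic-inspired reasoning priors, offering a practical and extensible path toward data-efficient vision-language understanding.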
Abstract: Most referring object detection (ROD) models, especially the modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, face severe label scarcity. In such regimes, end-to-end grounding detectors need to learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: Can explicit reasoning priors help models learn more efficiently when data is scarce? To explore this, we first introduce a Data-efficient Referring Object Detection (De-ROD) task, which is a benchmark protocol for measuring ROD performance in low-data and few-shot settings. We then propose HeROD (Heuristic-inspired ROD), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, which are interpretable signals derived from the referring phrase, into three stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors promise to improve label efficiency and convergence performance. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong grounding baselines in scarce-label regimes. More broadly, our results suggest that integrating simple, interpretable reasoning priors provides a practical and extensible path toward better data-efficient vision-language understanding.
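To make the notion of a phrase-derived spatial prior concrete, the sketch below re-ranks proposals by adding a keyword-based prior to the detector score. It is purely illustrative: the keyword rules, the additive fusion with weight `alpha`, and the function names are assumptions and are not taken from HeROD.

```python
def spatial_prior(phrase, box, img_w, img_h):
    """Score in [0, 1] for how well a box (x1, y1, x2, y2) fits a spatial word."""
    cx = (box[0] + box[2]) / 2 / img_w   # normalized box centre, x
    cy = (box[1] + box[3]) / 2 / img_h   # normalized box centre, y
    words = phrase.lower().split()
    if "left" in words:
        return 1.0 - cx
    if "right" in words:
        return cx
    if "top" in words:
        return 1.0 - cy
    if "bottom" in words:
        return cy
    return 0.5  # no spatial cue: neutral prior


def rerank(phrase, boxes, scores, img_w, img_h, alpha=0.5):
    """Return proposal indices sorted by detector score plus weighted prior."""
    fused = [s + alpha * spatial_prior(phrase, b, img_w, img_h)
             for b, s in zip(boxes, scores)]
    return sorted(range(len(boxes)), key=lambda i: fused[i], reverse=True)


boxes = [(0, 0, 20, 20), (80, 0, 100, 20)]
scores = [0.6, 0.5]
order_with_cue = rerank("the dog on the right", boxes, scores, 100, 100)
order_without_cue = rerank("the dog", boxes, scores, 100, 100)
```

With the cue "right", the lower-scored proposal on the right side of the image is promoted to first place; without a spatial word the detector ranking is unchanged.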
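[72] Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection cs.CV [PDF]
Adhemar de Senneville, Xavier Bou, Jérémy Anger, Rafael Grompone, Gabriele Facciolo
TL;DR: This paper proposes Head Ensemble Classifiers (HEC) to address the poor image classification performance of large vision-language models (LVLMs). Using prompt conditioning and selection of the most discriminative attention heads inside the model, it builds a training-free ensemble classifier that achieves state-of-the-art performance on few-shot and zero-shot classification.
Details
Motivation: Current LVLMs excel at zero-shot tasks such as image captioning and visual question answering, yet perform poorly at image classification, even falling behind CLIP-based methods. The gap is surprising because many LVLMs use CLIP-pretrained vision encoders. The paper investigates why LVLMs underperform at classification and uses their internal representations, attention heads in particular, to improve it.
Result: HEC achieves state-of-the-art few-shot and zero-shot classification performance across 12 datasets, bridging the gap between CLIP-based and LVLM-based methods.
Insight: The novelty lies in showing that attention heads inside LVLMs can be more discriminative for classification than the model’s overall output, and in building a training-free ensemble classifier through prompt conditioning and head selection, inspired by Gaussian Discriminant Analysis. This offers a new and efficient way to improve LVLM classification performance.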
Abstract: Current Large Vision Language Models (LVLMs) excel at many zero-shot tasks like image captioning, visual question answering and OCR. However, these same models suffer from poor performance at image classification tasks, underperforming against CLIP-based methods. Notably, this gap is surprising because many LVLMs use CLIP-pretrained vision encoders. Yet LVLMs are not inherently limited by CLIP’s architecture with independent vision and text encoders. In CLIP, this separation biases classification toward class-name matching rather than joint visual-text reasoning. In this paper we show that, despite their poor raw performance, LVLMs can improve visual feature class separability at inference using prompt conditioning, and LVLMs’ internal representations, especially attention heads, can outperform the model itself at zero-shot and few-shot classification. We introduce Head Ensemble Classifiers (HEC) to bridge the performance gap between CLIP-based and LVLM-based classification methods. Inspired by Gaussian Discriminant Analysis, HEC ranks the most discriminative vision and text heads and combines them into a training-free classifier. We show that HEC achieves state-of-the-art performance in few-shot and zero-shot classification across 12 datasets.
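[73] RefReward-SR: LR-Conditioned Reward Modeling for Preference-Aligned Super-Resolution cs.CV [PDF]
Yushuai Song, Weize Quan, Weining Wang, Jiahui Sun, Jing Liu
TL;DR: This paper proposes RefReward-SR, a reward model that uses the low-resolution (LR) image as a reference to achieve super-resolution (SR) generation aligned with human preferences. It leverages the visual-linguistic priors of a multimodal large language model (MLLM) to assess, in a reasoning-aware manner, the semantic consistency and plausibility of high-resolution (HR) reconstructions with respect to their LR inputs. It is trained with the newly built large-scale dataset RefSR-18K and Group Relative Policy Optimization (GRPO), and is used to optimize SR models toward images that better match human perception.
Details
Motivation: Existing evaluation and optimization frameworks for super-resolution are misaligned with human perception. Full-reference and no-reference metrics often fail to reflect perceptual preference, and most methods rely on distribution matching against the ground truth (GT), which does not necessarily agree with human judgment.
Result: Extensive experiments show that the framework aligns substantially better with human judgments, producing reconstructions that preserve semantic consistency while improving perceptual plausibility and visual naturalness.
Insight: The novelty lies in an LR-conditioned reward modeling paradigm that treats the LR image as a semantic anchor for evaluating HR reconstructions, and in RefSR-18K, the first large-scale LR-conditioned preference dataset. The MLLM’s reasoning ability is used to assess semantic consistency, and GRPO integrates the reward signal into SR model training for preference-aligned generation.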
Abstract: Recent advances in generative super-resolution (SR) have greatly improved visual realism, yet existing evaluation and optimization frameworks remain misaligned with human perception. Full-Reference and No-Reference metrics often fail to reflect perceptual preference, either penalizing semantically plausible details due to pixel misalignment or favoring visually sharp but inconsistent artifacts. Moreover, most SR methods rely on ground-truth (GT)-dependent distribution matching, which does not necessarily correspond to human judgments. In this work, we propose RefReward-SR, a low-resolution (LR) reference-aware reward model for preference-aligned SR. Instead of relying on GT supervision or NR evaluation, RefReward-SR assesses high-resolution (HR) reconstructions conditioned on their LR inputs, treating the LR image as a semantic anchor. Leveraging the visual-linguistic priors of a Multimodal Large Language Model (MLLM), it evaluates semantic consistency and plausibility in a reasoning-aware manner. To support this paradigm, we construct RefSR-18K, the first large-scale LR-conditioned preference dataset for SR, providing pairwise rankings based on LR-HR consistency and HR naturalness. We fine-tune the MLLM with Group Relative Policy Optimization (GRPO) using LR-conditioned ranking rewards, and further integrate GRPO into SR model training with RefReward-SR as the core reward signal for preference-aligned generation. Extensive experiments show that our framework achieves substantially better alignment with human judgments, producing reconstructions that preserve semantic consistency while enhancing perceptual plausibility and visual naturalness. Code, models, and datasets will be released upon paper acceptance.
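GRPO scores each sampled output relative to the other outputs in its group rather than against a learned value function. The sketch below shows the commonly used group-normalized advantage, under the assumption that the rewards for one group of candidate outputs (for example, several SR reconstructions of the same LR input) are already available; the `eps` constant and the function name are illustrative, and the paper's exact reward design is not reproduced here.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)  # population variance
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Rewards for three candidate reconstructions of the same input (toy values).
advantages = group_relative_advantages([1.0, 2.0, 3.0])
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

When all candidates in a group receive the same reward, every advantage is zero and the group contributes no learning signal, which is why diverse candidates matter for this kind of training.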
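[74] Powerful Teachers Matter: Text-Guided Multi-view Knowledge Distillation with Visual Prior Enhancement cs.CV | cs.AI [PDF]
Xin Zhang, Jianyang Xu, Hao Peng, Dongjing Wang, Jingyuan Zheng
TL;DR: This paper proposes Text-guided Multi-view Knowledge Distillation (TMKD), which uses dual-modality teachers: a visual teacher and a text teacher (CLIP). The visual teacher is enhanced with multi-view inputs that incorporate visual priors (edge and high-frequency features), the text teacher produces semantic weights from prior-aware prompts to guide adaptive feature fusion, and a vision-language contrastive regularization strengthens the student model’s semantic knowledge.
Details
Motivation: Existing knowledge distillation methods focus mainly on distillation strategies and often overlook the quality of the teacher’s knowledge. This paper aims to improve distillation performance by improving teacher knowledge quality.
Result: Extensive experiments on five benchmarks show that TMKD consistently improves knowledge distillation performance, by up to 4.49%, validating the proposed dual-teacher multi-view enhancement strategy.
Insight: The novelty lies in using dual-modality teachers (visual and text) to provide richer supervision, improving teacher knowledge quality through multi-view visual prior enhancement and text-guided adaptive feature fusion, and adding vision-language contrastive regularization to strengthen the student’s semantic understanding.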
Abstract: Knowledge distillation transfers knowledge from large teacher models to smaller students for efficient inference. While existing methods primarily focus on distillation strategies, they often overlook the importance of enhancing teacher knowledge quality. In this paper, we propose Text-guided Multi-view Knowledge Distillation (TMKD), which leverages dual-modality teachers, a visual teacher and a text teacher (CLIP), to provide richer supervisory signals. Specifically, we enhance the visual teacher with multi-view inputs incorporating visual priors (edge and high-frequency features), while the text teacher generates semantic weights through prior-aware prompts to guide adaptive feature fusion. Additionally, we introduce vision-language contrastive regularization to strengthen semantic knowledge in the student model. Extensive experiments on five benchmarks demonstrate that TMKD consistently improves knowledge distillation performance by up to 4.49%, validating the effectiveness of our dual-teacher multi-view enhancement strategy. Code is available at https://anonymous.4open.science/r/TMKD-main-44D1.
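[75] HEART-PFL: Stable Personalized Federated Learning under Heterogeneity with Hierarchical Directional Alignment and Adversarial Knowledge Transfer cs.CV | cs.LG [PDF]
Minjun Kim, Minje Kim
TL;DR: HEART-PFL is a dual-sided personalized federated learning framework for heterogeneous data distributions. It uses Hierarchical Directional Alignment (HDA) to strengthen client specificity and Adversarial Knowledge Transfer (AKT) to stabilize global updates.
Details
Motivation: Existing PFL methods suffer from shallow prototype alignment and brittle server-side distillation. The paper aims to achieve both effective personalization of client models and global stability under heterogeneous distributions.
Result: Under Dirichlet non-IID partitions of CIFAR-100, Flowers-102, and Caltech-101, HEART-PFL reaches state-of-the-art personalized accuracy of 63.42%, 84.23%, and 95.67%, respectively, and remains robust to out-of-domain proxy data.
Insight: The novelty lies in depth-aware hierarchical alignment (cosine similarity in the early stage, MSE matching in the deep stage) complemented by adversarial knowledge transfer based on symmetric KL divergence, which together improve alignment robustness and optimization stability.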
Abstract: Personalized Federated Learning (PFL) aims to deliver effective client-specific models under heterogeneous distributions, yet existing methods suffer from shallow prototype alignment and brittle server-side distillation. We propose HEART-PFL, a dual-sided framework that (i) performs depth-aware Hierarchical Directional Alignment (HDA) using cosine similarity in the early stage and MSE matching in the deep stage to preserve client specificity, and (ii) stabilizes global updates through Adversarial Knowledge Transfer (AKT) with symmetric KL distillation on clean and adversarial proxy data. Using lightweight adapters with only 1.46M trainable parameters, HEART-PFL achieves state-of-the-art personalized accuracy on CIFAR-100, Flowers-102, and Caltech-101 (63.42%, 84.23%, and 95.67%, respectively) under Dirichlet non-IID partitions, and remains robust to out-of-domain proxy data. Ablation studies further confirm that HDA and AKT provide complementary gains in alignment, robustness, and optimization stability, offering insights into how the two components mutually reinforce effective personalization. Overall, these results demonstrate that HEART-PFL simultaneously enhances personalization and global stability, highlighting its potential as a strong and scalable solution for PFL (code available at https://github.com/danny0628/HEART-PFL).
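The three loss terms named in the abstract (cosine alignment for early features, MSE matching for deep features, and symmetric KL for distillation) can be written out in a few lines. The sketch uses their textbook definitions on plain Python lists; the 0.5 averaging in the symmetric KL and the function names are assumptions, and how HEART-PFL weights or schedules these terms is not shown.

```python
import math


def cosine_loss(a, b):
    """1 - cosine similarity: penalizes direction mismatch, ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mse_loss(a, b):
    """Mean squared error: penalizes both direction and magnitude mismatch."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def symmetric_kl(logits_p, logits_q):
    """Average of KL(p || q) and KL(q || p) between two softmax distributions."""
    p, q = softmax(logits_p), softmax(logits_q)
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


early = cosine_loss([1.0, 0.0], [2.0, 0.0])    # same direction, different scale
deep = mse_loss([1.0, 2.0], [1.0, 4.0])
skl = symmetric_kl([2.0, 0.0], [0.0, 2.0])
```

The cosine term is zero for features that differ only in scale, whereas the MSE term is not, which is one plausible reading of why the two are applied at different depths.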
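[76] RVLM: Recursive Vision-Language Models with Adaptive Depth cs.CV [PDF]
Nicanor Mayumu, Zeenath Khan, Melodena Stephens, Patrick Mukala, Farhad Oroumchian
TL;DR: This paper proposes the RVLM (Recursive Vision-Language Model) framework to address two core limitations of medical AI systems: the black-box, single-pass inference of conventional vision-language models, which cannot be audited, and the fixed compute budget of iterative reasoning systems, which trades off efficiency against depth poorly. RVLM runs a generate-execute loop in which each step writes Python code, invokes vision sub-agents, manipulates images, and accumulates evidence, so that every diagnostic claim is grounded in executable code and meets clinical auditability requirements. Its RRouter component uses a lightweight controller to predict the optimal iteration depth from task complexity and terminates early when reasoning stalls, improving computational efficiency.
Details
Motivation: To address the lack of explainability and auditability of conventional vision-language models in medical AI, and the inability of existing iterative reasoning systems, with their fixed iteration budgets, to save compute on simple cases while providing enough depth on complex ones.
Result: Evaluation uses Gemini 2.5 Flash without fine-tuning on BraTS 2023 Meningioma (brain MRI) and MIMIC-CXR (chest X-ray). RVLM shows high consistency on salient findings (such as mass presence and enhancement) and can detect cross-modal discrepancies between FLAIR signal characteristics and segmentation boundaries. On MIMIC-CXR it generates structured reports and correctly recognizes view-specific artefacts.
Insight: The claimed novelty is replacing single-pass inference with an auditable generate-execute loop that grounds diagnoses in executable code, combined with adaptive depth control (RRouter) that dynamically sets the iteration budget. Viewed objectively, combining code generation with visual reasoning for transparency, and allocating compute on demand through a lightweight controller, are useful ideas for explainability and efficiency in medical AI.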
Abstract: Medical AI systems face two fundamental limitations. First, conventional vision-language models (VLMs) perform single-pass inference, yielding black-box predictions that cannot be audited or explained in clinical terms. Second, iterative reasoning systems that expose intermediate steps rely on fixed iteration budgets, wasting compute on simple cases while providing insufficient depth for complex ones. We address both limitations with a unified framework. RVLM replaces single-pass inference with an iterative generate-execute loop: at each step, the model writes Python code, invokes vision sub-agents, manipulates images, and accumulates evidence. Every diagnostic claim is grounded in executable code, satisfying auditability requirements of clinical AI governance frameworks. RRouter makes iteration depth adaptive: a lightweight controller predicts the optimal budget from task-complexity features, then monitors progress and terminates early when reasoning stalls. We evaluate on BraTS 2023 Meningioma (brain MRI) and MIMIC-CXR (chest X-ray) using Gemini 2.5 Flash without fine-tuning. Across repeated runs, RVLM shows high consistency on salient findings (e.g., mass presence and enhancement) and can detect cross-modal discrepancies between Fluid-Attenuated Inversion Recovery (FLAIR) signal characteristics and segmentation boundaries. On MIMIC-CXR, it generates structured reports and correctly recognises view-specific artefacts. Code: https://github.com/nican2018/rvlm.
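The adaptive-depth behaviour attributed to RRouter (run up to a budget, stop early when progress stalls) can be sketched as a small control loop. This is a hypothetical illustration, not code from the RVLM repository: `step` stands in for one generate-execute iteration that returns an evidence score, and the `patience` and `min_gain` stall criteria are assumptions. Predicting the budget from task-complexity features is not modelled.

```python
def run_adaptive(step, budget, patience=2, min_gain=1e-3):
    """Call step(i) up to `budget` times; stop once the score stops improving.

    A step counts as progress only if it beats the best score by more than
    `min_gain`; after `patience` consecutive non-improving steps the loop ends.
    """
    best = float("-inf")
    stalled = 0
    history = []
    for i in range(budget):
        score = step(i)
        history.append(score)
        if score > best + min_gain:
            best = score
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                break  # reasoning has stalled: terminate early
    return history


# Toy evidence scores: progress stops after the second iteration.
scores = [0.2, 0.5, 0.5, 0.5, 0.9]
used = run_adaptive(lambda i: scores[i], budget=5)
short = run_adaptive(lambda i: scores[i], budget=2)
```

In the toy run the loop stops after four of five allowed iterations and never reaches the late improvement, which illustrates the trade-off that the `patience` setting controls.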
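[77] Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning cs.CV [PDF]
Tommaso Galliena, Stefano Rosa, Tommaso Apicella, Pietro Morerio, Alessio Del Bue
TL;DR: This paper proposes a unified, memory-augmented vision-language agent for persistent and semantically consistent object captioning over long sequences. The model handles data association, object captioning, and exploration policy in a single autoregressive framework, using RGB observations, an explored map, and an object-level episodic memory to keep object identity and semantics consistent.
Details
Motivation: Existing vision-language models often describe the same object inconsistently as the viewpoint changes, which limits the ability of embodied agents to build consistent semantic representations. Traditional approaches use offline multi-view aggregation or multi-stage pipelines, which have limited reasoning capacity and cannot effectively handle previously observed objects.
Result: On a manually annotated object-level test set, the method improves over baseline models by up to +11.86% in standard captioning scores and +7.39% in caption self-similarity, while a compact scene representation enables scalable performance.
Insight: The novelties include unifying data association, object captioning, and exploration policy in a single autoregressive framework; serializing an object-level episodic memory to preserve object identity and semantic consistency; and self-supervised training with a disagreement-based policy and a pseudo-captioning model that enforces consistency across multi-view caption histories.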
Abstract: Vision-Language Models (VLMs) often yield inconsistent descriptions of the same object across viewpoints, hindering the ability of embodied agents to construct consistent semantic representations over time. Previous methods resolved inconsistencies using offline multi-view aggregation or multi-stage pipelines that decouple exploration, data association, and caption learning, with limited capacity to reason over previously observed objects. In this paper, we introduce a unified, memory-augmented Vision-Language agent that simultaneously handles data association, object captioning, and exploration policy within a single autoregressive framework. The model processes the current RGB observation, a top-down explored map, and an object-level episodic memory serialized into object-level tokens, ensuring persistent object identity and semantic consistency across extended sequences. To train the model in a self-supervised manner, we collect a dataset in photorealistic 3D environments using a disagreement-based policy and a pseudo-captioning model that enforces consistency across multi-view caption histories. Extensive evaluation on a manually annotated object-level test set demonstrates improvements of up to +11.86% in standard captioning scores and +7.39% in caption self-similarity over baseline models, while enabling scalable performance through a compact scene representation. Code, model weights, and data are available at https://github.com/hsp-iit/epos-vlm.
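[78] Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep cs.CV | cs.AI [PDF]
Tianyi Liu, Ye Lu, Linfeng Zhang, Chen Cai, Jianjun Gao
TL;DR: This paper proposes HetCache, a training-free diffusion acceleration framework for diffusion-based video editing. By analyzing the heterogeneity of spatio-temporal tokens in the diffusion model, it divides tokens into context tokens and generative tokens and selectively caches the context tokens that correlate most strongly with the generative tokens and are most semantically representative. This reduces redundant attention computation and yields a significant speedup while preserving editing consistency and fidelity.
Details
Motivation: Diffusion-based video editing methods built on Diffusion Transformers (DiT) are computationally expensive, mainly because of the iterative denoising process. Existing acceleration methods mostly reuse features at the level of denoising timesteps but overlook redundancy inside the DiT architecture: many attention operations over spatio-temporal tokens are executed redundantly and contribute little to the model output.
Result: Experiments show that HetCache delivers a significant speedup on commonly used foundation models, including a 2.67x latency speedup and a reduction in FLOPs, with negligible degradation in editing quality.
Insight: The core novelty is accelerating from the angle of redundancy inside the model architecture rather than timestep redundancy alone. Guided by spatial priors, the method analyzes and classifies spatio-temporal tokens by heterogeneity (context tokens vs. generative tokens) and caches selectively based on contextual relevance and interaction strength, giving a training-free acceleration strategy that preserves generation quality.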
Abstract: Diffusion-based video editing has emerged as an important paradigm for high-quality and flexible content generation. However, despite their generality and strong modeling capacity, Diffusion Transformers (DiT) remain computationally expensive due to the iterative denoising process, posing challenges for practical deployment. Existing video diffusion acceleration methods primarily exploit denoising timestep-level feature reuse, which mitigates the redundancy in the denoising process but overlooks the architectural redundancy within the DiT: many attention operations over spatio-temporal tokens are redundantly executed, offering little to no incremental contribution to the model output. This work introduces HetCache, a training-free diffusion acceleration framework designed to exploit the inherent heterogeneity in diffusion-based masked video-to-video (MV2V) generation and editing. Instead of uniformly reusing or randomly sampling tokens, HetCache assesses the contextual relevance and interaction strength among various types of tokens in designated computing steps. Guided by spatial priors, it divides the spatial-temporal tokens in the DiT model into context and generative tokens, and selectively caches the context tokens that exhibit the strongest correlation and most representative semantics with generative ones. This strategy reduces redundant attention operations while maintaining editing consistency and fidelity. Experiments show that HetCache achieves a noticeable acceleration, including a 2.67× latency speedup and FLOPs reduction over commonly used foundation models, with negligible degradation in editing quality.
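The selection step (keep the context tokens most related to the generative tokens) can be illustrated with a toy ranking. This is a sketch under stated assumptions: scoring by mean cosine similarity, the top-k rule, and the function names are hypothetical stand-ins for the contextual-relevance and interaction-strength criteria that the abstract mentions without specifying.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def select_context_tokens(context, generative, k):
    """Return indices of the k context tokens most similar to the generative tokens.

    Each context token is scored by its mean cosine similarity to all
    generative tokens (an assumed relevance measure).
    """
    scores = [sum(cosine(c, g) for g in generative) / len(generative)
              for c in context]
    top = sorted(range(len(context)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy context-token features
generative = [[1.0, 0.0]]                        # toy generative-token features
kept = select_context_tokens(context, generative, k=2)
```

Only the selected context tokens would then take part in attention with the generative tokens, which is where the saving in attention operations would come from.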
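[79] ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors cs.CV [PDF]
Haodong Yu, Yabo Zhang, Donglin Di, Ruyi Zhang, Wangmeng Zuo
TL;DR: ScrollScape is a novel framework that reformulates extreme aspect ratio (EAR) image generation as a continuous video generation process, using the priors of video diffusion models to generate ultra-high-resolution (32K) images and resolving the structural collapse that existing diffusion models exhibit on such images.
Details
Motivation: When generating ultra-high-resolution images at extreme aspect ratios (EAR), existing diffusion models lack robust spatial priors and often suffer catastrophic structural failures such as object repetition and spatial fragmentation.
Result: In extensive evaluations, ScrollScape significantly outperforms existing image-diffusion baselines, eliminating severe localized artifacts and ensuring strong global coherence and visual fidelity across diverse domains at extreme scales.
Insight: The core novelty is mapping spatial expansion to temporal evolution, using the inherent temporal consistency of video models as a global constraint. The specific techniques are Scanning Positional Encoding (ScanPE) and Scrolling Super-Resolution (ScrollSR), which make efficient use of video priors and circumvent memory bottlenecks.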
Abstract: While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial fragmentation. This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional dimensions. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core innovations. By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural integrity. Specifically, Scanning Positional Encoding (ScanPE) distributes global coordinates across frames to act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) leverages video super-resolution priors to circumvent memory bottlenecks, efficiently scaling outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations demonstrate that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, our method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.
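The "moving camera" intuition behind mapping a wide canvas onto video frames can be shown with a toy coordinate schedule. This is only an illustration of the idea, assuming a horizontal sweep with evenly spaced window offsets; it is not ScanPE itself, and the function name and parameters are hypothetical.

```python
def scan_windows(canvas_w, frame_w, num_frames):
    """Assign each frame the global x-range (start, end) it covers on the canvas.

    Offsets are spaced evenly so that the first frame starts at 0 and the
    last frame ends at canvas_w, like a camera panning across the canvas.
    """
    if num_frames == 1:
        return [(0, frame_w)]
    span = canvas_w - frame_w
    windows = []
    for t in range(num_frames):
        start = round(t * span / (num_frames - 1))
        windows.append((start, start + frame_w))
    return windows


# A canvas 100 units wide, viewed through 4 frames that are each 40 units wide.
windows = scan_windows(canvas_w=100, frame_w=40, num_frames=4)
```

Consecutive windows overlap, so neighbouring frames share content; temporal consistency between frames then acts as spatial consistency across the canvas.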
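[80] TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification cs.CV [PDF]
Guan Luo, Xiu Li, Rui Chen, Xuanyu Yi, Jing Lin
TL;DR: TopoMesh is a sparse voxel-based variational autoencoder (VAE) that aligns ground-truth and predicted meshes under a unified Dual Marching Cubes (DMC) topological framework, achieving high fidelity in 3D mesh reconstruction.
Details
Motivation: Existing VAEs for 3D generation suffer from a representation mismatch: ground-truth meshes have arbitrary topology, while VAEs typically predict fixed-structure implicit fields (such as SDF). This prevents explicit mesh-level correspondences and makes it hard to preserve sharp features and geometric detail.
Result: In extensive experiments, TopoMesh significantly outperforms existing VAEs in reconstruction fidelity, with superior preservation of sharp features and geometric details.
Insight: The novelty lies in unifying the topology of input and output meshes through the DMC framework, establishing explicit vertex-level and face-level correspondences that provide clear mesh-level supervision. The method also uses a sparse VAE architecture, Teacher Forcing, and progressive resolution training for stable convergence.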
Abstract: The dominant paradigm for high-fidelity 3D generation relies on a VAE-Diffusion pipeline, where the VAE’s reconstruction capability sets a firm upper bound on generation quality. A fundamental challenge limiting existing VAEs is the representation mismatch between ground-truth meshes and network predictions: GT meshes have arbitrary, variable topology, while VAEs typically predict fixed-structure implicit fields (e.g., SDF on regular grids). This inherent misalignment prevents establishing explicit mesh-level correspondences, forcing prior work to rely on indirect supervision signals such as SDF or rendering losses. Consequently, fine geometric details, particularly sharp features, are poorly preserved during reconstruction. To address this, we introduce TopoMesh, a sparse voxel-based VAE that unifies both GT and predicted meshes under a shared Dual Marching Cubes (DMC) topological framework. Specifically, we convert arbitrary input meshes into DMC-compliant representations via a remeshing algorithm that preserves sharp edges using an L∞ distance metric. Our decoder outputs meshes in the same DMC format, ensuring that both predicted and target meshes share identical topological structures. This establishes explicit correspondences at the vertex and face level, allowing us to derive explicit mesh-level supervision signals for topology, vertex positions, and face orientations with clear gradients. Our sparse VAE architecture employs this unified framework and is trained with Teacher Forcing and progressive resolution training for stable and efficient convergence. Extensive experiments demonstrate that TopoMesh significantly outperforms existing VAEs in reconstruction fidelity, achieving superior preservation of sharp features and geometric details.
[81] VERIA: Verification-Centric Multimodal Instance Augmentation for Long-Tailed 3D Object Detection cs.CVPDF
Jumin Lee, Siyeong Lee, Namil Kim, Sung-Eui Yoon
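TL;DR: VERIA is a verification-centric multimodal instance augmentation framework that addresses the sparsity of rare-class samples caused by long-tail distributions in autonomous driving datasets. It synthesizes synchronized RGB-LiDAR instances with off-the-shelf foundation models and filters them through sequential semantic and geometric verification to improve rare-class 3D object detection.
Details
Motivation: Long-tail distributions in driving datasets challenge 3D perception: rare-class samples are sparse, and existing augmentation methods are limited in fine-grained diversity and scene-context placement, so a more effective augmentation strategy is needed.
Result: On nuScenes and Lyft, VERIA improves rare-class 3D object detection in both LiDAR-only and multimodal settings. The abstract gives no specific figures.
Insight: The novelty is an image-first multimodal synthesis framework that combines sequential semantic and geometric verification to improve the diversity and realism of instances, with a stage-wise yield decomposition that serves as a diagnostic of pipeline reliability. The verification-centric design is a reusable idea for making data augmentation more robust.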
Abstract: Long-tail distributions in driving datasets pose a fundamental challenge for 3D perception, as rare classes exhibit substantial intra-class diversity yet available samples cover this variation space only sparsely. Existing instance augmentation methods based on copy-paste or asset libraries improve rare-class exposure but are often limited in fine-grained diversity and scene-context placement. We propose VERIA, an image-first multimodal augmentation framework that synthesizes synchronized RGB–LiDAR instances using off-the-shelf foundation models and curates them with sequential semantic and geometric verification. This verification-centric design tends to select instances that better match real LiDAR statistics while spanning a wider range of intra-class variation. Stage-wise yield decomposition provides a log-based diagnostic of pipeline reliability. On nuScenes and Lyft, VERIA improves rare-class 3D object detection in both LiDAR-only and multimodal settings. Our code is available at https://sgvr.kaist.ac.kr/VERIA/.
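The "stage-wise yield decomposition" can be illustrated with a small calculation. The abstract does not give VERIA's formulation, so the sketch below assumes the simplest reading: each verification stage keeps a fraction of its input, the overall yield is the product of the stage yields, and taking logarithms turns that product into a sum that shows which stage loses the most instances. Stage names and counts are hypothetical.

```python
import math

def stage_yields(counts):
    """counts[0] is the number of generated instances; counts[i] is the
    number that survive stage i. Returns the per-stage survival fractions."""
    return [after / before for before, after in zip(counts, counts[1:])]

def log_yield_report(stage_names, counts):
    """Per-stage log-yields plus the overall log-yield (their sum)."""
    report = {name: math.log(y) for name, y in zip(stage_names, stage_yields(counts))}
    report["overall"] = math.log(counts[-1] / counts[0])
    return report

# Hypothetical pipeline: 1000 synthesized, 600 pass the semantic check,
# 450 of those pass the geometric check.
report = log_yield_report(["semantic", "geometric"], [1000, 600, 450])
for stage, value in report.items():
    print(f"{stage}: log-yield = {value:.3f}")
```

The most negative per-stage term marks the stage that discards the largest share of candidates.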
[82] RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation cs.CVPDF
Kai Zhu, Zhenyu Cui, Zehua Zang, Jiahuan Zhou
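TL;DR: This paper proposes the Refining Specifics State Space Model (RS-SSM) for video semantic segmentation, targeting the specific information that state space models forget during compression. A Channel-wise Amplitude Perceptron (CwAP) extracts and aligns the distribution characteristics of specific information in the state space, and a Forgetting Gate Information Refiner (FGIR) adaptively inverts and refines the forgetting gate matrix. Together they complementarily recover forgotten spatiotemporal details and improve pixel-level segmentation.
Details
Motivation: State space models achieve efficient video segmentation through linear-complexity state space compression, but video semantic segmentation needs pixel-level spatiotemporal modeling to keep the segmentation of semantic objects temporally consistent. A fixed-size state space inevitably forgets specific information during compression, which limits pixel-level segmentation.
Result: Extensive experiments on four video semantic segmentation benchmarks show that RS-SSM achieves state-of-the-art performance while maintaining high computational efficiency.
Insight: The novelty is to extract the distribution of specific information with CwAP and to invert the forgetting gate with FGIR in order to refine forgotten spatiotemporal details. This complementarily recovers the specific information lost during state space compression and strengthens pixel-level spatiotemporal modeling.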
Abstract: Recently, state space models have demonstrated efficient video segmentation through linear-complexity state space compression. However, Video Semantic Segmentation (VSS) requires pixel-level spatiotemporal modeling capabilities to maintain temporal consistency in segmentation of semantic objects. While state space models can preserve common semantic information during state space compression, the fixed-size state space inevitably forgets specific information, which limits the models’ capability for pixel-level segmentation. To tackle this issue, we propose a Refining Specifics State Space Model approach (RS-SSM) for video semantic segmentation, which performs complementary refining of forgotten spatiotemporal specifics. Specifically, a Channel-wise Amplitude Perceptron (CwAP) is designed to extract and align the distribution characteristics of specific information in the state space. In addition, a Forgetting Gate Information Refiner (FGIR) is proposed to adaptively invert and refine the forgetting gate matrix in the state space model based on the specific information distribution. Consequently, our RS-SSM leverages the inverted forgetting gate to complementarily refine the specific information forgotten during state space compression, thereby enhancing the model’s capability for spatiotemporal pixel-level segmentation. Extensive experiments on four VSS benchmarks demonstrate that our RS-SSM achieves state-of-the-art performance while maintaining high computational efficiency. The code is available at https://github.com/zhoujiahuan1991/CVPR2026-RS-SSM.
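To make the idea of an "inverted forgetting gate" concrete, the sketch below uses a generic gated recurrence, h_t = f * h_{t-1} + (1 - f) * x_t, and is not the RS-SSM, CwAP, or FGIR implementation. Under that assumed recurrence, the part of the previous state that is discarded at each step is (1 - f) * h_{t-1}, so the complement of the gate indexes exactly what was forgotten. All names and values are hypothetical.

```python
def gated_step(h_prev, x, f):
    """One step of an assumed element-wise gated recurrence.

    h_prev, x, f are equal-length lists; every gate value f[i] is in [0, 1].
    Returns the new state and the part of h_prev that the gate discarded.
    """
    kept = [fi * hi for fi, hi in zip(f, h_prev)]
    # The inverted gate (1 - f) selects what the state update throws away.
    forgotten = [(1.0 - fi) * hi for fi, hi in zip(f, h_prev)]
    h_new = [ki + (1.0 - fi) * xi for ki, fi, xi in zip(kept, f, x)]
    return h_new, forgotten

# Hypothetical two-channel state: channel 0 is mostly kept, channel 1 mostly forgotten.
h_new, forgotten = gated_step(h_prev=[1.0, 2.0], x=[0.0, 0.0], f=[0.9, 0.25])
print("new state:", h_new)
print("forgotten:", forgotten)
```

A refinement module in this spirit would read from `forgotten` rather than from the compressed state alone; how RS-SSM actually does this is described in the paper, not here.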
[83] AMIF: Authorizable Medical Image Fusion Model with Built-in Authentication cs.CVPDF
Jie Song, Jun Jia, Wei Sun, Wangqiu Zhou, Tao Tan
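TL;DR: This paper proposes AMIF, the first authorizable medical image fusion model with built-in authentication. It integrates authorization access control into the image fusion objective to protect model intellectual property and sensitive training data. For unauthorized use, AMIF embeds explicit, visible copyright identifiers into the fusion results; after successful key-based authentication, users obtain high-quality fusion results.
Details
Motivation: Current multimodal image fusion models lack built-in mechanisms for protecting intellectual property, so proprietary model knowledge and sensitive training data can be exploited through inference leakage, for example by approximating a proprietary model's fusion performance via model distillation or inference-based reverse engineering.
Result: The abstract reports no specific quantitative results or benchmark comparisons.
Insight: The novelty is to integrate an authorization access control mechanism into a medical image fusion model for the first time: key-based authentication gates access to high-quality fusion results, and unauthorized use yields outputs with visible copyright identifiers. This offers a new approach to protecting the intellectual property of medical image analysis models.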
Abstract: Multimodal image fusion enables precise lesion localization and characterization for accurate diagnosis, thereby strengthening clinical decision-making and driving its growing prominence in medical imaging research. A powerful multimodal image fusion model relies on high-quality, clinically representative multimodal training data and a rigorously engineered model architecture. Therefore, the development of such professional radiomics models represents a collaborative achievement grounded in standardized acquisition, clinical-specific expertise, and algorithmic design proficiency, which necessitates protection of associated intellectual property rights. However, current multimodal image fusion models generate fused outputs without built-in mechanisms to safeguard intellectual property rights, inadvertently exposing proprietary model knowledge and sensitive training data through inference leakage. For example, malicious users can exploit fusion outputs and model distillation or other inference-based reverse engineering techniques to approximate the fusion performance of proprietary models. To address this issue, we propose AMIF, the first Authorizable Medical Image Fusion model with built-in authentication, which integrates authorization access control into the image fusion objective. For unauthorized usage, AMIF embeds explicit and visible copyright identifiers into fusion results. In contrast, high-quality fusion results are accessible upon successful key-based authentication.
[84] Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions cs.CVPDF
Shiqin Wang, Haoyang Chen, Huaizhou Huang, Yinkan He, Dongfang Sun
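TL;DR: This paper proposes a heuristic self-paced learning method for domain adaptive semantic segmentation under adverse weather conditions. It models curriculum learning as a sequential decision problem: an autonomous class scheduler dynamically adjusts the order in which classes are learned and, combined with mixed source-target supervision, focuses the network on the most informative classes at each stage, improving segmentation under adverse conditions.
Details
Motivation: Existing curriculum learning methods usually rely on handcrafted heuristics (such as fixed uncertainty metrics) and static schedules. They cannot adapt to the high-dimensional dynamics of model training, which leads to category bias and poor performance, particularly in domain adaptation for semantic segmentation under adverse weather.
Result: The method achieves state-of-the-art performance on three widely used benchmarks (ACDC, Dark Zurich, and Nighttime Driving) and shows generalization ability in synthetic-to-real semantic segmentation.
Insight: The novelty is to recast curriculum learning as a sequential decision problem in the reinforcement learning sense. The autonomous scheduler combines a high-dimensional state encoder with a category-fair policy-gradient objective, so it can extract key features of the training state and balance improvement across classes, giving a more adaptive learning process.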
Abstract: The learning order of semantic classes significantly impacts unsupervised domain adaptation for semantic segmentation, especially under adverse weather conditions. Most existing curricula rely on handcrafted heuristics (e.g., fixed uncertainty metrics) and follow a static schedule, which fails to adapt to a model’s evolving, high-dimensional training dynamics, leading to category bias. Inspired by Reinforcement Learning, we cast curriculum learning as a sequential decision problem and propose an autonomous class scheduler. This scheduler consists of two components: (i) a high-dimensional state encoder that maps the model’s training status into a latent space and distills key features indicative of progress, and (ii) a category-fair policy-gradient objective that ensures balanced improvement across classes. Coupled with mixed source-target supervision, the learned class rankings direct the network’s focus to the most informative classes at each stage, enabling more adaptive and dynamic learning. Our method achieves state-of-the-art performance on three widely used benchmarks (ACDC, Dark Zurich, and Nighttime Driving) and shows generalization ability in synthetic-to-real semantic segmentation.
[85] Counting Without Numbers & Finding Without Words cs.CV | cs.AI | cs.CL | cs.SIPDF
Badri Narayana Patro
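TL;DR: This paper presents the first multimodal pet reunification system, integrating visual and acoustic biometrics to move beyond existing systems that rely on appearance matching alone. A species-adaptive acoustic architecture handles vocalizations from 10 Hz elephant rumbles to 4 kHz puppy whines and is paired with probabilistic visual matching that tolerates stress-induced appearance changes, with the aim of raising the rate at which lost pets are reunited with their guardians.
Details
Motivation: Ten million pets enter shelters each year and 70% are never reunited with their guardians. Existing systems match on appearance only, whereas animals recognize each other through sound. The work therefore aims to build a multimodal AI system grounded in biological communication principles that can serve vulnerable populations lacking human language.
Result: The abstract reports no specific quantitative results or benchmarks. It argues that integrating acoustic and visual biometrics shows the potential of AI grounded in biological communication principles for pet reunification.
Insight: The novelty is to bring two findings from cognitive science into an AI system, namely that animals perceive quantity approximately and communicate identity acoustically, and to design a species-adaptive multimodal architecture that handles vocalizations across a wide frequency range and tolerates appearance changes. This suggests a new paradigm for how computer vision treats vocalizing species.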
Abstract: Every year, 10 million pets enter shelters, separated from their families. Despite desperate searches by both guardians and lost animals, 70% never reunite, not because matches do not exist, but because current systems look only at appearance, while animals recognize each other through sound. We ask, why does computer vision treat vocalizing species as silent visual objects? Drawing on five decades of cognitive science showing that animals perceive quantity approximately and communicate identity acoustically, we present the first multimodal reunification system integrating visual and acoustic biometrics. Our species-adaptive architecture processes vocalizations from 10Hz elephant rumbles to 4kHz puppy whines, paired with probabilistic visual matching that tolerates stress-induced appearance changes. This work demonstrates that AI grounded in biological communication principles can serve vulnerable populations that lack human language.
[86] Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing cs.CV | cs.AI | cs.IRPDF
Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang
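TL;DR: This paper proposes PaddleOCR-VL, a new document parsing architecture built on coarse-to-fine visual processing. A lightweight Valid Region Focus Module (VRFM) identifies semantically relevant regions and suppresses redundant ones, reducing the number of vision tokens and the computational cost. On top of it, a compact yet powerful 0.9B vision-language model performs fine-grained recognition, which markedly improves document parsing performance while keeping inference efficient.
Details
Motivation: Document parsing is a fine-grained task. High-resolution input improves performance but makes the number of vision tokens grow quadratically and raises computational cost substantially. Existing methods also spend much of that cost on redundant visual regions such as background, which makes them inefficient.
Result: Extensive experiments show that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition, significantly outperforms existing solutions, is strongly competitive with top-tier vision-language models, and delivers fast inference with fewer vision tokens and parameters.
Insight: The core novelty is the coarse-to-fine visual processing architecture: VRFM steers the model toward semantically relevant regions and avoids processing the entire large image directly, striking a good balance between efficiency and accuracy. This targeted strategy offers a new approach to efficient and accurate document understanding.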
Abstract: Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial visual regions redundancy in document images, like background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.
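The "quadratic increase in the number of vision tokens" can be checked with simple arithmetic. The sketch below assumes a ViT-style encoder that cuts the image into square patches and emits one token per patch; the patch size of 14 and the region box are hypothetical values chosen for illustration, not PaddleOCR-VL settings.

```python
import math

def vision_tokens(height, width, patch=14):
    """Patch-token count for an image, one token per patch (assumed encoder)."""
    return math.ceil(height / patch) * math.ceil(width / patch)

def region_tokens(boxes, patch=14):
    """Token count when only the given (x0, y0, x1, y1) boxes are encoded."""
    return sum(vision_tokens(y1 - y0, x1 - x0, patch) for x0, y0, x1, y1 in boxes)

base = vision_tokens(448, 448)
doubled = vision_tokens(896, 896)
print("448x448 tokens:", base)
print("896x896 tokens:", doubled, "ratio:", doubled / base)

# A hypothetical page where only one text band matters.
print("text-band tokens:", region_tokens([(0, 0, 448, 140)]))
```

Doubling each side of the image quadruples the token count, which is why restricting recognition to valid regions, as the coarse stage does, can save a large share of the compute.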
[87] Le MuMo JEPA: Multi-Modal Self-Supervised Representation Learning with Learnable Fusion Tokens cs.CVPDF
Ciem Cornelissen, Sam Leroux, Pieter Simoens
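TL;DR: Le MuMo JEPA is a multimodal self-supervised representation learning framework that uses learnable fusion tokens inside a shared Transformer to integrate RGB images with aligned companion modalities (such as LiDAR depth or thermal imaging) and learn unified representations. It extends LeJEPA with a pruned fusion strategy: after cross-modal attention, modality-specific tokens are dropped, which forces information into the shared fusion-token grid as a latent bottleneck, and SIGReg regularization is then applied. On Waymo and nuScenes it achieves the best performance-efficiency trade-off on downstream tasks such as detection, depth estimation, and segmentation, and it also performs strongly on the FLIR benchmark.
Details
Motivation: Most existing self-supervised learning methods are limited to a single modality and fail to exploit the complementary structure of heterogeneous sensors (such as RGB with LiDAR depth or thermal imaging). The goal is to learn stronger unified representations through multimodal fusion.
Result: On Waymo and nuScenes, Le MuMo JEPA gives the strongest performance-efficiency trade-off on downstream patch probes among from-scratch multimodal baselines, improving CenterNet detection and dense depth estimation while remaining competitive on segmentation. It also gives the best results on the FLIR benchmark, especially after Waymo-initialized fine-tuning, and keeps the best overall accuracy-efficiency balance at substantially lower compute, memory, and estimated training time.
Insight: The novelties are the learnable fusion tokens that act as a cross-modal latent bottleneck and the pruned fusion strategy, which drops modality-specific tokens to force information integration; both improve the efficiency and quality of multimodal representation learning. Viewed objectively, the shared Transformer and SIGReg regularization integrate heterogeneous modalities effectively and offer a scalable, efficient approach to multimodal self-supervised learning.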
Abstract: Self-supervised learning has emerged as a powerful paradigm for learning visual representations without manual annotations, yet most methods still operate on a single modality and therefore miss the complementary structure available from heterogeneous sensors. We present Le MuMo JEPA, a self-supervised framework that learns unified representations from RGB images and aligned companion modalities. In our driving experiments, the second modality is camera-aligned LiDAR depth; we also evaluate RGB-thermal training and transfer on the Teledyne FLIR ADAS benchmark. Our approach extends LeJEPA to the multi-modal setting by learning fusion tokens that act as a latent bottleneck between modality-specific patch stems inside a shared transformer. Our default model employs a pruned fusion strategy: after an initial cross-modal attention layer, modality-specific tokens are dropped, forcing cross-modal information into the shared fusion-token grid as an efficient latent bottleneck before Sketched Isotropic Gaussian Regularization (SIGReg) is applied to the joint multimodal CLS embedding. On Waymo, Le MuMo JEPA gives the strongest performance-efficiency trade-off on downstream patch probes among the from-scratch multimodal baselines, improving CenterNet detection and dense depth while remaining competitive on segmentation. Under from-scratch training on nuScenes, Le MuMo JEPA remains the strongest model, and it also gives the best FLIR results, especially after Waymo-initialized fine-tuning. It also retains the best overall accuracy-efficiency balance in our study at substantially lower compute, memory, and estimated training time.
[88] Language-Guided Structure-Aware Network for Camouflaged Object Detection cs.CV | cs.AIPDF
Min Zhang
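TL;DR: This paper proposes a Language-Guided Structure-Aware Network (LGSAN) for camouflaged object detection (COD). Built on the PVT-v2 visual backbone, it uses CLIP to generate masks from text prompts and RGB images, which guide the multi-scale features toward potential target regions. It also adds a Fourier Edge Enhancement Module (FEEM) that integrates multi-scale features with high-frequency information in the frequency domain, a Structure-Aware Attention Module (SAAM) that strengthens perception of object structures and boundaries, and a Coarse-Guided Local Refinement Module (CGLRM) that improves fine-grained reconstruction and boundary integrity of camouflaged regions.
Details
Motivation: Existing camouflaged object detection methods generally lack guidance from textual semantic priors, which limits their ability to focus on camouflaged regions in complex scenes. The paper addresses this by introducing language guidance and structure-aware mechanisms.
Result: Extensive experiments on multiple COD datasets show that the method consistently achieves highly competitive performance, supporting its effectiveness and robustness.
Insight: The novelty is to bring CLIP's text-image alignment into COD, using masks generated from text prompts as priors that guide the visual features. Frequency-domain analysis (FEEM) and structure-aware attention (SAAM) then enhance edge and structural information, and the local refinement module (CGLRM) improves detail reconstruction, combining language-vision fusion with stronger structural awareness.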
Abstract: Camouflaged Object Detection (COD) aims to segment objects that are highly integrated with the background in terms of color, texture, and structure, making it a highly challenging task in computer vision. Although existing methods introduce multi-scale fusion and attention mechanisms to alleviate the above issues, they generally lack the guidance of textual semantic priors, which limits the model’s ability to focus on camouflaged regions in complex scenes. To address this issue, this paper proposes a Language-Guided Structure-Aware Network (LGSAN). Specifically, based on the visual backbone PVT-v2, we introduce CLIP to generate masks from text prompts and RGB images, thereby guiding the multi-scale features extracted by PVT-v2 to focus on potential target regions. On this foundation, we further design a Fourier Edge Enhancement Module (FEEM), which integrates multi-scale features with high-frequency information in the frequency domain to extract edge enhancement features. Furthermore, we propose a Structure-Aware Attention Module (SAAM) to effectively enhance the model’s perception of object structures and boundaries. Finally, we introduce a Coarse-Guided Local Refinement Module (CGLRM) to enhance fine-grained reconstruction and boundary integrity of camouflaged object regions. Extensive experiments demonstrate that our method consistently achieves highly competitive performance across multiple COD datasets, validating its effectiveness and robustness.
[89] PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks cs.CVPDF
Cheng Cui, Yubo Zhang, Ting Sun, Xueqing Wang, Hongen Liu
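TL;DR: This paper introduces PP-OCRv5, a lightweight OCR system with only 5 million parameters. Through a data-centric study that systematically analyses the difficulty, accuracy, and diversity of training data, it shows that with enough high-quality data a traditional, efficient two-stage OCR pipeline performs far better than commonly expected. It competes with billion-parameter vision-language models on standard OCR benchmarks while offering better text localization precision and fewer hallucinations.
Details
Motivation: The paper addresses the high computational demands, imprecise text localization in complex layouts, and tendency toward textual hallucination of current "OCR 2.0" systems and large-scale vision-language models (VLMs), and it challenges the common belief that model scale is the only path to high accuracy.
Result: PP-OCRv5 achieves performance competitive with many billion-parameter VLMs on standard OCR benchmarks, with better localization precision and fewer hallucinations.
Insight: The main novelty is a data-centric study that quantifies three key dimensions of training data (difficulty, accuracy, and diversity) and shows that high-quality data is decisive in raising the performance ceiling of traditional lightweight OCR models. This is strong evidence for the viability of lightweight, specialized models in the large-model era and offers practical guidance on OCR data curation.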
Abstract: The advent of “OCR 2.0” and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text localization within complex layouts, and a propensity for textual hallucinations. Revisiting the prevailing notion that model scale is the sole path to high accuracy, this paper introduces PP-OCRv5, a meticulously optimized, lightweight OCR system with merely 5 million parameters. We demonstrate that PP-OCRv5 achieves performance competitive with many billion-parameter VLMs on standard OCR benchmarks, while offering superior localization precision and reduced hallucinations. The cornerstone of our success lies not in architectural expansion but in a data-centric investigation. We systematically dissect the role of training data by quantifying three critical dimensions: data difficulty, data accuracy, and data diversity. Our extensive experiments reveal that with a sufficient volume of high-quality, accurately labeled, and diverse data, the performance ceiling for traditional, efficient two-stage OCR pipelines is far higher than commonly assumed. This work provides compelling evidence for the viability of lightweight, specialized models in the large-model era and offers practical insights into data curation for OCR. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.
[90] GeoRouter: Dynamic Paradigm Routing for Worldwide Image Geolocalization cs.CVPDF
Pengyue Jia, Derong Xu, Yingyi Zhang, Xiaopeng Li, Wenlin Zhang
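TL;DR: This paper proposes GeoRouter, a dynamic routing framework for worldwide image geolocalization. It analyses the visual content of each query and adaptively selects the better paradigm (retrieval or generation), combining the fine-grained instance matching of retrieval methods with the robust semantic reasoning of generation methods to improve localization accuracy.
Details
Motivation: Existing worldwide image geolocalization methods fall into two paradigms, retrieval-based and generation-based (using large vision-language models), each with its own limits: retrieval excels at fine-grained matching, generation at semantic reasoning, and neither is best in every case. A dynamic mechanism is therefore needed to choose the best paradigm for each query.
Result: Extensive experiments on the IM2GPS3k and YFCC4k benchmarks show that GeoRouter significantly outperforms existing state-of-the-art methods.
Insight: The novelties are: 1) GeoRouter, a dynamic routing framework that uses an LVLM backbone to analyse visual content and make routing decisions; 2) a distance-aware preference objective that converts the distance gap between paradigms into a continuous supervision signal reflecting their relative performance; 3) GeoRouting, the first large-scale dataset for training routing policies, with independent predictions from each paradigm. Viewed objectively, the work improves system performance by exploiting the complementarity of paradigms and suggests a new way for multiple paradigms to work together.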
Abstract: Worldwide image geolocalization aims to predict precise GPS coordinates for images captured anywhere on Earth, which is challenging due to the large visual and geographic diversity. Recent methods mainly follow two paradigms: retrieval-based approaches that match queries against a reference database, and generation-based approaches that directly predict coordinates using Large Vision-Language Models (LVLMs). However, we observe distinct error profiles between them: retrieval excels at fine-grained instance matching, while generation offers robust semantic reasoning. This complementary heterogeneity suggests that no single paradigm is universally superior. To harness this potential, we propose GeoRouter, a dynamic routing framework that adaptively assigns each query to the optimal paradigm. GeoRouter leverages an LVLM backbone to analyze visual content and provide routing decisions. To optimize GeoRouter, we introduce a distance-aware preference objective that converts the distance gap between paradigms into a continuous supervision signal, explicitly reflecting relative performance differences. Furthermore, we construct GeoRouting, the first large-scale dataset tailored for training routing policies with independent paradigm predictions. Extensive experiments on IM2GPS3k and YFCC4k demonstrate that GeoRouter significantly outperforms state-of-the-art baselines.
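The idea of turning "the distance gap between paradigms into a continuous supervision signal" can be sketched as follows. The abstract does not give GeoRouter's objective, so this is an assumption-laden illustration: each paradigm's prediction is scored by its great-circle (haversine) error, and the gap is squashed by a logistic function into a soft preference for retrieval. The function names, the temperature `tau_km`, and the coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))

def retrieval_preference(truth, retrieval_pred, generation_pred, tau_km=100.0):
    """Soft label in (0, 1): above 0.5 when retrieval is closer to the truth.

    tau_km is an assumed temperature that controls how sharply the label
    saturates as the error gap grows.
    """
    gap = haversine_km(truth, generation_pred) - haversine_km(truth, retrieval_pred)
    return 1.0 / (1.0 + math.exp(-gap / tau_km))

# Hypothetical query taken in Paris: retrieval lands nearby, generation says London.
paris = (48.8566, 2.3522)
print(retrieval_preference(paris, (48.86, 2.35), (51.5074, -0.1278)))
```

A router trained on such soft labels is penalized less for a wrong choice when both paradigms are almost equally accurate than when one is far better, which is the stated purpose of a distance-aware signal.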
[91] ViHOI: Human-Object Interaction Synthesis with Visual Priors cs.CVPDF
Songjin Cai, Linjie Zhong, Ling Guo, Changxing Ding
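TL;DR: The paper proposes ViHOI, a framework that uses visual priors extracted from 2D images to generate realistic and physically plausible 3D human-object interaction (HOI) motion. A large vision-language model (VLM) serves as the prior-extraction engine, a layer-decoupled strategy yields visual and textual priors, and a Q-Former-based adapter compresses the features that condition a diffusion model. At inference time, reference images synthesized by a text-to-image model improve generalization to unseen objects and interaction categories.
Details
Motivation: Generating realistic and physically plausible 3D HOI motion is a key challenge because physical constraints are hard to describe in words alone. The paper therefore extracts rich interaction priors from easily accessible 2D images to overcome this limitation.
Result: Experiments show that ViHOI achieves state-of-the-art performance on multiple benchmarks, outperforms existing methods, and generalizes better.
Insight: The novelties are a new paradigm of extracting visual priors from 2D images, the use of a VLM as the prior-extraction engine, the layer-decoupled strategy and Q-Former adapter for compressing features, and synthesized reference images for better generalization. Viewed objectively, combining visual priors with a diffusion model gives HOI generation richer conditioning and helps with physical constraints that are hard to describe in text.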
Abstract: Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction priors from easily accessible 2D images. Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. We utilize a large Vision-Language Model (VLM) as a powerful prior-extraction engine and adopt a layer-decoupled strategy to obtain visual and textual priors. Concurrently, we design a Q-Former-based adapter that compresses the VLM’s high-dimensional features into compact prior tokens, which significantly facilitates the conditional training of our diffusion model. Our framework is trained on motion-rendered images from the dataset to ensure strict semantic alignment between visual inputs and motion sequences. During inference, it leverages reference images synthesized by a text-to-image generation model to improve generalization to unseen objects and interaction categories. Experimental results demonstrate that ViHOI achieves state-of-the-art performance, outperforming existing methods across multiple benchmarks and demonstrating superior generalization.
[92] Causal Transfer in Medical Image Analysis cs.CVPDF
Mohammed M. Abdelsamea, Daniel Tweneboah Anyimadu, Tasneem Selim, Saif Alzubi, Lei Zhang
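TL;DR: This survey systematically presents the Causal Transfer Learning (CTL) paradigm for medical image analysis, which targets the generalization failures caused by domain shift. CTL combines causal reasoning with cross-domain representation learning, using structural causal models, invariant risk minimization, and counterfactual reasoning to make clinical AI more robust and generalizable.
Details
Motivation: Medical imaging models often fail when deployed across hospitals, scanners, populations, or imaging protocols because of domain shift, which limits their clinical reliability. Conventional transfer learning and domain adaptation rely on spurious correlations that may change, whereas causal inference offers a principled way to identify invariant mechanisms that stay stable across environments.
Result: The paper surveys applications of causal transfer learning to classification, segmentation, reconstruction, anomaly detection, and multimodal imaging, summarizes the relevant datasets, benchmarks, and empirical gains, and highlights when and why causal transfer outperforms correlation-based domain adaptation.
Insight: The core contribution is to frame domain shift as a causal problem and to propose a unified taxonomy that connects causal frameworks with transfer mechanisms. By systematically integrating causal reasoning with representation learning, the survey offers a basis for building fair, robust, and trustworthy medical imaging AI for multi-institutional and federated deployment.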
Abstract: Medical imaging models frequently fail when deployed across hospitals, scanners, populations, or imaging protocols due to domain shift, limiting their clinical reliability. While transfer learning and domain adaptation address such shifts statistically, they often rely on spurious correlations that break under changing conditions. In contrast, causal inference provides a principled way to identify invariant mechanisms that remain stable across environments. This survey introduces and systematises Causal Transfer Learning (CTL) for medical image analysis. This paradigm integrates causal reasoning with cross-domain representation learning to enable robust and generalisable clinical AI. We frame domain shift as a causal problem and analyse how structural causal models, invariant risk minimisation, and counterfactual reasoning can be embedded within transfer learning pipelines. We review studies spanning classification, segmentation, reconstruction, anomaly detection, and multimodal imaging, and organise them by task, shift type, and causal assumption. A unified taxonomy is proposed that connects causal frameworks and transfer mechanisms. We further summarise datasets, benchmarks, and empirical gains, highlighting when and why causal transfer outperforms correlation-based domain adaptation. Finally, we discuss how CTL supports fairness, robustness, and trustworthy deployment in multi-institutional and federated settings, and outline open challenges and research directions for clinically reliable medical imaging AI.
[93] The Gait Signature of Frailty: Transfer Learning based Deep Gait Models for Scalable Frailty Assessment cs.CVPDF
Laura McDaniel, Basudha Pal, Crystal Szczesny, Yuxiang Guo, Ryan Roemmich
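TL;DR: This paper proposes a gait-based approach to frailty assessment: pretrained gait recognition models are adapted by transfer learning to a clinical gait dataset, enabling scalable, non-invasive assessment of frailty status.
Details
Motivation: Frailty assessment in clinical practice is subjective, heterogeneous, and hard to scale. The paper uses gait, a sensitive marker of biological aging, together with computer vision to make the assessment objective.
Result: Convolutional and hybrid attention-based architectures are evaluated on a publicly available clinical gait dataset. Selectively freezing low-level gait representations and combining complementary learning objectives improves classification under limited data and sharpens discrimination between frailty states.
Insight: The novelty is a selective freezing strategy that stabilizes transfer learning, combined with careful handling of class imbalance and complementary learning objectives. Viewed objectively, the work links biometric modeling with aging research and yields interpretable attention to gait regions (the lower limbs and pelvis) that agree with known biomechanical correlates of frailty.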
Abstract: Frailty is a condition in aging medicine characterized by diminished physiological reserve and increased vulnerability to stressors. However, frailty assessment remains subjective, heterogeneous, and difficult to scale in clinical practice. Gait is a sensitive marker of biological aging, capturing multisystem decline before overt disability. Yet the application of modern computer vision to gait-based frailty assessment has been limited by small, imbalanced datasets and a lack of clinically representative benchmarks. In this work, we introduce a publicly available silhouette-based frailty gait dataset collected in a clinically realistic setting, spanning the full frailty spectrum and including older adults who use walking aids. Using this dataset, we evaluate how pretrained gait recognition models can be adapted for frailty classification under limited data conditions. We study both convolutional and hybrid attention-based architectures and show that predictive performance depends primarily on how pretrained representations are transferred rather than architectural complexity alone. Across models, selectively freezing low-level gait representations while allowing higher-level features to adapt yields more stable and generalizable performance than either full fine-tuning or rigid freezing. Conservative handling of class imbalance further improves training stability, and combining complementary learning objectives enhances discrimination between clinically adjacent frailty states. Interpretability analyses reveal consistent model attention to lower-limb and pelvic regions, aligning with established biomechanical correlates of frailty. Together, these findings establish gait-based representation learning as a scalable, non-invasive, and interpretable framework for frailty assessment and support the integration of modern biometric modeling approaches into aging research and clinical practice.
[94] Unleashing Vision-Language Semantics for Deepfake Video Detection cs.CVPDF
Jiawen Zhu, Yunqi Miao, Xueyi Zhang, Jiankang Deng, Guansong Pang
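TL;DR: This paper proposes VLAForge, a framework that strengthens deepfake video detection by exploiting the rich cross-modal semantics in pretrained vision-language models such as CLIP. A ForgePerceiver module captures forgery cues both granularly and holistically while preserving the pretrained vision-language alignment knowledge, and an identity-aware VLA score serves as a complementary discriminative cue.
Details
Motivation: Existing deepfake video detection methods mainly use visual features and overlook the rich cross-modal semantics in pretrained vision-language models, which could substantially improve discriminability.
Result: On video deepfake detection benchmarks covering classical face-swapping forgeries and recent full-face generation forgeries, VLAForge substantially outperforms existing state-of-the-art methods at both frame and video levels.
Insight: The novelty is the systematic use of the cross-modal semantics of vision-language models for deepfake detection: the ForgePerceiver module and identity prior-informed text prompts sharpen the capture of identity-specific forgery cues, releasing the discriminative potential of vision-language alignment knowledge.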
Abstract: Recent Deepfake Video Detection (DFD) studies have demonstrated that pre-trained Vision-Language Models (VLMs) such as CLIP exhibit strong generalization capabilities in detecting artifacts across different identities. However, existing approaches focus on leveraging visual features only, overlooking their most distinctive strength – the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics to enhance model’s discriminability in deepfake detection. This work i) enhances the visual perception of VLM through a ForgePerceiver, which acts as an independent learner to capture diverse, subtle forgery cues both granularly and holistically, while preserving the pretrained Vision-Language Alignment (VLA) knowledge, and ii) provides a complementary discriminative cue – Identity-Aware VLA score, derived by coupling cross-modal semantics with the forgery cues learned by ForgePerceiver. Notably, the VLA score is augmented by an identity prior-informed text prompting to capture authenticity cues tailored to each identity, thereby enabling more discriminative cross-modal semantics. Comprehensive experiments on video DFD benchmarks, including classical face-swapping forgeries and recent full-face generation forgeries, demonstrate that our VLAForge substantially outperforms state-of-the-art methods at both frame and video levels. Code is available at https://github.com/mala-lab/VLAForge.
[95] OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning cs.CVPDF
Kaihang Pan, Qi Tian, Jianwei Zhang, Weijie Kong, Jiangfeng Xiong
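TL;DR: This paper proposes OmniWeaving, an open-source model aimed at unified video generation. It handles free-form combinations of text, multi-image, and video inputs and reasons about them to understand complex user intent. The paper also introduces IntelligentVBench, the first comprehensive benchmark for evaluating intelligent unified video generation. Experiments show that the model reaches state-of-the-art performance among open-source unified models.
Details
Motivation: Open-source video generation models lag well behind proprietary systems such as Seedance-2.0 in their ability to unify multiple tasks, and most academic models remain functionally fragmented. The paper aims to close this gap with a unified video generation framework that integrates diverse tasks seamlessly and has strong composition and reasoning capabilities.
Result: Extensive experiments show that OmniWeaving achieves state-of-the-art performance among open-source unified video generation models, evaluated on the newly proposed IntelligentVBench benchmark.
Insight: The core novelty is a unified framework learned from massive-scale pretraining data covering diverse compositional and reasoning-augmented scenarios. It temporally binds interleaved multimodal inputs and acts as an agent that infers user intent. In addition, IntelligentVBench provides the first rigorous standard for evaluating next-generation intelligent unified video generation.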
Abstract: While proprietary systems such as Seedance-2.0 have achieved remarkable success in omni-capable video generation, open-source alternatives significantly lag behind. Most academic models remain heavily fragmented, and the few existing efforts toward unified video generation still struggle to seamlessly integrate diverse tasks within a single framework. To bridge this gap, we propose OmniWeaving, an omni-level video generation model featuring powerful multimodal composition and reasoning-informed capabilities. By leveraging a massive-scale pretraining dataset that encompasses diverse compositional and reasoning-augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi-image, and video inputs while acting as an intelligent agent to infer complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next-level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves SoTA performance among open-source unified models. The codes and model will be made publicly available soon. Project Page: https://omniweaving.github.io.
[96] Positive-First Most Ambiguous: A Simple Active Learning Criterion for Interactive Retrieval of Rare Categories cs.CV | cs.HC | cs.IRPDF
Kawtar Zaher, Olivier Buisson, Alexis Joly
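TL;DR: This paper proposes an active learning criterion called Positive-First Most Ambiguous (PF-MA) for interactive fine-grained visual retrieval in highly imbalanced, low-budget, low-latency settings. By prioritizing samples that lie near the decision boundary and are more likely to be positive, it improves both the efficiency of discovering rare categories in long-tailed datasets and retrieval performance.
Details
Motivation: Real-world fine-grained visual retrieval, for example in biodiversity monitoring, often requires discovering a rare concept in large unlabeled collections, which creates a highly imbalanced binary classification problem. Conventional active learning assumes symmetric class priors and ample annotation budgets, so it is of limited use in such imbalanced, low-budget interactive retrieval.
Result: Experiments on long-tailed datasets, including fine-grained botanical data, show that PF-MA consistently outperforms strong baselines in both class coverage and classifier performance, and that it is robust across class sizes and descriptors.
Insight: The core novelty is PF-MA, a simple and effective active learning criterion that explicitly handles the asymmetry of class imbalance by prioritizing near-boundary samples that are likely positive, which enables rapid discovery of visually subtle categories. The paper also proposes a class coverage metric that measures how well the selected positives span the visual variability of the target class, giving a better view of retrieval diversity. Aligning active learning with the asymmetric, user-centric objectives of interactive fine-grained retrieval yields a simple yet powerful solution for retrieving rare categories in realistic human-in-the-loop settings.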
Abstract: Real-world fine-grained visual retrieval often requires discovering a rare concept from large unlabeled collections with minimal supervision. This is especially critical in biodiversity monitoring, ecological studies, and long-tailed visual domains, where the target may represent only a tiny fraction of the data, creating highly imbalanced binary problems. Interactive retrieval with relevance feedback offers a practical solution: starting from a small query, the system selects candidates for binary user annotation and iteratively refines a lightweight classifier. While Active Learning (AL) is commonly used to guide selection, conventional AL assumes symmetric class priors and large annotation budgets, limiting effectiveness in imbalanced, low-budget, low-latency settings. We introduce Positive-First Most Ambiguous (PF-MA), a simple yet effective AL criterion that explicitly addresses the class imbalance asymmetry: it prioritizes near-boundary samples while favoring likely positives, enabling rapid discovery of subtle visual categories while maintaining informativeness. Unlike standard methods that oversample negatives, PF-MA consistently returns small batches with a high proportion of relevant samples, improving early retrieval and user satisfaction. To capture retrieval diversity, we also propose a class coverage metric that measures how well selected positives span the visual variability of the target class. Experiments on long-tailed datasets, including fine-grained botanical data, demonstrate that PF-MA consistently outperforms strong baselines in both coverage and classifier performance, across varying class sizes and descriptors. Our results highlight that aligning AL with the asymmetric and user-centric objectives of interactive fine-grained retrieval enables simple yet powerful solutions for retrieving rare and visually subtle categories in realistic human-in-the-loop settings.
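The selection rule described in the abstract ("prioritizes near-boundary samples while favoring likely positives") can be read in several ways. The sketch below shows one plausible reading, not the paper's exact scoring rule: candidates whose predicted relevance is at least 0.5 are ranked first, ordered by how close they are to the 0.5 boundary, and predicted negatives are used only to fill any remaining slots. The function name, threshold, and scores are hypothetical.

```python
def pf_ma_select(scores, batch_size, boundary=0.5):
    """Pick a batch for annotation under an assumed positive-first rule.

    scores maps an item id to the classifier's predicted probability that
    the item is relevant. Likely positives come first, most ambiguous
    first; likely negatives follow, also most ambiguous first.
    """
    positives = sorted((abs(p - boundary), item) for item, p in scores.items() if p >= boundary)
    negatives = sorted((abs(p - boundary), item) for item, p in scores.items() if p < boundary)
    ranked = positives + negatives
    return [item for _, item in ranked[:batch_size]]

def most_ambiguous_select(scores, batch_size, boundary=0.5):
    """Plain uncertainty sampling, shown for contrast."""
    ranked = sorted((abs(p - boundary), item) for item, p in scores.items())
    return [item for _, item in ranked[:batch_size]]

# Hypothetical relevance scores for five unlabeled images.
scores = {"a": 0.95, "b": 0.55, "c": 0.48, "d": 0.10, "e": 0.62}
print("positive-first:", pf_ma_select(scores, 3))
print("most ambiguous:", most_ambiguous_select(scores, 3))
```

In this toy case plain uncertainty sampling spends one of its three slots on a predicted negative, whereas the positive-first rule returns only predicted positives, which matches the abstract's claim about batches with a high proportion of relevant samples.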
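[97] Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models cs.CV
Siqi Liu, Xinyang Li, Bochao Zou, Junbao Zhuo, Huimin Ma
TL;DR: This paper proposes VisionToM, a framework that strengthens the Theory of Mind (ToM) abilities of multimodal large language models (MLLMs) through vision-oriented intervention. It computes intervention vectors that align visual representations with semantic targets, steering the model’s attention and reducing reliance on spurious linguistic priors, which improves performance on video-based ToM tasks.
Details
Motivation: Existing ToM evaluations focus mainly on text inputs, while scenarios that rely solely on visual information are understudied. Current methods also mostly treat the model as a black box and rarely probe how its internal attention behaves in multiple-choice QA, and little work studies the impact of LLM hallucinations from an interpretability perspective.
Result: Experiments on the EgoToM benchmark (an egocentric, real-world video ToM dataset with three multiple-choice QA settings) show that the method substantially improves the ToM abilities of MLLMs. On an additional open-ended generation task, VisionToM enables MLLMs to produce free-form explanations that capture agents’ mental states more accurately.
Insight: The novelty lies in a vision-oriented intervention framework that aligns vision and semantics through intervention vectors, steers model attention in an interpretable way, and reduces reliance on linguistic priors, thereby strengthening multimodal ToM reasoning. More broadly, the method offers a new approach to intervening on attention mechanisms in multimodal understanding and may help improve model alignment in real-world interaction.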
Abstract: As large language models (LLMs) continue to advance, there is increasing interest in their ability to infer human mental states and demonstrate a human-like Theory of Mind (ToM). Most existing ToM evaluations, however, are centered on text-based inputs, while scenarios relying solely on visual information receive far less attention. This leaves a gap, since real-world human-AI interaction typically requires multimodal understanding. In addition, many current methods regard the model as a black box and rarely probe how its internal attention behaves in multiple-choice question answering (QA). The impact of LLM hallucinations on such tasks is also underexplored from an interpretability perspective. To address these issues, we introduce VisionToM, a vision-oriented intervention framework designed to strengthen task-aware reasoning. The core idea is to compute intervention vectors that align visual representations with the correct semantic targets, thereby steering the model’s attention through different layers of visual features. This guidance reduces the model’s reliance on spurious linguistic priors, leading to more reliable multimodal large language model (MLLM) outputs and better QA performance. Experiments on the EgoToM benchmark (an egocentric, real-world video dataset for ToM with three multiple-choice QA settings) demonstrate that our method substantially improves the ToM abilities of MLLMs. Furthermore, results on an additional open-ended generation task show that VisionToM enables MLLMs to produce free-form explanations that more accurately capture agents’ mental states, pushing machine-human collaboration toward greater alignment.
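[98] Toward Physically Consistent Driving Video World Models under Challenging Trajectories cs.CV
Jiawei Zhou, Zhenxin Zhu, Lingyi Du, Linye Lyu, Lijun Zhou
TL;DR: This paper proposes PhyGenesis, a world model designed to generate driving videos with high visual fidelity and strong physical consistency. Through two key components, a physical condition generator and a physics-enhanced video generator, it addresses the physically inconsistent videos that existing models produce when conditioned on challenging or counterfactual trajectories.
Details
Motivation: Existing video generation models are trained mainly on real-world driving datasets, which mostly contain natural and safe driving scenarios. As a result, when conditioned on challenging or counterfactual trajectories generated by simulators or planning systems, they produce videos with severe physical inconsistencies and artifacts.
Result: Extensive experiments show that PhyGenesis consistently outperforms state-of-the-art methods on challenging trajectories, particularly in physical consistency.
Insight: The novelty lies in a framework comprising a physical condition generator and a physics-enhanced video generator, together with a large-scale, physics-rich heterogeneous dataset (combining real videos with challenging scenarios generated in the CARLA simulator) for training. A challenging-trajectory learning strategy enables trajectory correction and physically consistent video generation.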
Abstract: Video generation models have shown strong potential as world models for autonomous driving simulation. However, existing approaches are primarily trained on real-world driving datasets, which mostly contain natural and safe driving scenarios. As a result, current models often fail when conditioned on challenging or counterfactual trajectories (such as imperfect trajectories generated by simulators or planning systems), producing videos with severe physical inconsistencies and artifacts. To address this limitation, we propose PhyGenesis, a world model designed to generate driving videos with high visual fidelity and strong physical consistency. Our framework consists of two key components: (1) a physical condition generator that transforms potentially invalid trajectory inputs into physically plausible conditions, and (2) a physics-enhanced video generator that produces high-fidelity multi-view driving videos under these conditions. To effectively train these components, we construct a large-scale, physics-rich heterogeneous dataset. Specifically, in addition to real-world driving videos, we generate diverse challenging driving scenarios using the CARLA simulator, from which we derive supervision signals that guide the model to learn physically grounded dynamics under extreme conditions. This challenging-trajectory learning strategy enables trajectory correction and promotes physically consistent video generation. Extensive experiments demonstrate that PhyGenesis consistently outperforms state-of-the-art methods, especially on challenging trajectories. Our project page is available at: https://wm-research.github.io/PhyGenesis/.
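[99] Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification cs.CV
Dipam Goswami, Simone Magistri, Gido M. van de Ven, Bartłomiej Twardowski, Andrew D. Bagdanov
TL;DR: This paper proposes a training-free few-shot image classification method that mixes text and image prototypes and introduces a cross-modal prototype alignment and mixing strategy to improve the classification performance of CLIP.
Details
Motivation: The aim is to exploit the information in both the image and text embeddings of CLIP, using mixed prototypes to reduce noise and improve few-shot classification, while also addressing poor cross-modal alignment on some downstream datasets.
Result: Across multiple few-shot classification benchmarks, combining a text-aligned mixed prototype classifier with an image-specific LDA classifier outperforms existing methods and reaches state-of-the-art performance.
Insight: The novelties are projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned subspace, and exploiting the image subspace by modeling anisotropy with class covariances, which makes classification more robust.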
Abstract: Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the training set are an important source of information. In this work we investigate the impact of directly mixing image and text prototypes for few-shot classification and analyze this from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add some noise in the form of instance-specific background or context information. In order to capture only information from the image space relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace. These text-aligned image prototypes, when mixed with text embeddings, further improve classification. However, for downstream datasets with poor cross-modal alignment in CLIP, semantic alignment might be suboptimal. We show that the image subspace can still be leveraged by modeling the anisotropy using class covariances. We demonstrate that combining a text-aligned mixed prototype classifier and an image-specific LDA classifier outperforms existing methods across few-shot classification benchmarks.
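As a rough illustration of the prototype mixing step described above, the sketch below forms a convex combination of a class text embedding and the mean of its few-shot image embeddings, then classifies a query by cosine similarity. It covers only the plain mixing idea; the text-aligned projection and the LDA classifier are not shown. The weight `alpha`, the function names, and the toy 2-D embeddings are assumptions made for illustration, not the paper's settings.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mix_prototypes(text_proto, image_embs, alpha):
    """Convex mix of a text prototype and the mean few-shot image embedding.

    alpha = 1 uses the text prototype only; alpha = 0 uses the image mean only.
    With few shots the image mean is noisy, so the text prototype can be read
    as the target toward which that mean is shrunk.
    """
    dim = len(text_proto)
    image_mean = [sum(e[d] for e in image_embs) / len(image_embs) for d in range(dim)]
    t, m = normalize(text_proto), normalize(image_mean)
    return normalize([alpha * a + (1 - alpha) * b for a, b in zip(t, m)])


def classify(query, prototypes):
    """Return the index of the prototype with the highest cosine similarity."""
    q = normalize(query)
    sims = [sum(a * b for a, b in zip(q, p)) for p in prototypes]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy 2-D example with two classes (hypothetical values).
prototypes = [
    mix_prototypes([1.0, 0.0], [[0.9, 0.1], [0.8, 0.3]], alpha=0.5),
    mix_prototypes([0.0, 1.0], [[0.2, 0.9]], alpha=0.5),
]
predicted = classify([0.7, 0.4], prototypes)
```

In practice `alpha` would be chosen per dataset or per shot count; a larger value leans on the text prior when few image examples are available.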
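[100] CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition cs.CV | cs.AI
Florian Stilz, Vinkle Srivastav, Nassir Navab, Nicolas Padoy
TL;DR: This paper proposes CliPPER, a video-language pretraining framework designed for event recognition in long-form surgical videos. It introduces new pretraining objectives (contextual video-text contrastive learning, clip order prediction, cycle-consistency alignment, and frame-text matching) to improve fine-grained temporal understanding and multimodal alignment in long videos.
Details
Motivation: Labeled data is scarce in the surgical video domain, and downstream tasks such as event recognition require precise temporal understanding. The paper pretrains on surgical lecture videos to build an effective video-language foundation model.
Result: The model sets a new state of the art (SOTA) on multiple public surgical benchmarks, including zero-shot recognition of phases, steps, instruments, and triplets.
Insight: The novelty lies in several pretraining objectives that exploit temporal and contextual dependencies (such as VTC_CTX and COP), plus cycle-consistency alignment and a finer-grained frame-text matching loss, which strengthen local understanding of long videos and the consistency of multimodal representations.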
Abstract: Video-language foundation models have proven to be highly effective in zero-shot applications across a wide range of tasks. A particularly challenging area is the intraoperative surgical procedure domain, where labeled data is scarce, and precise temporal understanding is often required for complex downstream tasks. To address this challenge, we introduce CliPPER (Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition), a novel video-language pretraining framework trained on surgical lecture videos. Our method is designed for fine-grained temporal video-text recognition and introduces several novel pretraining strategies to improve multimodal alignment in long-form surgical videos. Specifically, we propose Contextual Video-Text Contrastive Learning (VTC_CTX) and Clip Order Prediction (COP) pretraining objectives, both of which leverage temporal and contextual dependencies to enhance local video understanding. In addition, we incorporate a Cycle-Consistency Alignment over video-text matches within the same surgical video to enforce bidirectional consistency and improve overall representation coherence. Moreover, we introduce a more refined alignment loss, Frame-Text Matching (FTM), to improve the alignment between video frames and text. As a result, our model establishes a new state-of-the-art across multiple public surgical benchmarks, including zero-shot recognition of phases, steps, instruments, and triplets. The source code and pretraining captions can be found at https://github.com/CAMMA-public/CliPPER.
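[101] SEGAR: Selective Enhancement for Generative Augmented Reality cs.CV | cs.AI
Fanjun Bu, Chenyang Yuan, Hiroshi Yasuda
TL;DR: SEGAR is a generative world model framework for augmented reality (AR) that combines a diffusion-based prediction model with a selective correction stage. It generates and caches future frame sequences containing visual edits, avoiding real-time per-frame rendering. The framework is demonstrated in driving scenarios, where it edits specific regions while leaving others unchanged and uses correction to keep safety-critical regions aligned with real-world observations.
Details
Motivation: The paper addresses temporal consistency and real-time constraints when generating future frames in AR: predictive generation and caching avoid the cost of real-time rendering, while correction keeps safety-critical regions accurate.
Result: The framework is demonstrated in driving scenarios, a representative setting where semantic region structure is well defined and real-world feedback is readily available. It generates augmented future frames and selectively corrects them, but the abstract reports no quantitative results or benchmarks.
Insight: The novelty is combining a diffusion world model with a selective correction stage, which allows region-specific edits while preserving other regions and aligns safety-critical regions through correction. This is an early step toward generative world models as practical AR infrastructure, and the ideas of predictive caching and selective enhancement are reusable.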
Abstract: Generative world models offer a compelling foundation for augmented-reality (AR) applications: by predicting future image sequences that incorporate deliberate visual edits, they enable temporally coherent, augmented future frames that can be computed ahead of time and cached, avoiding per-frame rendering from scratch in real time. In this work, we present SEGAR, a preliminary framework that combines a diffusion-based world model with a selective correction stage to support this vision. The world model generates augmented future frames with region-specific edits while preserving others, and the correction stage subsequently aligns safety-critical regions with real-world observations while preserving intended augmentations elsewhere. We demonstrate this pipeline in driving scenarios as a representative setting where semantic region structure is well defined and real-world feedback is readily available. We view this as an early step toward generative world models as practical AR infrastructure, where future frames can be generated, cached, and selectively corrected on demand.
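[102] The role of spatial context and multitask learning in the detection of organic and conventional farming systems based on Sentinel-2 time series cs.CV
Jan Hemmerling, Marcel Schwieder, Philippe Rufin, Leon-Friedrich Thomas, Mirela Tulbure
TL;DR: This study presents a Vision Transformer approach for distinguishing organic from conventional farming systems using intra-annual Sentinel-2 time series, and examines how multitask learning (jointly learning crop type) and spatial context (varied via patch size) affect classification performance.
Details
Motivation: Organic farming is key to sustainable agriculture, but comprehensive, spatially explicit information about it is lacking. The study aims to distinguish organic from conventional farming systems automatically from remote sensing data and to analyze the roles of multitask learning and spatial context.
Result: Distinguishing organic from conventional farming systems with multispectral remote sensing data is feasible, but performance varies by crop type. F1 scores of 0.8 or higher are reached for winter rye, winter wheat, and winter oat, whereas for permanent grassland, orchards, grapevines, and hops the F1 score of the organic management class is 0.4 or lower, so these cannot be reliably distinguished. Multitask learning brings only limited additional benefit, while wider spatial context improves both classification tasks.
Insight: The contributions include applying a Vision Transformer (the TSViT architecture) to farming system classification and systematically evaluating the effects of multitask learning and spatial context. The analysis suggests that effective use of spatial context is a key direction for improving remote sensing agricultural classification, whereas multitask learning has little effect in this task and needs scenario-specific tuning.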
Abstract: Organic farming is a key element in achieving more sustainable agriculture. For a better understanding of the development and impact of organic farming, comprehensive, spatially explicit information is needed. This study presents an approach for the discrimination of organic and conventional farming systems using intra-annual Sentinel-2 time series. In addition, it examines two factors influencing this discrimination: the joint learning of crop type information in a concurrent task and the role of spatial context. A Vision Transformer model based on the Temporo-Spatial Vision Transformer (TSViT) architecture was used to construct a classification model for the two farming systems. The model was extended for simultaneous learning of the crop type, creating a multitask learning setting. By varying the patch size presented to the model, we tested the influence of spatial context on the classification accuracy of both tasks. We show that discrimination between organic and conventional farming systems using multispectral remote sensing data is feasible. However, classification performance varies substantially across crop types. For several crops, such as winter rye, winter wheat, and winter oat, F1 scores of 0.8 or higher can be achieved. In contrast, other agricultural land use classes, such as permanent grassland, orchards, grapevines, and hops, cannot be reliably distinguished, with F1 scores for the organic management class of 0.4 or lower. Joint learning of farming system and crop type provides only limited additional benefits over single-task learning. In contrast, incorporating wider spatial context improves the performance of both farming system and crop type classification. Overall, we demonstrate that a classification of agricultural farming systems is possible in a diverse agricultural region using multispectral remote sensing data.
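[103] LensWalk: Agentic Video Understanding by Planning How You See in Videos cs.CV | cs.AI
Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang
TL;DR: LensWalk is an agentic video understanding framework that addresses the disconnect between reasoning and perception in video analysis by letting a large language model (LLM) reasoner actively control its own visual observation. It establishes a tight reason-plan-observe loop in which the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes, enabling progressive, on-demand evidence gathering.
Details
Motivation: Existing video understanding methods rely on static, pre-processed video information, which disconnects reasoning from perception. The goal is to let the model actively seek evidence from the raw video as its understanding evolves, as a human would.
Result: Without any model fine-tuning, LensWalk works as a plug-and-play module and delivers substantial gains across multiple model recipes on challenging long-video benchmarks such as LVBench and Video-MME, raising accuracy by more than 5%.
Insight: The core novelty is giving the agent active control over how it sees (when and at what density it looks at the video). A suite of parameterized vision-language model tools supports flexible operations, from broad scans and focused extraction to integrating evidence from multiple moments, which yields more accurate, robust, and interpretable video reasoning.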
Abstract: The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to control its own visual observation actively. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent’s evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.
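The idea of an observation specified by temporal scope and sampling density can be sketched as a small planning helper. This is only an illustration of what such a specification might look like; the function name `plan_observation`, its parameters, and the even-spacing rule are hypothetical and are not taken from the paper.

```python
def plan_observation(start_s, end_s, num_frames):
    """Return evenly spaced timestamps (in seconds) inside [start_s, end_s].

    start_s, end_s: temporal scope chosen by the reasoner for this step.
    num_frames: sampling density, i.e. how many frames to look at in that scope.
    """
    if num_frames < 1 or end_s < start_s:
        raise ValueError("need num_frames >= 1 and end_s >= start_s")
    if num_frames == 1:
        return [(start_s + end_s) / 2]
    step = (end_s - start_s) / (num_frames - 1)
    return [start_s + i * step for i in range(num_frames)]


# A broad, sparse scan of a 10-minute video, then a dense look at one segment.
broad_scan = plan_observation(0.0, 600.0, 5)
focused_look = plan_observation(120.0, 130.0, 5)
```

A reasoner in such a loop would alternate calls like these with tool invocations on the sampled frames, narrowing the scope as evidence accumulates.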
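[104] POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan cs.CV
Marta Moscati, Muhammad Saad Saeed, Marina Zanoni, Mubashir Noman, Rohan Kumar Das
TL;DR: This paper presents the evaluation plan of the POLY-SIM Grand Challenge 2026, which aims to advance research on multimodal speaker identification under missing-modality and cross-lingual conditions. The challenge provides a standardized benchmark and evaluation framework, including the dataset, task definition, evaluation protocol, and baseline model, to encourage robust methods that make effective use of incomplete multimodal inputs while maintaining strong performance across languages.
Details
Motivation: Real-world multimodal speaker identification systems face missing visual information (due to occlusions, camera failures, or privacy constraints) and linguistic variability introduced by multilingual speakers, both of which affect robustness and generalization.
Result: The paper reports no model performance results. Instead it describes the design and organization of the challenge, including the standardized benchmark and evaluation framework intended to advance research in this area.
Insight: The novelty is a grand challenge focused on missing-modality and cross-lingual conditions, which gives a clear research direction and a standardized evaluation environment for building more robust and practical multimodal speaker identification systems, and emphasizes handling incomplete real-world data and linguistic diversity.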
Abstract: Multimodal speaker identification systems typically assume the availability of complete and homogeneous audio-visual modalities during both training and testing. However, in real-world applications, such assumptions often do not hold. Visual information may be missing due to occlusions, camera failures, or privacy constraints, while multilingual speakers introduce additional complexity due to linguistic variability across languages. These challenges significantly affect the robustness and generalization of multimodal speaker identification systems. The POLY-SIM Grand Challenge 2026 aims to advance research in multimodal speaker identification under missing-modality and cross-lingual conditions. Specifically, the Grand Challenge encourages the development of robust methods that can effectively leverage incomplete multimodal inputs while maintaining strong performance across different languages. This report presents the design and organization of the POLY-SIM Grand Challenge 2026, including the dataset, task formulation, evaluation protocol, and baseline model. By providing a standardized benchmark and evaluation framework, the challenge aims to foster progress toward more robust and practical multimodal speaker identification systems.
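[105] Anti-I2V: Safeguarding your photos from malicious image-to-video generation cs.CV | cs.AI
Duc Vu, Anh Nguyen, Chi Tran, Anh Tran
TL;DR: This paper proposes Anti-I2V, a novel defense that protects personal photos from misuse by diffusion-based malicious image-to-video generation. It introduces adversarial perturbations in the L*a*b* color space and the frequency domain, and designs training objectives targeting the network layers that capture key semantic features during denoising, in order to degrade the temporal coherence and fidelity of generated videos.
Details
Motivation: As diffusion models advance in video generation, the risk of generating fake videos from personal photos increases. Existing adversarial attack methods mainly target image generation models and are mostly based on UNet architectures, so their effectiveness against Diffusion Transformer models remains largely unexplored. A defense that works across diverse diffusion backbones against malicious human video generation is therefore needed.
Result: In extensive validation, Anti-I2V achieves state-of-the-art defense performance against diverse video diffusion models, effectively degrading the quality and coherence of generated videos.
Insight: The novelty lies in extending adversarial perturbations to the L*a*b* color space and frequency domain for robustness, and in identifying the key semantic layers in the denoising process to design training objectives that maximally degrade temporal coherence and generation fidelity, which offers a new approach to cross-architecture defense.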
Abstract: Advances in diffusion-based video generation models, while significantly improving human animation, pose threats of misuse through the creation of fake videos from a specific person’s photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation, while relatively few explicitly address image-to-video diffusion models (VDMs), and most primarily focus on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention and stronger temporal consistency due to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the $L^*a^*b^*$ and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process to design appropriate training objectives that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.
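[106] VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models cs.CV | cs.AI
Qijia He, Xunmei Liu, Hammaad Memon, Ziang Li, Zixian Ma
TL;DR: VFIG is a vision-language-model-based method for vectorizing complex figures, converting raster images (such as PNG or JPEG) into editable SVG. It builds a large-scale dataset, VFIG-DATA (66K high-quality figure-SVG pairs), and uses a progressive training strategy that moves from supervised fine-tuning to reinforcement learning to optimize figure fidelity and structural consistency. On the VFIG-BENCH evaluation suite, VFIG achieves the best performance among open-source models and performs on par with GPT-5.2.
Details
Motivation: When the original vector source files are lost or inaccessible, the remaining raster images are hard to modify or scale. The paper automates reconstruction of SVG representations of complex figures, reducing the specialist effort of manual reconstruction.
Result: On VFIG-BENCH, VFIG achieves SOTA performance among open-source models and performs on par with GPT-5.2, with a VLM-Judge score of 0.829.
Insight: The novelties include a large-scale dataset of complex figures, a progressive training curriculum that moves from learning atomic primitives to global optimization, and evaluation metrics focused on structural integrity, which together provide a data-driven, structure-aware solution to figure vectorization.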
Abstract: Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only “flat” rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is a prohibitively labor-intensive process, requiring specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for complex and high-fidelity figure-to-SVG conversion. While this task is inherently data-driven, existing datasets are typically small-scale and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, achieving a VLM-Judge score of 0.829 on VFIG-BENCH.
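[107] Vision-Language Models vs Human: Perceptual Image Quality Assessment cs.CV | eess.IV
Imran Mehmood, Imad Ali Shah, Ming Ronnier Luo, Brian Deegan
TL;DR: This paper systematically evaluates how well six vision-language models (VLMs) emulate human judgments in perceptual image quality assessment (IQA), focusing on three quality dimensions (contrast, colorfulness, and overall preference) and comparing against psychophysical data.
Details
Motivation: Psychophysical experiments are reliable but costly and hard to scale, so the paper investigates whether VLMs can automatically approximate human perceptual judgments of image quality.
Result: Benchmarked against psychophysical data, VLMs show strong attribute dependence: models with high human alignment on colorfulness (ρ up to 0.93) underperform on contrast, and vice versa. When evaluating overall preference, most VLMs weight colorfulness more heavily than contrast, similar to humans.
Insight: The novelty is a systematic account of the attribute-specific relationship between VLMs and human judgments in perceptual IQA. It finds a counterintuitive trade-off between model self-consistency and human alignment, and that human-VLM agreement increases with perceptual separability, offering new insight into the reliability of VLMs for quality assessment.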
Abstract: Psychophysical experiments remain the most reliable approach for perceptual image quality assessment (IQA), yet their cost and limited scalability encourage automated approaches. We investigate whether Vision Language Models (VLMs) can approximate human perceptual judgments across three image quality scales: contrast, colorfulness and overall preference. Six VLMs (four proprietary and two open-weight models) are benchmarked against psychophysical data. This work presents a systematic benchmark of VLMs for perceptual IQA through comparison with human psychophysical data. The results reveal strong attribute-dependent variability: models with high human alignment for colorfulness (ρ up to 0.93) underperform on contrast, and vice versa. Attribute weighting analysis further shows that most VLMs assign higher weights to colorfulness than to contrast when evaluating overall preference, similar to the psychophysical data. Intramodel consistency analysis reveals a counterintuitive tradeoff: the most self-consistent models are not necessarily the most human-aligned, suggesting that response variability reflects sensitivity to scene-dependent perceptual cues. Furthermore, human-VLM agreement increases with perceptual separability, indicating that VLMs are more reliable when stimulus differences are clearly expressed.
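[108] TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models cs.CV | cs.RO
Jiaying Zhou, Zhihao Zhan, Ruifeng Zhai, Qinhan Lyu, Hao Liu
TL;DR: This paper proposes TAG (Target-Agnostic Guidance), an inference-time guidance mechanism that addresses the reduced reliability of vision-language-action (VLA) policies in cluttered scenes caused by instance-level grounding failures. TAG contrasts the policy predictions under the original observation and an object-erased observation and uses their difference as a residual steering signal that strengthens the influence of object evidence in the decision process, reducing distractor- and appearance-induced bias.
Details
Motivation: VLA policies have made strong progress in mapping language instructions and visual observations to robot actions, but their reliability degrades in cluttered scenes with distractors. Analysis shows that many errors arise not from infeasible motions but from instance-level grounding failures: the policy produces a grasp trajectory that lands slightly off-target or even on the wrong object instance.
Result: Evaluated on standard manipulation benchmarks, including LIBERO, LIBERO-Plus, and VLABench, TAG consistently improves robustness under clutter and reduces near-miss and wrong-object executions.
Insight: Inspired by classifier-free guidance (CFG), TAG is an inference-time guidance mechanism that requires no change to the policy architecture and increases the weight of object evidence by contrasting predictions under different observations. It offers a simple, easily integrated way to reduce bias in VLA policies and has practical value.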
Abstract: Vision–Language–Action (VLA) policies have shown strong progress in mapping language instructions and visual observations to robotic actions, yet their reliability degrades in cluttered scenes with distractors. By analyzing failure cases, we find that many errors do not arise from infeasible motions, but from instance-level grounding failures: the policy often produces a plausible grasp trajectory that lands slightly off-target or even on the wrong object instance. To address this issue, we propose TAG (Target-Agnostic Guidance), a simple inference-time guidance mechanism that explicitly reduces distractor- and appearance-induced bias in VLA policies. Inspired by classifier-free guidance (CFG), TAG contrasts policy predictions under the original observation and an object-erased observation, and uses their difference as a residual steering signal that strengthens the influence of object evidence in the decision process. TAG does not require modifying the policy architecture and can be integrated with existing VLA policies with minimal training and inference changes. We evaluate TAG on standard manipulation benchmarks, including LIBERO, LIBERO-Plus, and VLABench, where it consistently improves robustness under clutter and reduces near-miss and wrong-object executions.
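The contrast between the two predictions can be written down in a few lines. The sketch below assumes the guidance takes the same form as classifier-free guidance, i.e. the guided action is the original prediction plus a scaled difference between the original and object-erased predictions. The abstract does not state the exact formula, so this form, the name `tag_guided_action`, and the `scale` parameter are assumptions.

```python
def tag_guided_action(pred_original, pred_erased, scale):
    """CFG-style residual guidance over two action predictions.

    pred_original: policy output given the unmodified observation.
    pred_erased:   policy output given the object-erased observation.
    scale:         guidance strength (hypothetical); 0 returns pred_original.
    """
    return [a + scale * (a - b) for a, b in zip(pred_original, pred_erased)]


# Toy 2-D action: only the first dimension depends on the erased object.
guided = tag_guided_action([1.0, 2.0], [0.5, 2.0], scale=2.0)
```

Dimensions on which the two predictions agree are left unchanged, while dimensions that depend on the erased object are pushed further in the direction that the object evidence suggests.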
cs.IR
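[109] OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework cs.IR | cs.AI | cs.CL
Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang
TL;DR: OneSearch-V2 is an enhanced generative search framework that uses latent reasoning and self-distillation to address inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to historical preferences. It contains three key modules: thought-augmented complex query understanding, a reasoning-internalized self-distillation training pipeline, and a behavior preference alignment optimization system, improving business metrics as well as search experience quality.
Details
Motivation: The existing generative retrieval framework OneSearch is limited in complex query understanding, latent user intent mining, and overfitting to historical preferences; the paper addresses these limitations to further improve search system performance.
Result: Offline evaluations show strong query recognition and user profiling capabilities. Online A/B tests yield +3.98% item CTR, +3.05% buyer conversion rate, and +2.11% order volume. Manual evaluation confirms +1.65% in page good rate and +1.37% in query-item relevance, with no additional inference cost or serving latency.
Insight: The novelties include a thought-augmented deep query understanding module that overcomes shallow semantic matching, self-distillation via implicit in-context learning to uncover precise user intents, and a behavior preference alignment system that mitigates reward hacking from a single conversion metric and uses direct user feedback. Internalizing the reasoning process and combining self-distillation with alignment optimization is an effective architectural design for improving the robustness and effectiveness of generative search.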
Abstract: Generative Retrieval (GR) has emerged as a promising paradigm for modern search systems. Compared to multi-stage cascaded architectures, it offers advantages such as end-to-end joint optimization and high computational efficiency. OneSearch, as a representative industrial-scale deployed generative search framework, has brought significant commercial and operational benefits. However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement. To address these challenges, we propose OneSearch-V2, a latent reasoning enhanced self-distillation generative search framework. It contains three key innovations: (1) a thought-augmented complex query understanding module, which enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference; (2) a reasoning-internalized self-distillation training pipeline, which uncovers users’ potential yet precise e-commerce intentions beyond log-fitting through implicit in-context learning; (3) a behavior preference alignment optimization system, which mitigates reward hacking arising from the single conversion metric, and addresses personal preference via direct user feedback. Extensive offline evaluations demonstrate OneSearch-V2’s strong query recognition and user profiling capabilities. Online A/B tests further validate its business effectiveness, yielding +3.98% item CTR, +3.05% buyer conversion rate, and +2.11% order volume. Manual evaluation further confirms gains in search experience quality, with +1.65% in page good rate and +1.37% in query-item relevance. More importantly, OneSearch-V2 effectively mitigates common search system issues such as information bubbles and long-tail sparsity, without incurring additional inference costs or serving latency.
cs.AI
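[110] PLDR-LLMs Reason At Self-Organized Criticality cs.AI | cs.CL | cs.LG | nlin.AO
Burc Gokden
TL;DR: The paper shows that PLDR-LLMs pretrained at self-organized criticality exhibit reasoning at inference time. At criticality, the model’s deductive outputs show characteristics similar to second-order phase transitions, such as a diverging correlation length and a metastable steady state. The steady-state behavior suggests that the deductive outputs learn representations equivalent to scaling functions, universality classes, and renormalization groups from the training data, leading to generalization and reasoning capabilities. Defining an order parameter from the global statistics of the model’s deductive outputs, the paper finds that reasoning is stronger when the order parameter is close to zero, a conclusion supported by benchmark results for models trained at near-criticality and sub-criticality.
Details
Motivation: The paper seeks to explain how reasoning emerges in large language models, in particular how self-organized criticality can be used to understand and quantify reasoning without relying on conventional benchmark evaluation through inductive outputs.
Result: Models trained at near-criticality score better on benchmarks, supporting the observation that reasoning is stronger when the order parameter is close to zero; the benchmark names and whether SOTA is reached are not stated.
Insight: The novelty is applying self-organized criticality theory to the analysis of reasoning in LLMs and quantifying reasoning through global parameters of the deductive outputs (such as the order parameter), which gives a new perspective on the relationship between internal representations (such as scaling functions and renormalization groups) and generalization.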
Abstract: We show that PLDR-LLMs pretrained at self-organized criticality exhibit reasoning at inference time. The characteristics of PLDR-LLM deductive outputs at criticality are similar to second-order phase transitions. At criticality, the correlation length diverges, and the deductive outputs attain a metastable steady state. The steady state behaviour suggests that deductive outputs learn representations equivalent to scaling functions, universality classes and renormalization groups from the training dataset, leading to generalization and reasoning capabilities in the process. We can then define an order parameter from the global statistics of the model’s deductive output parameters at inference. The reasoning capabilities of a PLDR-LLM are better when its order parameter is close to zero at criticality. This observation is supported by the benchmark scores of the models trained at near-criticality and sub-criticality. Our results provide a self-contained explanation of how reasoning manifests in large language models, and the ability to reason can be quantified solely from global model parameter values of the deductive outputs at steady state, without any need for evaluation of curated benchmark datasets through inductive output for reasoning and comprehension.
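[111] Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA cs.AI | cs.CL | cs.LG
John Ray B. Martinez
TL;DR: This paper proposes a multi-agent reasoning framework that combines domain-specific specialist agents, two-phase verification, and S-score weighted fusion to improve confidence calibration and discrimination in medical multiple-choice question answering. On high-disagreement subsets of MedQA-USMLE and MedMCQA, the method substantially reduces expected calibration error (ECE) and makes uncertainty estimates more reliable.
Details
Motivation: Poorly calibrated confidence scores hinder clinical deployment of AI models; in particular, an overconfident model offers no useful signal for deferral decisions. The paper aims to improve practicality in safety-critical applications.
Result: Across four experimental settings (100-question and 250-question high-disagreement subsets of MedQA and MedMCQA), ECE is reduced by 49-74%. On MedQA-250, the full system achieves ECE = 0.091 (a 74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Two-Phase Verification is the primary driver of calibration gains, and multi-agent reasoning is the primary driver of accuracy gains.
Insight: The novelty is combining multi-agent specialist diagnosis with consistency-based two-phase self-verification (which produces the Specialist Confidence Score) and using a weighted fusion strategy for the final decision and confidence calibration. The method provides more reliable uncertainty estimates and a practical deferral confidence signal for safety-critical clinical AI applications.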
Abstract: Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis is then subjected to a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate across four experimental settings, covering 100-question and 250-question high-disagreement subsets of both MedQA-USMLE and MedMCQA. Calibration improvement is the central finding, with ECE reduced by 49-74% across all four settings, including the harder MedMCQA benchmark where these gains persist even when absolute accuracy is constrained by knowledge-intensive recall demands. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Ablation analysis identifies Two-Phase Verification as the primary calibration driver and multi-agent reasoning as the primary accuracy driver. These results establish that consistency-based verification produces more reliable uncertainty estimates across diverse medical question types, providing a practical confidence signal for deferral in safety-critical clinical AI applications.
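Since ECE is the headline metric here, a short sketch of how it is commonly computed may help. This is the standard equal-width binning estimator, not code from the paper; the bin count of 10 and the half-open bin edges are assumptions, and the paper's exact binning scheme is not stated in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin.

    confidences: predicted confidence in [0, 1] for each answered question.
    correct:     1 if the answer was right, else 0.
    Bins are [k/n_bins, (k+1)/n_bins), with confidence 1.0 placed in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# An overconfident toy model: always 90% confident, right half the time.
overconfident_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
```

In the toy case the gap between stated confidence (0.9) and accuracy (0.5) gives an ECE of about 0.4, which is the kind of overconfidence the abstract describes as offering no useful deferral signal.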
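cs.CR
[112] CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training cs.CR | cs.AI | cs.CV
Yuxi Chen, Haoyu Zhai, Chenkai Wang, Rui Yang, Lingming Zhang
TL;DR: This paper proposes ReCAP, a native GUI agent that can solve modern interactive CAPTCHA challenges while retaining its performance on general GUI tasks. By building a dynamic system covering seven CAPTCHA types, developing an automated data-generation pipeline, and constructing self-correction data from failed trajectories, it raises CAPTCHA-solving success from roughly 30% to 80%.
Details
Motivation: Existing native vision-language model (VLM) GUI agents struggle with CAPTCHA tasks, while specialized CAPTCHA-solving pipelines cannot handle general GUI tasks, so a GUI agent is needed that combines CAPTCHA-solving ability with generality.
Result: On held-out test sets, ReCAP raises CAPTCHA-solving success from roughly 30% to 80% while maintaining strong performance on general GUI-agent benchmarks.
Insight: The contributions are a dynamic CAPTCHA system that stresses core capabilities, automated generation of large-scale interaction data with reasoning traces, and a self-correction training mechanism built from failed trajectories that lets the agent reflect on errors and revise its actions online.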
Abstract: GUI agents are rapidly shifting from multi-module pipelines to end-to-end, native vision-language models (VLMs) that perceive raw screenshots and directly interact with digital devices. Despite rapid progress on general GUI tasks, CAPTCHA solving remains a major challenge. On the other hand, although specialized CAPTCHA solving pipelines exist, they cannot handle general GUI tasks. To address this gap, we introduce ReCAP: a CAPTCHA-capable native GUI agent that can robustly solve modern, interactive CAPTCHA challenges, while preserving its performance as a general GUI agent. We first develop a dynamic CAPTCHA system spanning seven representative CAPTCHA types, designed to stress primitive and complementary capabilities for CAPTCHA solving (e.g., robust OCR under heavy noise and text stylization, fine-grained visual understanding, and precise control). Then, we develop an automated data collection and curation pipeline that generates large-scale CAPTCHA interaction trajectories paired with reasoning traces. As CAPTCHA solving often requires multi-step interaction and recovery from intermediate mistakes, we further leverage failed trajectories to construct self-correction data, training agents to reflect on errors and correct their actions online. Across held-out test sets, ReCAP improves CAPTCHA-solving success from roughly 30% to 80%, while maintaining strong performance on general GUI-agent benchmarks.
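eess.IV
[113] Comparative analysis of dual-form networks for live land monitoring using multi-modal satellite image time series eess.IV | cs.AI | cs.CV
Iris Dumeur, Jérémy Anger, Gabriele Facciolo
TL;DR: This paper studies dual-form attention mechanisms for live land monitoring with multi-modal satellite image time series. It compares linear attention and retention mechanisms and adapts them to temporal irregularity, enabling parallel training and recurrent inference for incremental processing.
Details
Motivation: To overcome the quadratic computational complexity of Transformer architectures and their need to reprocess the full sequence in multi-modal satellite image time series analysis, so that large-scale live land monitoring becomes feasible.
Result: On multi-modal SITS forecasting and solar panel construction monitoring with Sentinel-1 and Sentinel-2 data, dual-form mechanisms perform comparably to standard Transformers, and the multi-modal framework consistently outperforms mono-modal approaches.
Insight: The contributions are temporally adapted dual-form mechanisms that compute token distances from actual acquisition dates rather than sequence indices, and an architecture that trains in parallel while supporting efficient recurrent inference, which opens new opportunities for land monitoring systems that need regular updates over large geographic areas.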
Abstract: Multi-modal Satellite Image Time Series (SITS) analysis faces significant computational challenges for live land monitoring applications. While Transformer architectures excel at capturing temporal dependencies and fusing multi-modal data, their quadratic computational complexity and the need to reprocess entire sequences for each new acquisition limit their deployment for regular, large-area monitoring. This paper studies various dual-form attention mechanisms for efficient multi-modal SITS analysis that enable parallel training while supporting recurrent inference for incremental processing. We compare linear attention and retention mechanisms within a multi-modal spectro-temporal encoder. To address SITS-specific challenges of temporal irregularity and misalignment, we develop temporal adaptations of dual-form mechanisms that compute token distances based on actual acquisition dates rather than sequence indices. Our approach is evaluated on two tasks using Sentinel-1 and Sentinel-2 data: multi-modal SITS forecasting as a proxy task, and real-world solar panel construction monitoring. Experimental results demonstrate that dual-form mechanisms achieve performance comparable to standard Transformers while enabling efficient recurrent inference. The multimodal framework consistently outperforms mono-modal approaches across both tasks, demonstrating the effectiveness of dual mechanisms for sensor fusion. The results presented in this work open new opportunities for operational land monitoring systems requiring regular updates over large geographic areas.
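To make the "dual form" idea concrete, the sketch below implements a simplified retention layer twice: once in a parallel form that looks at all earlier tokens together, and once in a recurrent form that keeps a single state matrix and updates it per acquisition. The decay is raised to the gap in days between acquisitions rather than the gap in sequence index. This is an illustrative assumption about how a date-based adaptation could look, not the authors' formulation; `gamma`, the function names, and the toy inputs are all hypothetical.

```python
def retention_parallel(q, k, v, days, gamma=0.9):
    """Parallel form: each output sums over all earlier tokens at once."""
    outputs = []
    for n in range(len(q)):
        out = [0.0] * len(v[0])
        for m in range(n + 1):
            decay = gamma ** (days[n] - days[m])  # gap in days, not in index
            score = decay * sum(a * b for a, b in zip(q[n], k[m]))
            out = [o + score * x for o, x in zip(out, v[m])]
        outputs.append(out)
    return outputs


def retention_recurrent(q, k, v, days, gamma=0.9):
    """Recurrent form: one state matrix, updated once per new acquisition."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs, prev_day = [], days[0]
    for q_n, k_n, v_n, day in zip(q, k, v, days):
        decay = gamma ** (day - prev_day)  # decay the old state by elapsed days
        state = [[decay * state[i][j] + k_n[i] * v_n[j] for j in range(d_v)]
                 for i in range(d_k)]
        outputs.append([sum(q_n[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
        prev_day = day
    return outputs


# Four acquisitions at irregular dates (days since the first one).
days = [0, 5, 6, 18]
q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
v = [[1.0], [2.0], [3.0], [4.0]]
parallel_out = retention_parallel(q, k, v, days)
recurrent_out = retention_recurrent(q, k, v, days)
```

Both functions return the same outputs, but the recurrent one only touches the newest token and the stored state, which is what makes incremental processing of each new acquisition cheap.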
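[114] Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamic eess.IV | cs.CV | q-bio.NC
Wanying Qu, Jianxiong Gao, Wei Wang, Yanwei Fu
TL;DR: This paper proposes an EEG-conditioned framework that reconstructs dynamic functional magnetic resonance imaging (fMRI) sequences with high spatial fidelity and strong temporal coherence from electroencephalography (EEG) signals, and introduces null-space intermediate-frame reconstruction to handle the sampling irregularities of real fMRI acquisitions.
Details
Motivation: fMRI acquisition is costly and hard to scale. The work exploits the complementarity between EEG's millisecond-level temporal cues and fMRI's high spatial resolution to achieve high-quality, continuous dynamic fMRI reconstruction.
Result: Experiments on the CineBrain dataset show superior voxel-wise reconstruction quality and robust temporal consistency across whole-brain and functionally specific regions. The reconstructed fMRI preserves essential functional information and supports downstream visual decoding tasks.
Insight: The novelty lies in the EEG-conditioned dynamic fMRI reconstruction framework and the null-space intermediate-frame reconstruction technique, which improve sequence continuity and practical applicability. The work offers a new route to estimating high-resolution fMRI dynamics from EEG and moves multimodal neuroimaging toward more dynamic modeling of brain activity.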
Abstract: Capturing dynamic spatiotemporal neural activity is essential for understanding large-scale brain mechanisms. Functional magnetic resonance imaging (fMRI) provides high-resolution cortical representations that form a strong basis for characterizing fine-grained brain activity patterns. The high acquisition cost of fMRI limits large-scale applications, therefore making high-quality fMRI reconstruction a crucial task. Electroencephalography (EEG) offers millisecond-level temporal cues that complement fMRI. Leveraging this complementarity, we present an EEG-conditioned framework for reconstructing dynamic fMRI as continuous neural sequences with high spatial fidelity and strong temporal coherence at the cortical-vertex level. To address sampling irregularities common in real fMRI acquisitions, we incorporate a null-space intermediate-frame reconstruction, enabling measurement-consistent completion of arbitrary intermediate frames and improving sequence continuity and practical applicability. Experiments on the CineBrain dataset demonstrate superior voxel-wise reconstruction quality and robust temporal consistency across whole-brain and functionally specific regions. The reconstructed fMRI also preserves essential functional information, supporting downstream visual decoding tasks. This work provides a new pathway for estimating high-resolution fMRI dynamics from EEG and advances multimodal neuroimaging toward more dynamic brain activity modeling.
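cs.RO
[115] Learning Actionable Manipulation Recovery via Counterfactual Failure Synthesis cs.RO | cs.CV
Dayou Li, Jiuzhou Lei, Hao Wang, Lulin Liu, Yunhao Yang
TL;DR: This paper proposes Dream2Fix, a framework for autonomous recovery from execution errors in robotic manipulation. Using a generative world model, it synthesizes photorealistic counterfactual failure rollouts directly from successful real-world demonstrations, producing a large set of paired failure-correction data. This data is used to train a vision-language model that jointly predicts failure types and precise recovery trajectories, mapping visual anomalies directly to corrective actions.
Details
Motivation: Current foundation-model-based robotic manipulation systems struggle to recover autonomously from execution errors. Existing failure-learning paradigms rely on either costly and unsafe real-world data collection or simulator-based perturbations with a severe sim-to-real gap, and existing visual analyzers mostly output coarse binary diagnoses rather than the executable, trajectory-level corrections that actual recovery requires.
Result: Extensive real-world robot experiments show state-of-the-art correction accuracy, improving from 19.7% for prior baselines to 81.3%, and zero-shot closed-loop failure recovery in physical deployments.
Insight: The novelty is a simulator-free framework that synthesizes physically viable counterfactual failure data with a generative world model and adds a structured verification mechanism to ensure data quality. This bridges the gap between failure diagnosis and executable recovery and enables end-to-end learning from visual perception to recovery actions.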
Abstract: While recent foundation models have significantly advanced robotic manipulation, these systems still struggle to autonomously recover from execution errors. Current failure-learning paradigms rely on either costly and unsafe real-world data collection or simulator-based perturbations, which introduce a severe sim-to-real gap. Furthermore, existing visual analyzers predominantly output coarse, binary diagnoses rather than the executable, trajectory-level corrections required for actual recovery. To bridge the gap between failure diagnosis and actionable recovery, we introduce Dream2Fix, a framework that synthesizes photorealistic, counterfactual failure rollouts directly from successful real-world demonstrations. By perturbing actions within a generative world model, Dream2Fix creates paired failure-correction data without relying on simulators. To ensure the generated data is physically viable for robot learning, we implement a structured verification mechanism that strictly filters rollouts for task validity, visual coherence, and kinematic safety. This engine produces a high-fidelity dataset of over 120k paired samples. Using this dataset, we fine-tune a vision-language model to jointly predict failure types and precise recovery trajectories, mapping visual anomalies directly to corrective actions. Extensive real-world robotic experiments show our approach achieves state-of-the-art correction accuracy, improving from 19.7% to 81.3% over prior baselines, and successfully enables zero-shot closed-loop failure recovery in physical deployments.
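[116] Bio-Inspired Event-Based Visual Servoing for Ground Robots cs.RO | cs.CV
Maral Mordad, Kian Behzad, Debojyoti Biswas, Noah J. Cowan, Milad Siami
TL;DR: Inspired by active sensing behaviors in biology, this paper proposes a new event-camera visual servoing framework for ground robots. It uses the event stream from a Dynamic Vision Sensor (DVS), applies fixed spatial kernels to events generated from logarithmic intensity-change patterns to isolate specific kinematic states, and designs a bio-inspired active sensing limit-cycle controller to address the loss of linear observability at equilibrium. Experiments confirm the method's efficacy, very low latency, and computational efficiency.
Details
Motivation: Biological sensory systems filter out constant stimuli and prioritize relative changes, which likely improves computational and metabolic efficiency. Inspired by this, the work aims to develop an efficient, low-latency visual servoing method for ground robots that addresses the challenges traditional methods face in state estimation and observability at equilibrium.
Result: Experimental validation on a 1/10-scale autonomous ground vehicle confirms the efficacy, very low latency, and computational efficiency of the proposed direct-sensing approach.
Insight: The novelty lies in isolating kinematic states directly from the event stream with fixed spatial kernels, without traditional state estimation, and in introducing a bio-inspired active sensing limit-cycle controller to overcome the loss of observability at equilibrium inherent in event sensing, yielding efficient, low-latency servo control.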
Abstract: Biological sensory systems are inherently adaptive, filtering out constant stimuli and prioritizing relative changes, likely enhancing computational and metabolic efficiency. Inspired by active sensing behaviors across a wide range of animals, this paper presents a novel event-based visual servoing framework for ground robots. Utilizing a Dynamic Vision Sensor (DVS), we demonstrate that by applying a fixed spatial kernel to the asynchronous event stream generated from structured logarithmic intensity-change patterns, the resulting net event flux analytically isolates specific kinematic states. We establish a generalized theoretical bound for this event rate estimator and show that linear and quadratic spatial profiles isolate the robot’s velocity and position-velocity product, respectively. Leveraging these properties, we employ a multi-pattern stimulus to directly synthesize a nonlinear state-feedback term entirely without traditional state estimation. To overcome the inescapable loss of linear observability at equilibrium inherent in event sensing, we propose a bio-inspired active sensing limit-cycle controller. Experimental validation on a 1/10-scale autonomous ground vehicle confirms the efficacy, extreme low-latency, and computational efficiency of the proposed direct-sensing approach.
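[117] Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation cs.RO | cs.AI | cs.CV
Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Ying Sun, Yang Xiao
TL;DR: This paper proposes Chameleon, a robot memory system inspired by human episodic memory. It stores disambiguating context as geometry-grounded multimodal tokens and performs goal-directed recall through a differentiable memory stack, addressing the perceptual aliasing that occlusion and state changes cause in robotic manipulation.
Details
Motivation: Perceptual aliasing (the same observation may arise from different interaction histories) makes decision-making in robotic manipulation non-Markovian. Existing memory methods based on semantic compression and similarity-based retrieval discard fine-grained perceptual cues and may recall irrelevant episodes.
Result: On Camo-Dataset, a real-robot UR5e dataset covering episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing, Chameleon consistently improves decision reliability and long-horizon control in perceptually confusable settings, outperforming strong baselines.
Insight: The contributions are geometry-grounded multimodal tokens that preserve disambiguating context, a differentiable memory stack for goal-directed recall, and a real-robot dataset covering perceptual aliasing scenarios. Viewed objectively, the method combines fine-grained perception with memory retrieval and may improve long-horizon manipulation in complex environments.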
Abstract: Robotic manipulation often requires memory: occlusion and state changes can make decision-time observations perceptually aliased, making action selection non-Markovian at the observation level because the same observation may arise from different interaction histories. Most embodied agents implement memory via semantically compressed traces and similarity-based retrieval, which discards disambiguating fine-grained perceptual cues and can return perceptually similar but decision-irrelevant episodes. Inspired by human episodic memory, we propose Chameleon, which writes geometry-grounded multimodal tokens to preserve disambiguating context and produces goal-directed recall through a differentiable memory stack. We also introduce Camo-Dataset, a real-robot UR5e dataset spanning episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing. Across tasks, Chameleon consistently improves decision reliability and long-horizon control over strong baselines in perceptually confusable settings.
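physics.optics
[118] Machine vision with small numbers of detected photons per inference physics.optics | cs.CV | cs.ET | cs.LG | physics.data-an
Shi-Yuan Ma, Jérémie Laydevant, Mandar M. Sohoni, Logan G. Wright, Tianyu Wang
TL;DR: This paper proposes photon-aware neuromorphic sensing (PANS), an end-to-end optimization approach for machine vision tasks such as object recognition and image reconstruction under extremely low light (an average photon count per pixel near or below 1). Training incorporates knowledge of the low photon budget and the stochastic nature of light detection. A proof-of-principle experiment achieves high classification accuracy (73% on FashionMNIST and 86% on MNIST) with very few detected photons (4.9 and 8.6 on average, respectively), orders of magnitude more photon-efficient than conventional approaches.
Details
Motivation: Current machine vision systems work very well in moderate or bright light but face a much harder problem in extremely low light (an average photon count per pixel near or below 1), because conventional methods cannot cope effectively with extreme photon scarcity and the stochasticity of detection.
Result: On FashionMNIST, PANS reaches 73% accuracy with an average of only 4.9 detected photons per inference (82% with 17 photons); on MNIST, it reaches 86% accuracy with an average of 8.6 photons (97% with 29 photons), orders of magnitude more photon-efficient than conventional approaches.
Insight: The novelty lies in building the low photon budget and the statistics of light detection (Poisson noise) directly into the end-to-end optimization framework, enabling efficient machine vision in extremely photon-starved scenarios. The method could be extended to non-classical states, as in quantum sensing, or to other photon-starved hardware setups, giving it broad applicability.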
Abstract: Machine vision, including object recognition and image reconstruction, is a central technology in many consumer devices and scientific instruments. The design of machine-vision systems has been revolutionized by the adoption of end-to-end optimization, in which the optical front end and the post-processing back end are jointly optimized. However, while machine vision currently works extremely well in moderate-light or bright-light situations – where a camera may detect thousands of photons per pixel and billions of photons per frame – it is far more challenging in very low-light situations. We introduce photon-aware neuromorphic sensing (PANS), an approach for end-to-end optimization in highly photon-starved scenarios. The training incorporates knowledge of the low photon budget and the stochastic nature of light detection when the average number of photons per pixel is near or less than 1. We report a proof-of-principle experimental demonstration in which we performed low-light image classification using PANS, achieving 73% (82%) accuracy on FashionMNIST with an average of only 4.9 (17) detected photons in total per inference, and 86% (97%) on MNIST with 8.6 (29) detected photons – orders of magnitude more photon-efficient than conventional approaches. We also report simulation studies showing how PANS could be applied to other classification, event-detection, and image-reconstruction tasks. By taking into account the statistics of measurement results for non-classical states or alternative sensing hardware, PANS could in principle be adapted to enable high-accuracy results in quantum and other photon-starved setups.
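To give a feel for the regime described above, the sketch below simulates what a detector records when only a handful of photons arrive per inference. It models each pixel's count as a Poisson draw whose mean is that pixel's share of the total photon budget. This only illustrates the sensing statistics and is not the PANS training procedure; the function names, the toy image, and the choice of Knuth's sampling algorithm are assumptions made for the example.

```python
import math
import random


def _poisson(lam, rng):
    """Draw one Poisson sample with mean lam (Knuth's method, fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_photon_counts(image, photon_budget, rng):
    """Return per-pixel detected photon counts for one exposure.

    image: 2D list of non-negative intensities (at least one must be positive).
    photon_budget: expected total number of detected photons per inference.
    """
    total = sum(sum(row) for row in image)
    return [[_poisson(photon_budget * px / total, rng) for px in row]
            for row in image]


rng = random.Random(0)
toy_image = [[0.0, 1.0, 0.0],
             [1.0, 4.0, 1.0],
             [0.0, 1.0, 0.0]]
# One exposure with the 4.9-photon average budget quoted for FashionMNIST.
counts = sample_photon_counts(toy_image, 4.9, rng)
```

With a budget this small most pixels read zero on any single exposure, so two exposures of the same scene can look quite different. That variability is the stochasticity the abstract says is accounted for during training.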
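cs.LG
[119] The Geometric Price of Discrete Logic: Context-driven Manifold Dynamics of Number Representations cs.LG | cs.CL | cs.CY
Long Zhang, Dai-jun Lin, Wei-neng Chen
TL;DR: This paper studies the fundamental tension between continuous semantic spaces and discrete logical reasoning in large language models (LLMs). It argues that task context acts as a non-isometric dynamical operator that drives the formation of discrete decision boundaries by inducing a necessary "topological distortion". Applying Gram-Schmidt decomposition to residual-stream activations, the authors reveal a dual-modulation mechanism: a class-agnostic topological preservation that anchors global structure to prevent semantic collapse, and a specific algebraic divergence that directionally tears apart cross-class concepts to form logical boundaries. This geometric evolution is validated across a range of tasks, from simple mapping to complex primality testing. Vector ablation experiments establish a strict causal binding between this topology and model function. The study also identifies a three-phase layer-wise geometric dynamic and shows that under social pressure prompts models fail to generate sufficient divergence, leading to "manifold entanglement" that geometrically explains sycophancy and hallucination. Ultimately, the work revises the linear-isometric presumption, showing that discrete logic in LLMs emerges at the cost of irreducible topological deformation.
Details
Motivation: To resolve the fundamental tension in LLMs between smooth generalization across continuous semantic spaces and the discrete decision boundaries that strict logical reasoning requires, challenging prevailing theories that rely on linear isometric projections.
Result: The geometric evolution mechanism is validated across a gradient of tasks, from simple mapping to complex primality testing. Targeted vector ablation establishes a strict causal binding between topology and model function: algebraically erasing the divergence component drops parity classification accuracy from 100% to chance levels (38.57%). The study also shows that under social pressure prompts models generate insufficient divergence and exhibit "manifold entanglement", a geometric explanation for sycophancy and hallucination.
Insight: The core contribution is the dual-modulation mechanism (topological preservation and algebraic divergence) in which task context acts as a non-isometric dynamical operator that drives "topological distortion", giving a geometric account of how discrete logic forms and what it costs (for example, hallucination). This revises the linear-isometric presumption about LLM internal representations and provides a new theoretical framework and a causal experimental method (vector ablation) for understanding model reasoning and generalization failures such as sycophancy.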
Abstract: Large language models (LLMs) generalize smoothly across continuous semantic spaces, yet strict logical reasoning demands the formation of discrete decision boundaries. Prevailing theories relying on linear isometric projections fail to resolve this fundamental tension. In this work, we argue that task context operates as a non-isometric dynamical operator that enforces a necessary “topological distortion.” By applying Gram-Schmidt decomposition to residual-stream activations, we reveal a dual-modulation mechanism driving this process: a class-agnostic topological preservation that anchors global structure to prevent semantic collapse, and a specific algebraic divergence that directionally tears apart cross-class concepts to forge logical boundaries. We validate this geometric evolution across a gradient of tasks, from simple mapping to complex primality testing. Crucially, targeted vector ablation establishes a strict causal binding between this topology and model function: algebraically erasing the divergence component collapses parity classification accuracy from 100% to chance levels (38.57%). Furthermore, we uncover a three-phase layer-wise geometric dynamic and demonstrate that under social pressure prompts, models fail to generate sufficient divergence. This results in a “manifold entanglement” that geometrically explains sycophancy and hallucination. Ultimately, our findings revise the linear-isometric presumption, demonstrating that the emergence of discrete logic in LLMs is purchased at an irreducible cost of topological deformation.
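The two operations named in this abstract, Gram-Schmidt decomposition and ablation of one component, can be sketched generically. The example below orthonormalizes a "shared" direction and a "divergence" direction, then removes the divergence component from an activation vector while leaving the shared component untouched. The direction names, the three-dimensional toy vectors, and the helper functions are hypothetical; how the authors actually derive these directions from residual-stream activations is not described in the abstract.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gram_schmidt(vectors):
    """Return an orthonormal basis spanning the input vectors, in input order."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > 1e-12:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def ablate(activation, unit_direction):
    """Subtract the component of `activation` along a unit-norm direction."""
    coeff = dot(activation, unit_direction)
    return [a - coeff * d for a, d in zip(activation, unit_direction)]


# Hypothetical directions: one shared across classes, one separating them.
shared_raw = [1.0, 1.0, 0.0]
divergence_raw = [1.0, 0.0, 1.0]
shared, divergence = gram_schmidt([shared_raw, divergence_raw])

activation = [2.0, -1.0, 3.0]
ablated = ablate(activation, divergence)
```

After ablation the vector has zero projection on `divergence` and the same projection on `shared` as before, which is the property that lets an ablation experiment attribute a drop in accuracy to one component alone.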
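[120] Can VLMs Reason Robustly? A Neuro-Symbolic Investigation cs.LG | cs.AI | cs.CV
Weixin Chen, Antonio Vergari, Han Zhao
TL;DR: This paper studies whether vision-language models (VLMs) can reason robustly under distribution shift. It finds that conventional fine-tuning generalizes poorly under covariate shift and proposes VLC, a neuro-symbolic method that combines VLM-based concept recognition with circuit-based symbolic reasoning and is robust to distribution shift on three visual deductive reasoning tasks.
Details
Motivation: To investigate whether VLMs can still reason robustly when the perceptual input distribution changes (covariate shift) while the underlying prediction rules stay the same, and to address the inconsistent robustness under such shifts of both fine-tuning and neuro-symbolic methods that rely on black-box components.
Result: Experiments on three visual deductive reasoning tasks with distinct rule sets show that VLC consistently achieves strong performance under covariate shift, demonstrating its ability to support robust reasoning.
Insight: The novelty is VLC, a neuro-symbolic framework that decouples VLM perception from exact symbolic circuit execution. The core insight is that compiling task rules into a symbolic program (a circuit) that executes exactly makes the reasoning step deterministic and therefore consistently robust under distribution shift, which offers a useful path toward fixing brittle reasoning in VLMs.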
Abstract: Vision-Language Models (VLMs) have been applied to a wide range of reasoning tasks, yet it remains unclear whether they can reason robustly under distribution shifts. In this paper, we study covariate shifts in which the perceptual input distribution changes while the underlying prediction rules do not. To investigate this question, we consider visual deductive reasoning tasks, where a model is required to answer a query given an image and logical rules defined over the object concepts in the image. Empirically, we find that VLMs fine-tuned through gradient-based end-to-end training can achieve high in-distribution accuracy but fail to generalize under such shifts, suggesting that fine-tuning does not reliably induce the underlying reasoning function. This motivates a neuro-symbolic perspective that decouples perception from reasoning. However, we further observe that recent neuro-symbolic approaches that rely on black-box components for reasoning can still exhibit inconsistent robustness across tasks. To address this issue, we propose VLC, a neuro-symbolic method that combines VLM-based concept recognition with circuit-based symbolic reasoning. In particular, task rules are compiled into a symbolic program, specifically a circuit, which executes the rules exactly over the object concepts recognized by the VLM. Experiments on three visual deductive reasoning tasks with distinct rule sets show that VLC consistently achieves strong performance under covariate shifts, highlighting its ability to support robust reasoning.
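The decoupling described here can be illustrated with a toy example: perception produces truth values for named concepts, and a rule compiled into a small Boolean circuit is evaluated exactly over them. The tuple-based circuit encoding, the traffic rule, and the concept names below are invented for illustration and are not taken from VLC, whose circuit representation the abstract does not specify.

```python
def evaluate(circuit, concepts):
    """Evaluate a Boolean circuit over a dict of recognized concepts.

    A circuit is a nested tuple: ("var", name), ("not", c),
    ("and", c1, c2, ...) or ("or", c1, c2, ...).
    """
    op = circuit[0]
    if op == "var":
        return concepts[circuit[1]]
    if op == "not":
        return not evaluate(circuit[1], concepts)
    if op == "and":
        return all(evaluate(child, concepts) for child in circuit[1:])
    if op == "or":
        return any(evaluate(child, concepts) for child in circuit[1:])
    raise ValueError(f"unknown gate: {op}")


# Hypothetical rule: stop if the light is red, or if a pedestrian is present
# and the light is not green.
STOP_RULE = (
    "or",
    ("var", "red_light"),
    ("and", ("var", "pedestrian"), ("not", ("var", "green_light"))),
)

# Stand-in for the output of a perception model on one image.
recognized = {"red_light": False, "pedestrian": True, "green_light": False}
must_stop = evaluate(STOP_RULE, recognized)
```

Because the rule is executed exactly, the answer can only be wrong if a concept was misrecognized, so any shift in image appearance affects the perception step but not the reasoning step.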
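[121] CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents cs.LG | cs.AI | cs.CV
Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li
TL;DR: This paper introduces CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for computer-use agents (CUAs). At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks, 55 hours of continuous screen recordings, cursor traces, and multi-layered reasoning annotations. It also provides the UI-Vision benchmark and the large-scale annotated GroundCUA dataset to support agent evaluation and training.
Details
Motivation: Progress on general-purpose computer-use agents is limited by the lack of continuous, high-quality human demonstration video. The largest existing open dataset, ScaleCUA, contains only the equivalent of about 20 hours of video, which is not enough to train agents at scale.
Result: Preliminary evaluation shows that current foundation action models fail roughly 60% of tasks on professional desktop applications, highlighting the limits of existing models. CUA-Suite provides the UI-Vision benchmark for evaluation and the GroundCUA dataset with 3.6 million UI element annotations.
Insight: The novelty lies in large-scale expert demonstration video at a continuous 30 fps, which fully preserves the temporal dynamics of human-computer interaction and carries more information than sparse datasets that record only final click coordinates. The data can be losslessly converted into the formats existing agent frameworks require and supports emerging research directions such as screen parsing, continuous spatial control, video-based reward modeling, and visual world models.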
Abstract: Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite’s rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.
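The scale figures in this abstract can be cross-checked with a little arithmetic. The sketch below assumes the ScaleCUA screenshot count is converted to video time at the same 30 fps rate quoted for VideoCUA, which the abstract does not state explicitly; the function names are made up for the example.

```python
FPS = 30  # frame rate quoted for VideoCUA recordings


def frames_to_hours(frames, fps=FPS):
    """Convert a frame count to hours of video at the given frame rate."""
    return frames / fps / 3600


def hours_to_frames(hours, fps=FPS):
    """Convert hours of video to a frame count at the given frame rate."""
    return hours * 3600 * fps


# ScaleCUA: 2 million screenshots, treated as frames at 30 fps (assumption).
scalecua_hours = frames_to_hours(2_000_000)

# VideoCUA: about 55 hours of 30 fps video.
videocua_frames = hours_to_frames(55)

# GroundCUA: lower bound on mean UI element annotations per screenshot.
elements_per_screenshot = 3_600_000 / 56_000
```

Under that assumption, 2 million screenshots correspond to roughly 18.5 hours, consistent with "less than 20 hours", and 55 hours at 30 fps is 5.94 million frames, consistent with "approximately 6 million frames". GroundCUA averages at least 64 annotated elements per screenshot.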
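[122] UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience cs.LG | cs.AI | cs.CV
Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao
TL;DR: This paper proposes UI-Voyager, a two-stage self-evolving mobile graphical user interface (GUI) agent that addresses two weaknesses of existing methods on long-horizon GUI tasks: inefficient learning from failed trajectories and ambiguous credit assignment under sparse rewards.
Details
Motivation: Existing autonomous mobile GUI agents based on multimodal large language models (MLLMs) learn inefficiently from failed experience on long tasks and struggle with precise credit assignment under sparse rewards.
Result: On the AndroidWorld benchmark, the 4B-parameter model achieves an 81.0% Pass@1 success rate, outperforming several recent baselines and exceeding human-level performance.
Insight: The novelty lies in a fully autonomous loop in which data and models co-evolve (Rejection Fine-Tuning, RFT) and in Group Relative Self-Distillation (GRSD), which builds dense step-level supervision from successful trajectories to correct failed ones, enabling efficient self-evolving GUI automation without expensive manual annotation.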
Abstract: Autonomous mobile GUI agents have attracted increasing attention along with the advancement of Multimodal Large Language Models (MLLMs). However, existing methods still suffer from inefficient learning from failed trajectories and ambiguous credit assignment under sparse rewards for long-horizon GUI tasks. To that end, we propose UI-Voyager, a novel two-stage self-evolving mobile GUI agent. In the first stage, we employ Rejection Fine-Tuning (RFT), which enables the continuous co-evolution of data and models in a fully autonomous loop. The second stage introduces Group Relative Self-Distillation (GRSD), which identifies critical fork points in group rollouts and constructs dense step-level supervision from successful trajectories to correct failed ones. Extensive experiments on AndroidWorld show that our 4B model achieves an 81.0% Pass@1 success rate, outperforming numerous recent baselines and exceeding human-level performance. Ablation and case studies further verify the effectiveness of GRSD. Our method represents a significant leap toward efficient, self-evolving, and high-performance mobile GUI automation without expensive manual data annotation.
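The notion of a "fork point" between a successful and a failed rollout can be pictured with a very small sketch. It assumes each trajectory is a list of (state, action) pairs with directly comparable states, and it treats the first step where the states match but the actions differ as the fork. This is an illustrative guess at the idea rather than the GRSD procedure, which the abstract does not spell out; the function name and the toy trajectories are hypothetical.

```python
def find_fork_point(success, failure):
    """Return the first step index where both rollouts are in the same state
    but choose different actions, or None if no such step exists."""
    for step, ((s_state, s_action), (f_state, f_action)) in enumerate(
            zip(success, failure)):
        if s_state != f_state:
            return None  # rollouts diverged earlier without a comparable step
        if s_action != f_action:
            return step
    return None


# Toy rollouts for the same task: (screen, action) pairs.
successful = [("home", "open_settings"), ("settings", "tap_wifi"), ("wifi", "toggle_on")]
failed = [("home", "open_settings"), ("settings", "tap_bluetooth"), ("bluetooth", "toggle_on")]

fork = find_fork_point(successful, failed)
# A step-level training label could pair the shared state with the action
# taken by the successful rollout at the fork.
label = (successful[fork][0], successful[fork][1]) if fork is not None else None
```

In this toy case the fork is at step 1, and the resulting label pairs the "settings" screen with the "tap_wifi" action. Turning a sparse end-of-episode reward into a label at a specific step is the kind of dense supervision the abstract refers to.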