Table of Contents
- cs.CL [Total: 14]
- cs.CV [Total: 54]
- cs.RO [Total: 3]
- cs.LG [Total: 5]
- eess.IV [Total: 1]
- cs.MM [Total: 1]
- q-fin.CP [Total: 1]
- cs.AI [Total: 2]
- cs.IR [Total: 1]
cs.CL
[1] T5Gemma 2: Seeing, Reading, and Understanding Longer cs.CL
Biao Zhang, Paul Suganthan, Gaël Liu, Ilya Philippov, Sahil Dua
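TL;DR: This paper introduces T5Gemma 2, the next generation of the T5Gemma family of lightweight open encoder-decoder models, with strong multilingual, multimodal, and long-context capabilities. Built on the Gemma 3 models, it adapts a decoder-only model into an encoder-decoder architecture via UL2 and extends the recipe to the multimodal setting. The paper proposes two efficiency improvements: tying the word embeddings across encoder and decoder, and merging decoder self-attention and cross-attention into a single joint module. Experiments show that the adaptation strategy generalizes across architectures and modalities and that the encoder-decoder architecture has a distinct advantage in long-context modeling.
Details
Motivation: To extend the T5Gemma family by adapting pretrained decoder-only models into encoder-decoder architectures, adding multimodal capability, and improving efficiency, in order to handle long-context tasks better.
Result: Experiments show that T5Gemma 2 matches or exceeds Gemma 3 in pretraining performance and improves significantly in post-training performance. The model shows a distinct advantage in long-context modeling; the abstract does not name specific benchmarks or SOTA comparisons.
Insight: The contributions are adapting a decoder-only model into an encoder-decoder architecture via UL2 and extending it to the multimodal setting, plus tied word embeddings and merged attention for efficiency. Viewed objectively, the adaptation strategy generalizes across architectures and modalities and suggests a new direction for designing lightweight multimodal models.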
Abstract: We introduce T5Gemma 2, the next generation of the T5Gemma family of lightweight open encoder-decoder models, featuring strong multilingual, multimodal and long-context capabilities. T5Gemma 2 follows the adaptation recipe (via UL2) in T5Gemma – adapting a pretrained decoder-only model into an encoder-decoder model, and extends it from text-only regime to multimodal based on the Gemma 3 models. We further propose two methods to improve the efficiency: tied word embedding that shares all embeddings across encoder and decoder, and merged attention that unifies decoder self- and cross-attention into a single joint module. Experiments demonstrate the generality of the adaptation strategy over architectures and modalities as well as the unique strength of the encoder-decoder architecture on long context modeling. Similar to T5Gemma, T5Gemma 2 yields comparable or better pretraining performance and significantly improved post-training performance than its Gemma 3 counterpart. We release the pretrained models (270M-270M, 1B-1B and 4B-4B) to the community for future research.
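The abstract describes merged attention only as unifying decoder self- and cross-attention into a single joint module. A minimal sketch of one plausible reading follows: each decoder position runs one softmax over the encoder keys (fully visible) and the decoder keys (causally masked). The function names and the masking layout are assumptions made for illustration and may differ from the paper's formulation.

```python
import math

def merged_attention_mask(t_enc, t_dec):
    """Row i is decoder position i; columns are encoder keys, then decoder keys."""
    # Assumption: encoder keys are always visible, decoder keys are causal.
    return [[True] * t_enc + [j <= i for j in range(t_dec)] for i in range(t_dec)]

def masked_softmax(scores, mask):
    """A single softmax over the joint (encoder + decoder) key set."""
    top = max(s for s, m in zip(scores, mask) if m)
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

mask = merged_attention_mask(t_enc=2, t_dec=3)
# Decoder position 0 sees both encoder keys and itself, but no later decoder keys.
weights = masked_softmax([0.5, 1.0, 2.0, 3.0, 4.0], mask[0])
```

In this reading, a single set of attention weights replaces the two separate softmaxes a standard decoder layer computes for self- and cross-attention.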
[2] Parameter Efficient Multimodal Instruction Tuning for Romanian Vision Language Models cs.CL | cs.AI | cs.LG
George-Andrei Dima, Dumitru-Clementin Cercel
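TL;DR: This paper targets Romanian, a low-resource language. The authors translate the Flickr30k dataset into Romanian, extend it into a visual question answering dataset, and use LoRA to efficiently fine-tune open-source vision-language models (LLaMA 3.2, LLaVA 1.6, and Qwen2), improving performance on Romanian visual question answering and image description generation.
Details
Motivation: To address the scarcity of Romanian resources in multimodal NLP and advance the democratization of generative AI.
Result: Performance on Romanian visual question answering improves markedly. The seven-billion-parameter Qwen2-VL-RoVQA gains +6.05% and +2.61% in BERTScore F1 over its original version on the two tasks, and performs well on Romanian image description generation, a task it was not trained on, with far fewer grammatical errors.
Insight: The contribution lies in building multimodal datasets for a low-resource language and combining them with parameter-efficient LoRA fine-tuning, which improves language-specific understanding and fluency and demonstrates cross-task generalization.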
Abstract: Focusing on low-resource languages is an essential step toward democratizing generative AI. In this work, we contribute to reducing the multimodal NLP resource gap for Romanian. We translate the widely known Flickr30k dataset into Romanian and further extend it for visual question answering by leveraging open-source LLMs. We demonstrate the usefulness of our datasets by fine-tuning open-source VLMs on Romanian visual question answering. We select VLMs from three widely used model families: LLaMA 3.2, LLaVA 1.6, and Qwen2. For fine-tuning, we employ the parameter-efficient LoRA method. Our models show improved Romanian capabilities in visual QA, as well as on tasks they were not trained on, such as Romanian image description generation. The seven-billion-parameter Qwen2-VL-RoVQA obtains top scores on both tasks, with improvements of +6.05% and +2.61% in BERTScore F1 over its original version. Finally, the models show substantial reductions in grammatical errors compared to their original forms, indicating improvements not only in language understanding but also in Romanian fluency.
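For readers unfamiliar with LoRA, the standard formulation keeps the pretrained weight W frozen and adds a scaled low-rank update, h = W x + (alpha / r) B A x. The sketch below shows that forward pass on plain lists. The helper names and toy matrices are illustrative only and are not taken from the paper.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w, a, b, x, alpha):
    """h = W x + (alpha / r) * B (A x); W is frozen, only A and B are trained."""
    r = len(a)                          # rank = number of rows in A (r x d_in)
    base = matvec(w, x)                 # frozen pretrained path
    update = matvec(b, matvec(a, x))    # low-rank path, B is d_out x r
    return [h + (alpha / r) * u for h, u in zip(base, update)]

w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]
x = [2.0, 3.0]
# B starts at zero in standard LoRA, so the adapted layer initially equals the base layer.
h_init = lora_forward(w, a, [[0.0], [0.0]], x, alpha=2.0)
h_tuned = lora_forward(w, a, [[1.0], [0.0]], x, alpha=2.0)
```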
[3] Cross-Tokenizer Likelihood Scoring Algorithms for Language Model Distillation cs.CL | cs.LG
Buu Phan, Ashish Khisti, Karen Ullrich
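TL;DR: This paper proposes cross-tokenizer likelihood scoring algorithms for the vocabulary misalignment that arises in language model distillation when teacher and student use different tokenizers. Exploiting an implicit recursive structure in Byte-Pair Encoding (BPE), the method builds a probabilistic framework that computes sequence likelihoods and supports sequential sampling in two settings: when the student vocabulary is a subset of the teacher vocabulary, and when it is arbitrary.
Details
Motivation: Standard knowledge distillation requires teacher and student to share the same probability space. When edge-device deployment calls for a smaller vocabulary to reduce memory overhead, the two models use different tokenizers, their vocabularies are misaligned, and next-token likelihood ratios become hard to compute.
Result: In the subset case, the method computes exact likelihoods with O(1) model evaluations per token; used for distillation, it reduces the memory footprint of Qwen2.5-1.5B by up to 12% and improves baseline performance by up to 4% on the evaluated tasks. In the general case, applied to distillation for mathematical reasoning, it improves GSM8K accuracy by more than 2% over the current state of the art.
Insight: The contribution is uncovering the implicit recursive structure of the BPE algorithm and using it to build a probabilistic framework for cross-tokenizer likelihood scoring. This gives a rigorous, lossless procedure for aligning different vocabularies, together with a practical fast approximation, and improves distillation efficiency and model performance.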
Abstract: Computing next-token likelihood ratios between two language models (LMs) is a standard task in training paradigms such as knowledge distillation. Since this requires both models to share the same probability space, it becomes challenging when the teacher and student LMs use different tokenizers, for instance, when edge-device deployment necessitates a smaller vocabulary size to lower memory overhead. In this work, we address this vocabulary misalignment problem by uncovering an implicit recursive structure in the commonly deployed Byte-Pair Encoding (BPE) algorithm and utilizing it to create a probabilistic framework for cross-tokenizer likelihood scoring. Our method enables sequence likelihood evaluation for vocabularies different from the teacher model's native tokenizer, addressing two specific scenarios: when the student vocabulary is a subset of the teacher vocabulary, and the general case where it is arbitrary. In the subset regime, our framework computes exact likelihoods and provides next-token probabilities for sequential sampling with only O(1) model evaluations per token. When used for distillation, this yields up to a 12% reduction in memory footprint for the Qwen2.5-1.5B model while also improving baseline performance by up to 4% on the evaluated tasks. For the general case, we introduce a rigorous lossless procedure that leverages BPE's recursive structure, complemented by a fast approximation that keeps large-vocabulary settings practical. Applied to distillation for mathematical reasoning, our approach improves GSM8K accuracy by more than 2% over the current state of the art.
[4] Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams cs.CL | cs.AI | cs.CV
Yiming Cui, Xin Yao, Yuxuan Qin, Xin Li, Shijin Wang
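TL;DR: This paper systematically evaluates 40 proprietary and open-source multimodal LLMs on Chemistry Olympiad-style questions. Many models struggle with modality fusion, to the point that removing the image sometimes improves accuracy, while Chain-of-Thought prompting consistently improves both accuracy and visual grounding.
Details
Motivation: To address the major challenges multimodal LLMs face in scientific reasoning in fields such as chemistry, which combines symbolic diagrams, molecular structures, and visual data, and in particular their weakness at integrating visual and textual information.
Result: On a benchmark built from more than 20 years of U.S. National Chemistry Olympiad questions, models including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL were evaluated. Models perform poorly at modality fusion, but Chain-of-Thought prompting improves accuracy and visual grounding.
Insight: The contribution is a benchmark for multimodal chemistry reasoning and a systematic evaluation that exposes key limitations of current MLLMs in scientific reasoning. The paper points to strategies such as Chain-of-Thought prompting for building more robust and interpretable multimodal systems, and offers a timely benchmark for progress in domain-specific multimodal AI.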
Abstract: Multimodal scientific reasoning remains a significant challenge for large language models (LLMs), particularly in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Here, we systematically evaluate 40 proprietary and open-source multimodal LLMs, including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL, on a curated benchmark of Olympiad-style chemistry questions drawn from over two decades of U.S. National Chemistry Olympiad (USNCO) exams. These questions require integrated visual and textual reasoning across diverse modalities. We find that many models struggle with modality fusion, where in some cases, removing the image even improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently enhances both accuracy and visual grounding, as demonstrated through ablation studies and occlusion-based interpretability. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs, providing actionable strategies for developing more robust and interpretable multimodal systems in chemistry. This work provides a timely benchmark for measuring progress in domain-specific multimodal AI and underscores the need for further advances at the intersection of artificial intelligence and scientific reasoning.
[5] DASH: Dialogue-Aware Similarity and Handshake Recognition for Topic Segmentation in Public-Channel Conversations cs.CL | cs.IR
Sijin Sun, Liangbin Zhao, Ming Deng, Xiuju Fu
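TL;DR: This paper proposes DASH-DTS, a new LLM-based framework for dialogue topic segmentation that handles the informal language and implicit topic transitions of public-channel conversations such as maritime VHF. Its core contributions are detecting topic shifts via dialogue handshake recognition, enhancing context through similarity-guided example selection, and generating selective positive and negative samples to improve model discrimination and robustness. The paper also releases VHF-Dial, the first dataset of real-world maritime VHF communications.
Details
Motivation: To overcome the limitations of traditional methods on task-oriented public-channel dialogues (such as maritime VHF), which stem from informal language and implicit topic transitions, and so improve the accuracy and practicality of dialogue topic segmentation (DTS).
Result: On VHF-Dial and standard benchmarks, DASH-DTS achieves several state-of-the-art (SOTA) results in trusted segmentation accuracy, laying a solid foundation for stable monitoring and decision support in operational dialogues.
Insight: The contribution is combining dialogue handshake recognition with similarity-guided contextual enhancement and adding a selective sample generation strategy, which strengthens the model's ability to recognize implicit topic shifts and its robustness. Releasing the first real-world maritime VHF dataset also fills a gap in the field.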
Abstract: Dialogue Topic Segmentation (DTS) is crucial for understanding task-oriented public-channel communications, such as maritime VHF dialogues, which feature informal speech and implicit transitions. To address the limitations of traditional methods, we propose DASH-DTS, a novel LLM-based framework. Its core contributions are: (1) topic shift detection via dialogue handshake recognition; (2) contextual enhancement through similarity-guided example selection; and (3) the generation of selective positive and negative samples to improve model discrimination and robustness. Additionally, we release VHF-Dial, the first public dataset of real-world maritime VHF communications, to advance research in this domain. DASH-DTS provides interpretable reasoning and confidence scores for each segment. Experimental results demonstrate that our framework achieves several state-of-the-art (SOTA) results in trusted segmentation accuracy on both VHF-Dial and standard benchmarks, establishing a strong foundation for stable monitoring and decision support in operational dialogues.
[6] SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification cs.CL | cs.AI
Hongbo Wang, MaungMaung AprilPyone, Isao Echizen
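TL;DR: This paper proposes SGM, which protects multimodal LLMs through neuron-level intervention, acting like safety glasses for toxic neurons. It selectively recalibrates a small set of toxic expert neurons, neutralizing harmful cross-modal activations through expertise-weighted soft suppression without any parameter updates.
Details
Motivation: Multimodal LLMs inherit toxic, biased, and NSFW signals from their pretraining corpora, creating safety risks, especially under adversarial triggers that existing training-free detoxification methods struggle to handle.
Result: Experiments on open-source MLLMs show that SGM reduces toxicity under both standard and adversarial conditions, cutting the harmful rate from 48.2% to 2.5% while preserving fluency and multimodal reasoning.
Insight: The contribution is a white-box, neuron-level multimodal intervention that detoxifies by softly suppressing toxic expert neurons. It is interpretable, low-cost, and extensible, and can be combined with existing methods for stronger safety performance.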
Abstract: Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late, opaque training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted as SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.
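The abstract does not give the suppression formula, so the following is only a sketch of what "expertise-weighted soft suppression" could look like: activations of the selected neurons are scaled down in proportion to a per-neuron expertise score, and no weights are modified. The function name, the `strength` parameter, and the linear scaling rule are all assumptions made for illustration.

```python
def soft_suppress(activations, expertise, toxic_ids, strength=1.0):
    """Scale down selected neurons at inference time; model weights stay untouched."""
    out = list(activations)
    for i in toxic_ids:
        # Hypothetical rule: a higher expertise score means stronger damping.
        scale = max(0.0, 1.0 - strength * expertise[i])
        out[i] = activations[i] * scale
    return out

acts = [1.0, 2.0, 3.0]
scores = [0.0, 0.5, 0.9]   # illustrative expertise scores in [0, 1]
damped = soft_suppress(acts, scores, toxic_ids=[1])
```

Because the intervention touches activations rather than parameters, it can be switched off or retuned without retraining, which matches the training-free framing of the abstract.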
[7] Beyond Majority Voting: Towards Fine-grained and More Reliable Reward Signal for Test-Time Reinforcement Learning cs.CL
Weiqin Wang, Yile Wang, Kehao Chen, Hui Huang
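TL;DR: This paper proposes SCOPE, a framework that improves reward-signal generation in test-time reinforcement learning. It replaces plain majority voting with step-wise confidence and dynamic subgroup partitioning to mitigate confirmation bias and sparse rewards, giving large language models finer-grained and more reliable pseudo-label supervision.
Details
Motivation: Conventional test-time reinforcement learning uses majority-voting results as pseudo-labels, which tends to cause confirmation bias and sparse rewards and limits performance gains. The paper aims to fix these problems and supply better reward signals for improving LLM reasoning.
Result: Experiments across multiple models and benchmarks show that SCOPE consistently outperforms recent baselines, with relative improvements of 13.1% on the challenging AIME 2025 benchmark and 8.1% on AMC.
Insight: The main contribution is bringing model confidence (step-wise confidence weighting) and exploration diversity (dynamic subgroup partitioning with local consensus) into pseudo-label estimation. This yields finer-grained, more diverse supervision that encourages broader exploration rather than relying on output frequency alone.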
Abstract: Test-time reinforcement learning mitigates the reliance on annotated data by using majority voting results as pseudo-labels, emerging as a complementary direction to reinforcement learning with verifiable rewards (RLVR) for improving the reasoning ability of large language models (LLMs). However, this voting strategy often induces confirmation bias and suffers from sparse rewards, limiting the overall performance. In this work, we propose subgroup-specific step-wise confidence-weighted pseudo-label estimation (SCOPE), a framework integrating model confidence and dynamic subgroup partitioning to address these issues. Specifically, SCOPE integrates the proposed step-wise confidence into pseudo-label deduction, prioritizing high-quality reasoning paths over simple frequency counts. Furthermore, it dynamically partitions the pool of candidate outputs into independent subgroups by balancing reasoning quality against exploration diversity. By deriving local consensus via repeated sampling for each subgroup, SCOPE provides diverse supervision targets to encourage broader exploration. We conduct experiments across various models and benchmarks, and the results show that SCOPE consistently outperforms recent baselines. Notably, SCOPE achieves relative improvements of 13.1% on the challenging AIME 2025 benchmark and 8.1% on AMC. The code is released at https://github.com/szu-tera/SCOPE.
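The difference between frequency-based and confidence-weighted pseudo-labels can be seen in a small sketch. It assumes each sampled answer carries a list of per-step confidences and scores an answer by the sum of the mean step confidence of its supporting samples. That aggregation rule is an assumption, and the sketch leaves out SCOPE's subgroup partitioning entirely.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Baseline pseudo-label: the most frequent answer."""
    return Counter(answers).most_common(1)[0][0]

def confidence_weighted_vote(answers, step_confidences):
    """Each sample votes with its mean step-wise confidence (illustrative rule)."""
    scores = defaultdict(float)
    for answer, steps in zip(answers, step_confidences):
        scores[answer] += sum(steps) / len(steps)
    return max(scores, key=scores.get)

answers = ["12", "12", "15"]
confidences = [[0.3, 0.2], [0.4, 0.2], [0.9, 0.95]]  # made-up values
by_count = majority_vote(answers)
by_confidence = confidence_weighted_vote(answers, confidences)
```

Here two low-confidence samples outvote one high-confidence sample under majority voting, while the weighted rule picks the high-confidence answer.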
[8] MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers cs.CL | cs.AI
Xuanjun Zong, Zhiqi Shen, Lei Wang, Yunshi Lan, Chao Yang
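TL;DR: This paper presents MCP-SafetyBench, a comprehensive benchmark for evaluating the safety of large language models in real-world MCP server environments. Built on real MCP servers, it supports realistic multi-turn evaluation across five domains (browser automation, financial analysis, location navigation, repository management, and web search) and includes a unified taxonomy of 20 MCP attack types.
Details
Motivation: As LLMs evolve into agentic systems that reason, plan, and operate external tools, the Model Context Protocol (MCP) serves as a standardized interface connecting LLMs with heterogeneous tools and services. Its openness and multi-server workflows introduce new safety risks that existing benchmarks miss, because they focus on isolated attacks or lack real-world coverage.
Result: A systematic evaluation of leading open- and closed-source LLMs with MCP-SafetyBench reveals large disparities in safety performance, with vulnerabilities escalating as task horizons and server interactions grow. The benchmark provides a foundation for diagnosing and mitigating safety risks in real-world MCP deployments.
Insight: The contribution is a comprehensive safety benchmark built on real MCP servers that supports multi-domain, multi-turn evaluation and complex attack scenarios, including multi-step reasoning and cross-server coordination. It closes gaps in real-world coverage and complex-workflow safety evaluation left by existing benchmarks and gives safety research on LLM agent systems an evaluation tool closer to actual deployment.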
Abstract: Large language models (LLMs) are evolving into agentic systems that reason, plan, and operate external tools. The Model Context Protocol (MCP) is a key enabler of this transition, offering a standardized interface for connecting LLMs with heterogeneous tools and services. Yet MCP’s openness and multi-server workflows introduce new safety risks that existing benchmarks fail to capture, as they focus on isolated attacks or lack real-world coverage. We present MCP-SafetyBench, a comprehensive benchmark built on real MCP servers that supports realistic multi-turn evaluation across five domains: browser automation, financial analysis, location navigation, repository management, and web search. It incorporates a unified taxonomy of 20 MCP attack types spanning server, host, and user sides, and includes tasks requiring multi-step reasoning and cross-server coordination under uncertainty. Using MCP-SafetyBench, we systematically evaluate leading open- and closed-source LLMs, revealing large disparities in safety performance and escalating vulnerabilities as task horizons and server interactions grow. Our results highlight the urgent need for stronger defenses and establish MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world MCP deployments.
[9] RFKG-CoT: Relation-Driven Adaptive Hop-count Selection and Few-Shot Path Guidance for Knowledge-Aware QA cs.CL | cs.AI
Chao Zhang, Minghan Li, Tianrui Lv, Guodong Zhou
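TL;DR: This paper proposes RFKG-CoT to reduce the hallucinations LLMs produce in knowledge-intensive QA because of the limits of their parametric knowledge. It introduces a relation-driven adaptive hop-count selector and a few-shot, Chain-of-Thought path guidance mechanism, which dynamically adjust the number of knowledge graph reasoning steps and strengthen the model's understanding of reasoning paths, improving answer accuracy and reliability.
Details
Motivation: Existing methods such as KG-CoT improve reliability by integrating knowledge graph paths, but their hop-count selection is rigid (driven by the question alone) and they underuse reasoning paths (no guidance). A more flexible, adaptive strategy for integrating knowledge graphs is needed.
Result: Experiments on four KGQA benchmarks show that RFKG-CoT improves accuracy over KG-CoT by up to 14.7 percentage points (Llama2-7B on WebQSP). Ablations confirm that the hop-count selector and the path prompt are complementary.
Insight: The contributions are the relation-driven adaptive hop-count selector, which adjusts reasoning steps dynamically via a relation mask, and the few-shot in-context path guidance mechanism, which builds examples in a "question-paths-answer" format. Together they turn knowledge graph evidence into more faithful answers, and the design may carry over to other knowledge-augmented reasoning tasks.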
Abstract: Large language models (LLMs) often generate hallucinations in knowledge-intensive QA due to parametric knowledge limitations. While existing methods like KG-CoT improve reliability by integrating knowledge graph (KG) paths, they suffer from rigid hop-count selection (solely question-driven) and underutilization of reasoning paths (lack of guidance). To address this, we propose RFKG-CoT: First, it replaces the rigid hop-count selector with a relation-driven adaptive hop-count selector that dynamically adjusts reasoning steps by activating KG relations (e.g., 1-hop for direct “brother” relations, 2-hop for indirect “father-son” chains), formalized via a relation mask. Second, it introduces a few-shot in-context learning path guidance mechanism with CoT (think) that constructs examples in a “question-paths-answer” format to enhance LLMs’ ability to understand reasoning paths. Experiments on four KGQA benchmarks show RFKG-CoT improves accuracy by up to 14.7 pp (Llama2-7B on WebQSP) over KG-CoT. Ablations confirm the hop-count selector and the path prompt are complementary, jointly transforming KG evidence into more faithful answers.
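To make the "question-paths-answer" format concrete, here is a sketch of a few-shot prompt builder. The template wording, the function names, and the placeholder entities are hypothetical; the abstract names only the three-part format, not the exact prompt text.

```python
def format_example(question, paths, answer=None):
    """Render one block: the question, its KG reasoning paths, then the answer slot."""
    lines = [f"Question: {question}", "Paths:"]
    lines += [f"- {' -> '.join(path)}" for path in paths]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_prompt(shots, question, paths):
    """Few-shot demonstrations followed by the unanswered target question."""
    blocks = [format_example(q, p, a) for q, p, a in shots]
    blocks.append(format_example(question, paths))
    return "\n\n".join(blocks)

# Placeholder entities, for illustration only.
shots = [("Who is Alice's brother?", [["Alice", "brother", "Bob"]], "Bob")]
prompt = build_prompt(
    shots,
    "Who is Carol's grandfather?",
    [["Carol", "father", "Dan", "father", "Ed"]],
)
```

The one-hop demonstration and the two-hop target path mirror the abstract's contrast between direct relations and indirect chains.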
[10] The Moralization Corpus: Frame-Based Annotation and Analysis of Moralizing Speech Acts across Diverse Text Genres cs.CL
Maria Becker, Mirko Sommer, Lars Tapken, Yi Wan Teh, Bruno Brocai
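TL;DR: The paper presents the Moralization Corpus, a novel multi-genre German dataset for analyzing how moral values are used strategically in argumentative discourse. It develops a frame-based annotation scheme that captures the constitutive elements of moralizations and evaluates several large language models on moralization detection and component extraction.
Details
Motivation: Moralizing speech is an underexplored form of persuasive communication; it is pragmatically complex and often implicit, which challenges both human annotators and NLP systems. The paper builds an annotated dataset and analysis framework to support interdisciplinary research on moral discourse and moral reasoning.
Result: The paper evaluates several LLMs under varied prompting conditions on moralization detection and component extraction. Detailed prompt instructions help more than few-shot or explanation-based prompting, and moralization remains a highly subjective, context-sensitive task.
Insight: The contribution is a frame-based annotation scheme that decomposes complex moralizing speech, together with a multi-genre German corpus. Viewed objectively, the study provides new resources and benchmarks for computational analysis of moral discourse and exposes the limits of current LLMs on highly contextual, subjective tasks.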
Abstract: Moralizations - arguments that invoke moral values to justify demands or positions - are an as yet underexplored form of persuasive communication. We present the Moralization Corpus, a novel multi-genre dataset designed to analyze how moral values are strategically used in argumentative discourse. Moralizations are pragmatically complex and often implicit, posing significant challenges for both human annotators and NLP systems. We develop a frame-based annotation scheme that captures the constitutive elements of moralizations - moral values, demands, and discourse protagonists - and apply it to a diverse set of German texts, including political debates, news articles, and online discussions. The corpus enables fine-grained analysis of moralizing language across communicative formats and domains. We further evaluate several large language models (LLMs) under varied prompting conditions on moralization detection and moralization component extraction, and compare their outputs to human annotations in order to investigate the challenges of automatic and manual analysis of moralizations. Results show that detailed prompt instructions have a greater effect than few-shot or explanation-based prompting, and that moralization remains a highly subjective and context-sensitive task. We release all data, annotation guidelines, and code to foster future interdisciplinary research on moral discourse and moral reasoning in NLP.
[11] Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning cs.CL | cs.AI
Yiliu Sun, Zicheng Zhao, Yang Wei, Yanfang Zhang, Chen Gong
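TL;DR: This paper proposes Progressive Prefix-token Policy Optimization (PPPO), a new method for strengthening LLM reasoning. Drawing on Path Dependence theory, it identifies a "Beginning Lock-in Effect" in LLM reasoning and improves training efficiency by concentrating optimization on the prefix segment of the output. PPPO introduces two training strategies, Progressive Prefix Retention and Continuation Accumulated Reward. Experiments show it outperforms existing Reinforcement Learning with Verifiable Rewards (RLVR) methods on various reasoning tasks, with an 18.02% accuracy improvement using only 26.17% of the training tokens.
Details
Motivation: Current RLVR methods typically train on all generated tokens without asking which tokens (for example, prefix tokens) actually contribute to reasoning. This uniform strategy spends effort optimizing low-return tokens, holds back potential gains from high-return tokens, and reduces overall training effectiveness.
Result: Extensive experiments on various reasoning tasks show that PPPO outperforms representative RLVR methods, with an 18.02% accuracy improvement using only 26.17% of the training tokens.
Insight: The contribution is identifying the "Beginning Lock-in Effect" in LLM reasoning and proposing PPPO, which focuses optimization on the prefix. Progressive Prefix Retention and Continuation Accumulated Reward improve training efficiency and model performance and open a new direction for RLVR training.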
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically conduct training across all generated tokens, but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established human thinking theory of Path Dependence, where early-stage thoughts substantially constrain the subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning termed the Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix reasoning process of LLMs. This targeted optimization strategy can positively influence subsequent reasoning processes, and ultimately improve final results. To help LLMs learn how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for one prefix token sequence, and accumulating their scores as the reward signal. Extensive experimental results on various reasoning tasks demonstrate that our proposed PPPO outperforms representative RLVR methods, with accuracy improvements of 18.02% using only 26.17% of the training tokens.
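The two training strategies can be sketched as follows. The linear schedule, its `start` and `end` fractions, the number of continuations, and the use of a plain sum are assumptions; the abstract states only that the retained prefix proportion increases during training and that scores of multiple continuations are accumulated.

```python
def retained_prefix_len(total_len, step, total_steps, start=0.1, end=0.5):
    """Progressive Prefix Retention: keep a growing share of prefix tokens."""
    # Hypothetical linear schedule from `start` to `end` over training.
    frac = start + (end - start) * step / max(1, total_steps - 1)
    return max(1, int(total_len * frac))

def continuation_accumulated_reward(prefix, sample_continuation, score, n=4):
    """Sample n continuations of one prefix and accumulate their scores."""
    return sum(score(prefix + sample_continuation(prefix)) for _ in range(n))

# Stub sampler and verifier standing in for the policy model and the reward check.
canned = iter([" = 4", " = 5", " = 4", " = 4"])
reward = continuation_accumulated_reward(
    "2 + 2",
    sample_continuation=lambda prefix: next(canned),
    score=lambda text: 1.0 if text.endswith("4") else 0.0,
)
lengths = [retained_prefix_len(100, s, 5) for s in range(5)]
```

Accumulating over several continuations means a prefix is credited for how often it leads to a correct answer, not for one lucky or unlucky sample.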
[12] Evaluating LLMs for Zeolite Synthesis Event Extraction (ZSEE): A Systematic Analysis of Prompting Strategies cs.CL | cs.AI
Charan Prakash Rathore, Saumi Ray, Dhruv Kumar
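TL;DR: This paper systematically evaluates LLMs on zeolite synthesis event extraction. Comparing four prompting strategies across six state-of-the-art LLMs, it finds strong performance on event type classification (80-90% F1) but only modest performance on fine-grained argument extraction (50-65% F1), with advanced prompting strategies giving limited gains over zero-shot prompting.
Details
Motivation: To determine how effectively LLMs can extract structured information from zeolite synthesis experimental procedures, a domain-specific task, and to systematically evaluate different prompting strategies for this kind of scientific information extraction.
Result: On the ZSEE dataset (1,530 annotated sentences), event type classification reaches 80-90% F1, while argument role and argument text extraction reach only 50-65% F1. GPT-5-mini is extremely prompt-sensitive (F1 varying from 11% to 79%), and advanced prompting strategies improve little over zero-shot.
Insight: The study points to fundamental architectural limitations of LLMs in scientific information extraction: they reach high-level understanding, but extracting experimental parameters precisely requires domain-adapted models. It also quantifies the marginal benefit of prompting strategies, providing benchmarks for scientific text processing.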
Abstract: Extracting structured information from zeolite synthesis experimental procedures is critical for materials discovery, yet existing methods have not systematically evaluated Large Language Models (LLMs) for this domain-specific task. This work addresses a fundamental question: what is the efficacy of different prompting strategies when applying LLMs to scientific information extraction? We focus on four key subtasks: event type classification (identifying synthesis steps), trigger text identification (locating event mentions), argument role extraction (recognizing parameter types), and argument text extraction (extracting parameter values). We evaluate four prompting strategies - zero-shot, few-shot, event-specific, and reflection-based - across six state-of-the-art LLMs (Gemma-3-12b-it, GPT-5-mini, O4-mini, Claude-Haiku-3.5, DeepSeek reasoning and non-reasoning) using the ZSEE dataset of 1,530 annotated sentences. Results demonstrate strong performance on event type classification (80-90% F1) but modest performance on fine-grained extraction tasks, particularly argument role and argument text extraction (50-65% F1). GPT-5-mini exhibits extreme prompt sensitivity with 11-79% F1 variation. Notably, advanced prompting strategies provide minimal improvements over zero-shot approaches, revealing fundamental architectural limitations. Error analysis identifies systematic hallucination, over-generalization, and inability to capture synthesis-specific nuances. Our findings demonstrate that while LLMs achieve high-level understanding, precise extraction of experimental parameters requires domain-adapted models, providing quantitative benchmarks for scientific information extraction.
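The F1 figures above combine precision and recall over extracted items. A minimal sketch for one sentence follows, assuming exact matching on (role, text) pairs. The matching criterion and the sample arguments are illustrative assumptions, not the paper's evaluation protocol.

```python
def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over exact-match items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Made-up (role, text) arguments for a single synthesis step.
gold = [("temperature", "180 C"), ("duration", "48 h"), ("material", "NaOH")]
pred = [("temperature", "180 C"), ("duration", "2 days")]
score = f1_score(pred, gold)
```

Under exact matching, the paraphrased duration counts as both a false positive and a false negative, one reason fine-grained argument text extraction tends to score lower than event type classification.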
[13] Dual-Density Inference for Efficient Language Model Reasoning cs.CL
Zhengyi Zhao, Shubo Zhang, Yuxi Zhang, Huimin Wang, Binyang Li
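TL;DR: This paper proposes the Denser framework, which markedly improves the computational efficiency of LLMs on complex reasoning tasks by using high-density compressed language in the reasoning phase and human-readable language in the answer generation phase.
Details
Motivation: Current LLMs use the same language density for intermediate reasoning and final answers, which is computationally inefficient. Reasoning is essentially a computational function internal to the model, whereas answer generation has to serve human understanding.
Result: Across multiple reasoning QA benchmarks, Denser cuts token consumption by up to 62% compared with standard Chain-of-Thought methods while preserving or improving accuracy, with the largest efficiency gains on complex multi-step reasoning problems.
Insight: The contribution is a dual-density inference framework that separates information-density optimization into computation-oriented compressed reasoning and communication-oriented readable answer generation, using a modular design to balance efficiency and interpretability.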
Abstract: Large Language Models (LLMs) have shown impressive capabilities in complex reasoning tasks. However, current approaches employ uniform language density for both intermediate reasoning and final answers, leading to computational inefficiency. We observe that the reasoning process serves a computational function for the model itself, while answering serves a communicative function for human understanding. This distinction enables the use of compressed, symbol-rich language for intermediate computations while maintaining human-readable final explanations. To address this inefficiency, we present Denser (Dual-density inference), a novel framework that optimizes information density separately for the reasoning and answering phases. Our framework implements this through three components: a query processing module that analyzes input problems, a high-density compressed reasoning mechanism for efficient intermediate computations, and an answer generation component that translates compressed reasoning into human-readable solutions. Experimental evaluation across multiple reasoning question answering benchmarks demonstrates that Denser reduces token consumption by up to 62% compared to standard Chain-of-Thought methods while preserving or improving accuracy. These efficiency gains are particularly significant for complex multi-step reasoning problems where traditional methods generate extensive explanations.
[14] How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness cs.CL | cs.AI | cs.LG
Darshita Rathore, Vineet Kumar, Chetna Bansal, Anindya Moitra
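TL;DR: This paper systematically evaluates how the choice of rank in LoRA (Low-Rank Adaptation) affects knowledge retention and domain robustness when fine-tuning LLMs. A rank sweep over multiple reasoning and recall datasets quantifies the trade-off between LoRA and full supervised fine-tuning (SFT) and compares their accuracy under in-domain and out-of-domain adaptation.
Details
Motivation: To examine how the configuration of parameter-efficient fine-tuning methods such as LoRA, especially the rank, affects downstream QA performance and generalization, a question that has not been fully explored.
Result: Experiments show that at specific rank values, particularly on reasoning tasks, LoRA matches or even exceeds SFT. The study also finds that LoRA and SFT show different generalization behavior and task-specific forgetting under domain adaptation.
Insight: The contribution is a rank sweep that systematically quantifies the LoRA versus SFT trade-off, together with an analysis of internal representations through spectral features and layer-wise attention structures, which sheds light on representational drift and structural changes in attention patterns.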
Abstract: Large language models are increasingly adapted to downstream tasks through fine-tuning. Full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), are two dominant approaches. While PEFT methods are widely used for their computational efficiency, the implications of their configurations (e.g., rank) remain under-explored in downstream Q&A tasks and generalisation. In this work, we perform a comprehensive evaluation across multiple reasoning and recall datasets, conducting a rank sweep to quantify the trade-off between SFT and PEFT. We also compare the accuracy of PEFT and SFT models across in-domain and out-of-domain adaptation, highlighting distinct generalisation behaviour and task-specific forgetting. We demonstrate that LoRA achieves competitive and in some cases superior performance compared to SFT, particularly on reasoning tasks at specific rank values. Additionally, we analyze the internal representations via spectral features and layer-wise attention structures, offering insights into representational drift and structural changes in attention patterns.
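The cost side of a rank sweep is easy to work out: a rank-r adapter on a d_out x d_in weight matrix trains r * (d_in + d_out) parameters instead of d_in * d_out. The sketch below computes that fraction for a few ranks. The 4096 x 4096 layer size and the listed ranks are hypothetical and are not the configuration used in the paper.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

def rank_sweep(d_in, d_out, ranks):
    """Fraction of a full weight matrix's parameters that each rank trains."""
    full = d_in * d_out
    return {r: lora_trainable_params(d_in, d_out, r) / full for r in ranks}

# Hypothetical square projection layer and ranks, for illustration only.
fractions = rank_sweep(4096, 4096, [4, 16, 64, 256])
```

For this layer, the trainable fraction grows linearly from about 0.2% at rank 4 to 12.5% at rank 256, which is the budget being traded against accuracy and forgetting.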
cs.CV
[15] SocialNav-MoE: A Mixture-of-Experts Vision Language Model for Socially Compliant Navigation with Reinforcement Fine-Tuning cs.CV | cs.RO
Tomohito Kawabata, Xinyu Zhang, Ling Xiao
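TL;DR: This paper proposes SocialNav-MoE, an efficient vision language model based on a Mixture-of-Experts (MoE) architecture for socially compliant robot navigation in human-populated environments. It improves decision-making through reinforcement fine-tuning (RFT) with a semantic similarity reward (SSR), and targets the high computational overhead and latency of large VLMs to enable real-time deployment on resource-constrained platforms.
Details
Motivation: Prior work on robot navigation has mostly emphasized safety and neglected social compliance (human comfort, social norms, and contextual appropriateness). Large vision language models (VLMs) are also computationally expensive, which makes them hard to run in real time on resource-constrained robotic platforms.
Result: Experiments on the SNEI dataset show that SocialNav-MoE achieves an excellent balance between navigation accuracy and efficiency. The proposed SSR is more effective than hard-level and character-level rewards.
Insight: The contributions are: 1) an MoE architecture for an efficient small VLM with lower computational overhead; 2) RFT with a dedicated SSR to strengthen socially compliant decision-making; 3) a systematic study of different small language models (Phi, Qwen, StableLM), routing strategies, and vision encoders (CLIP vs. SigLIP, frozen vs. fine-tuned).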
Abstract: For robots navigating in human-populated environments, safety and social compliance are equally critical, yet prior work has mostly emphasized safety. Socially compliant navigation that accounts for human comfort, social norms, and contextual appropriateness remains underexplored. Vision language models (VLMs) show promise for this task; however, large-scale models incur substantial computational overhead, leading to higher inference latency and energy consumption, which makes them unsuitable for real-time deployment on resource-constrained robotic platforms. To address this issue, we investigate the effectiveness of small VLMs and propose SocialNav-MoE, an efficient Mixture-of-Experts vision language model for socially compliant navigation with reinforcement fine-tuning (RFT). We further introduce a semantic similarity reward (SSR) to effectively leverage RFT for enhancing decision-making capabilities. Additionally, we study the effectiveness of different small language model types (Phi, Qwen, and StableLM), routing strategies, and vision encoders (CLIP vs. SigLIP, frozen vs. fine-tuned). Experiments on the SNEI dataset demonstrate that SocialNav-MoE achieves an excellent balance between navigation accuracy and efficiency. The proposed SSR function is more effective than hard-level and character-level rewards. Source code will be released upon acceptance.
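The contrast between a hard reward and a semantic similarity reward can be illustrated with cosine similarity between text embeddings. In the sketch below a bag-of-words counter stands in for a real sentence encoder. The stand-in embedding, the function names, and the example sentences are assumptions, and the paper's actual SSR may be defined differently.

```python
import math
from collections import Counter

def hard_reward(pred, ref):
    """1.0 only when the two strings match exactly (ignoring case and edge spaces)."""
    return 1.0 if pred.strip().lower() == ref.strip().lower() else 0.0

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def semantic_reward(pred, ref, embed=lambda s: Counter(s.lower().split())):
    """Graded reward; the default `embed` is a toy stand-in for a sentence encoder."""
    return cosine(embed(pred), embed(ref))

pred = "slow down and yield to the pedestrian"
ref = "yield to the pedestrian and slow down"
hard = hard_reward(pred, ref)
soft = semantic_reward(pred, ref)
```

A graded reward gives the policy credit for an equivalent instruction phrased differently, which a hard match would score as zero.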
[16] The Renaissance of Expert Systems: Optical Recognition of Printed Chinese Jianpu Musical Scores with Lyrics cs.CV
Fan Bu, Rongfeng Li, Zijin Li, Ya Li, Linfeng Fan
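TL;DR: This paper proposes a modular expert-system pipeline that converts printed Jianpu (numbered notation) scores with lyrics into machine-readable MusicXML and MIDI. The method combines traditional computer-vision techniques with unsupervised deep-learning modules, needs no large annotated training set, and digitizes The Anthology of Chinese Folk Songs at scale with high accuracy on both melody and lyrics.
Details
Motivation: Large-scale optical music recognition research has focused mainly on Western staff notation, leaving Chinese Jianpu and its rich lyric resources underexplored. The paper addresses automatic recognition and digitization of printed Jianpu scores with lyrics to fill this gap.
Result: Evaluated on The Anthology of Chinese Folk Songs, the system digitizes a melody-only collection of more than 5,000 songs (> 300,000 notes) and a curated subset with lyrics of over 1,400 songs (> 100,000 notes). It achieves high accuracy on melody (note-wise F1 = 0.951) and aligned lyrics (character-wise F1 = 0.931).
Insight: The stated contribution is a top-down expert-system design that uses traditional computer-vision techniques (e.g., phrase correlation, skeleton analysis) to exploit prior knowledge and integrates unsupervised deep-learning modules for image feature embeddings, balancing interpretability and accuracy. Viewed objectively, the method offers an efficient hybrid solution for OMR in low-resource domains such as Jianpu that does not need large amounts of annotated data.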
Abstract: Large-scale optical music recognition (OMR) research has focused mainly on Western staff notation, leaving Chinese Jianpu (numbered notation) and its rich lyric resources underexplored. We present a modular expert-system pipeline that converts printed Jianpu scores with lyrics into machine-readable MusicXML and MIDI, without requiring massive annotated training data. Our approach adopts a top-down expert-system design, leveraging traditional computer-vision techniques (e.g., phrase correlation, skeleton analysis) to capitalize on prior knowledge, while integrating unsupervised deep-learning modules for image feature embeddings. This hybrid strategy strikes a balance between interpretability and accuracy. Evaluated on The Anthology of Chinese Folk Songs, our system massively digitizes (i) a melody-only collection of more than 5,000 songs (> 300,000 notes) and (ii) a curated subset with lyrics comprising over 1,400 songs (> 100,000 notes). The system achieves high-precision recognition on both melody (note-wise F1 = 0.951) and aligned lyrics (character-wise F1 = 0.931).
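As background on the target representation, Jianpu writes the degrees of a major scale as the digits 1 to 7, uses 0 for a rest, and marks octave shifts with dots above or below the digit. The sketch below maps a recognized symbol to a MIDI note number. It illustrates the notation only and is not the paper's recognition pipeline; the function name and defaults are hypothetical.

```python
MAJOR_SCALE_SEMITONES = [0, 2, 4, 5, 7, 9, 11]  # degrees 1..7 above the tonic

def jianpu_to_midi(degree, octave_shift=0, tonic_midi=60):
    """Map a Jianpu digit to a MIDI note number; 0 is a rest and returns None."""
    if degree == 0:
        return None
    if not 1 <= degree <= 7:
        raise ValueError("Jianpu degrees run from 0 (rest) to 7")
    # octave_shift counts dots: +1 per dot above the digit, -1 per dot below.
    return tonic_midi + MAJOR_SCALE_SEMITONES[degree - 1] + 12 * octave_shift

# With the tonic at middle C (MIDI 60): 1 5 rest, then a high 1.
melody = [jianpu_to_midi(1), jianpu_to_midi(5), jianpu_to_midi(0), jianpu_to_midi(1, 1)]
```

Changing `tonic_midi` transposes the whole melody, which reflects how Jianpu encodes pitch relative to a declared key rather than as absolute staff positions.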
[17] AquaDiff: Diffusion-Based Underwater Image Enhancement for Addressing Color Distortion cs.CV
Afrah Shaahid, Muzammil Behzad
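TL;DR: This paper proposes AquaDiff, a diffusion-based underwater image enhancement framework that addresses the color distortion, low contrast, and loss of detail caused by wavelength-dependent light absorption and scattering. It combines a chromatic prior-guided color compensation strategy with a conditional diffusion process that dynamically fuses the degraded input and the noisy latent state at each denoising step, and uses an enhanced denoising backbone to capture global color context and local detail. A novel cross-domain consistency loss jointly optimizes pixel-level accuracy, perceptual similarity, structural integrity, and frequency-domain fidelity.
Details
Motivation: Underwater images are severely degraded by light absorption and scattering, resulting in color distortion, low contrast, and loss of detail that hinder vision-based underwater applications. Existing methods fall short in color correction and fidelity, so an enhancement method is needed that corrects color distortion while preserving structural and perceptual fidelity.
Result: Extensive experiments on multiple challenging underwater benchmarks show that AquaDiff compares well with state-of-the-art traditional, CNN-, GAN-, and diffusion-based methods, achieving superior color correction and competitive overall image quality across diverse underwater conditions.
Insight: The contributions are: 1) combining a chromatic prior-guided color compensation strategy with a conditional diffusion process, where cross-attention dynamically fuses the input and noisy states; 2) an enhanced denoising backbone (residual dense blocks and multi-resolution attention) that captures global and local features; 3) a cross-domain consistency loss that jointly optimizes several image quality criteria. Viewed objectively, combining a diffusion model with prior knowledge addresses color distortion in underwater image enhancement while preserving structural and perceptual fidelity.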
Abstract: Underwater images are severely degraded by wavelength-dependent light absorption and scattering, resulting in color distortion, low contrast, and loss of fine details that hinder vision-based underwater applications. To address these challenges, we propose AquaDiff, a diffusion-based underwater image enhancement framework designed to correct chromatic distortions while preserving structural and perceptual fidelity. AquaDiff integrates a chromatic prior-guided color compensation strategy with a conditional diffusion process, where cross-attention dynamically fuses degraded inputs and noisy latent states at each denoising step. An enhanced denoising backbone with residual dense blocks and multi-resolution attention captures both global color context and local details. Furthermore, a novel cross-domain consistency loss jointly enforces pixel-level accuracy, perceptual similarity, structural integrity, and frequency-domain fidelity. Extensive experiments on multiple challenging underwater benchmarks demonstrate that AquaDiff provides good results as compared to the state-of-the-art traditional, CNN-, GAN-, and diffusion-based methods, achieving superior color correction and competitive overall image quality across diverse underwater conditions.
[18] Improving VQA Reliability: A Dual-Assessment Approach with Self-Reflection and Cross-Model Verification cs.CV | cs.AIPDF
Xixian Wu, Yang Ou, Pengchao Tian, Zian Yang, Jielei Zhang
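TL;DR: This paper proposes DAVR, a dual-assessment framework that aims to improve the reliability of vision-language models (VLMs) in visual question answering (VQA). The framework integrates two mechanisms, self-reflection and cross-model verification, to estimate the uncertainty of model answers comprehensively, reducing hallucinations and increasing answer trustworthiness.
Details
Motivation: VLMs perform well on VQA but are prone to hallucination, which leads to confident yet incorrect answers and severely undermines answer reliability.
Result: In the Reliable VQA Challenge at ICCV-CLVL 2025, DAVR achieved a leading Φ₁₀₀ score of 39.64 and a 100-AUC of 97.22, taking first place and demonstrating its effectiveness in improving the trustworthiness of VLM responses.
Insight: The main novelty is a dual-pathway architecture that combines an internal assessment based on feature fusion (self-reflection) with external factual cross-checking (cross-model verification), giving a comprehensive uncertainty-estimation method for assessing VLM reliability.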
Abstract: Vision-language models (VLMs) have demonstrated significant potential in Visual Question Answering (VQA). However, the susceptibility of VLMs to hallucinations can lead to overconfident yet incorrect answers, severely undermining answer reliability. To address this, we propose Dual-Assessment for VLM Reliability (DAVR), a novel framework that integrates Self-Reflection and Cross-Model Verification for comprehensive uncertainty estimation. The DAVR framework features a dual-pathway architecture: one pathway leverages dual selector modules to assess response reliability by fusing VLM latent features with QA embeddings, while the other deploys external reference models for factual cross-checking to mitigate hallucinations. Evaluated in the Reliable VQA Challenge at ICCV-CLVL 2025, DAVR achieves a leading $\Phi_{100}$ score of 39.64 and a 100-AUC of 97.22, securing first place and demonstrating its effectiveness in enhancing the trustworthiness of VLM responses.
[19] HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering cs.CV | eess.IVPDF
Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin
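TL;DR: This paper presents HERBench, a new benchmark built specifically to evaluate multi-evidence integration across time in video question answering. It contains 26K five-way multiple-choice questions, each of which requires integrating at least three non-overlapping pieces of evidence from distinct video segments, addressing the over-reliance of existing benchmarks on single-cue reasoning.
Details
Motivation: Existing VideoQA benchmarks often allow questions to be answered from a single salient cue and under-test reasoning that must aggregate multiple, temporally separated pieces of visual evidence, so a benchmark dedicated to multi-evidence integration across time is needed.
Result: Evaluating 13 state-of-the-art Video-LLMs on HERBench yields accuracies of only 31-42%, slightly above the 20% random-guess baseline, revealing pervasive failures in multi-evidence integration. The benchmark's mean Minimum Required Frame-Set (MRFS) is 5.5, substantially higher than that of prior datasets (2.6-4.2).
Insight: The novelty lies in introducing the Minimum Required Frame-Set as a metric that quantifies evidential demand, and in decomposing model failure into two key bottlenecks, a retrieval deficit and a fusion deficit, which gives a principled target for advancing robust, compositional video understanding.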
Abstract: Video Large Language Models (Video-LLMs) are rapidly improving, yet current Video Question Answering (VideoQA) benchmarks often allow questions to be answered from a single salient cue, under-testing reasoning that must aggregate multiple, temporally separated visual evidence. We present HERBench, a VideoQA benchmark purpose-built to assess multi-evidence integration across time. Each question requires aggregating at least three non-overlapping evidential cues across distinct video segments, so neither language priors nor a single snapshot can suffice. HERBench comprises 26K five-way multiple-choice questions organized into twelve compositional tasks that probe identity binding, cross-entity relations, temporal ordering, co-occurrence verification, and counting. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes substantially higher demand than prior datasets (mean MRFS 5.5 vs. 2.6-4.2). Evaluating 13 state-of-the-art Video-LLMs on HERBench reveals pervasive failures: accuracies of 31-42% are only slightly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. By making cross-time evidence both unavoidable and quantifiable, HERBench establishes a principled target for advancing robust, compositional video understanding.
[20] Isolated Sign Language Recognition with Segmentation and Pose Estimation cs.CVPDF
Daniel Perkins, Davis Hunter, Dhrumil Patel, Galen Flanagan
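TL;DR: This paper proposes a model for isolated sign language recognition (ISLR) that combines pose estimation, a segmentation module, and a ResNet-Transformer backbone to reduce computational requirements while staying robust to signer variation.
Details
Motivation: American Sign Language (ASL) users benefit little from advances in large language model translation because their language relies on complex visual cues. ISLR could help, but it is currently limited by scarce per-sign data, high signer variability, and substantial computational cost.
Result: The abstract reports no specific quantitative results or benchmarks; it claims only that the model reduces computational requirements while maintaining robustness to signer variation.
Insight: The contributions are a pose estimation pipeline that extracts hand and face joint coordinates, a segmentation module that isolates relevant information, and a ResNet-Transformer backbone that jointly models spatial and temporal dependencies. The design, which pairs lightweight preprocessing with an efficient backbone, may carry over to other vision tasks.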
Abstract: The recent surge in large language models has automated translations of spoken and written languages. However, these advances remain largely inaccessible to American Sign Language (ASL) users, whose language relies on complex visual cues. Isolated sign language recognition (ISLR) - the task of classifying videos of individual signs - can help bridge this gap but is currently limited by scarce per-sign data, high signer variability, and substantial computational costs. We propose a model for ISLR that reduces computational requirements while maintaining robustness to signer variation. Our approach integrates (i) a pose estimation pipeline to extract hand and face joint coordinates, (ii) a segmentation module that isolates relevant information, and (iii) a ResNet-Transformer backbone to jointly model spatial and temporal dependencies.
[21] Visual-textual Dermatoglyphic Animal Biometrics: A First Case Study on Panthera tigris cs.CVPDF
Wenshuo Li, Majid Mirmehdi, Tilo Burghardt
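TL;DR: This paper proposes an animal re-identification (Re-ID) method that combines visuals with textual dermatoglyphic descriptors, applied for the first time to tigers (Panthera tigris). Using a dataset of 3,355 images of 185 tigers with 84,264 manually labelled minutiae, and a text-image co-synthesis pipeline that generates virtual individuals, the method significantly boosts AI accuracy in cross-modal identity retrieval while alleviating data scarcity.
Details
Motivation: Conventional animal Re-ID relies mainly on visual features, which has limitations. The paper brings the precise dermatoglyphic textual descriptors used in forensics into ecology, using human-interpretable language tags to abstract and encode animal coat topology and so overcome the limits of vision-only methods.
Result: In benchmarks against real-world scenarios, the augmentation significantly boosts AI accuracy in cross-modal retrieval, enabling text-to-visual identity recovery and supporting human-verifiable matching.
Insight: The novelty lies in combining dermatoglyphic textual descriptors with visual data in a visual-textual cross-modal Re-ID framework, and in augmenting data with virtual individuals generated through text-image co-synthesis. This advances a language-driven unification of descriptive modalities in ecological monitoring and improves the explainability of Re-ID.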
Abstract: Biologists have long combined visuals with textual field notes to re-identify (Re-ID) animals. Contemporary AI tools automate this for species with distinctive morphological features but remain largely image-based. Here, we extend Re-ID methodologies by incorporating precise dermatoglyphic textual descriptors-an approach used in forensics but new to ecology. We demonstrate that these specialist semantics abstract and encode animal coat topology using human-interpretable language tags. Drawing on 84,264 manually labelled minutiae across 3,355 images of 185 tigers (Panthera tigris), we evaluate this visual-textual methodology, revealing novel capabilities for cross-modal identity retrieval. To optimise performance, we developed a text-image co-synthesis pipeline to generate ‘virtual individuals’, each comprising dozens of life-like visuals paired with dermatoglyphic text. Benchmarking against real-world scenarios shows this augmentation significantly boosts AI accuracy in cross-modal retrieval while alleviating data scarcity. We conclude that dermatoglyphic language-guided biometrics can overcome vision-only limitations, enabling textual-to-visual identity recovery underpinned by human-verifiable matchings. This represents a significant advance towards explainability in Re-ID and a language-driven unification of descriptive modalities in ecological monitoring.
[22] Vibe Spaces for Creatively Connecting and Expressing Visual Concepts cs.CVPDF
Huzheng Yang, Katherine Xu, Andrew Lu, Michael D. Grossberg, Yutong Bai
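TL;DR: This paper proposes Vibe Space, a hierarchical graph manifold for the creative task of connecting and blending distinct visual concepts in latent space (Vibe Blending). The method learns low-dimensional geodesics in feature spaces such as CLIP to produce smooth and semantically consistent transitions between concepts. The paper also designs a cognitively inspired evaluation framework combining human judgments, LLM reasoning, and a geometric path-based difficulty score, and uses it to show that the method generates more creative and coherent blends.
Details
Motivation: Current methods struggle to identify and traverse the nonlinear paths that link distant concepts in latent space, which makes it hard to connect distinct visual concepts creatively through their shared attributes (their vibe) and generate meaningful hybrids.
Result: In human evaluation, blends produced by Vibe Space are consistently rated as more creative and coherent than those of current methods. The paper validates this qualitatively and quantitatively with its cognitively inspired evaluation framework (human judgments, LLM reasoning, and a geometric path-based difficulty score).
Insight: The main novelty is Vibe Space, a hierarchical graph manifold that learns low-dimensional geodesics in feature space, modeling nonlinear relations between concepts and enabling smooth transitions. The paper also introduces a multi-component, cognitively inspired evaluation framework for quantifying the quality of creative generation, which suggests an approach for evaluating similar generative tasks.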
Abstract: Creating new visual concepts often requires connecting distinct ideas through their most relevant shared attributes – their vibe. We introduce Vibe Blending, a novel task for generating coherent and meaningful hybrids that reveals these shared attributes between images. Achieving such blends is challenging for current methods, which struggle to identify and traverse nonlinear paths linking distant concepts in latent space. We propose Vibe Space, a hierarchical graph manifold that learns low-dimensional geodesics in feature spaces like CLIP, enabling smooth and semantically consistent transitions between concepts. To evaluate creative quality, we design a cognitively inspired framework combining human judgments, LLM reasoning, and a geometric path-based difficulty score. We find that Vibe Space produces blends that humans consistently rate as more creative and coherent than current methods.
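The contrast between a straight-line interpolation and a path that stays on the data can be sketched with a plain k-nearest-neighbour graph and Dijkstra's algorithm. This is only an illustration of a graph geodesic, not the paper's hierarchical graph manifold; the toy 2D points stand in for CLIP-like features, and `knn_graph` and `geodesic` are hypothetical helper names.

```python
import heapq
import math

def knn_graph(points, k):
    """Connect each point to its k nearest neighbours (symmetric edges)."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph

def geodesic(graph, start, goal):
    """Dijkstra shortest path; returns node indices from start to goal."""
    best = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            break
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        for nxt, w in graph[node].items():
            new_cost = cost + w
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

# Toy "embeddings" on a half circle: the straight chord between the two ends
# leaves the data, while the graph path hops between neighbouring samples.
points = [(math.cos(math.pi * t / 6), math.sin(math.pi * t / 6)) for t in range(7)]
path = geodesic(knn_graph(points, k=2), 0, 6)
```

Each intermediate node on `path` is an actual sample, so decoding the nodes in order would give a transition that passes through plausible concepts rather than through empty regions of the feature space.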
[23] TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation cs.CV | cs.AI | cs.MM | cs.SDPDF
Zhenzhi Wang, Jian Wang, Ke Ma, Dahua Lin, Bing Zhou
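TL;DR: TalkVerse is a large-scale open dataset for single-person, audio-driven talking video generation, containing 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. On top of it, the paper presents a reproducible 5B-parameter DiT baseline that uses a video VAE with a high downsampling ratio and a sliding window mechanism to achieve minute-long generation at low inference cost, with an MLLM director to enhance long-video storytelling and support for zero-shot video dubbing.
Details
Motivation: Current state-of-the-art audio-driven video generation systems rely on closed data or compute-heavy models, and the field lacks a basis for fair, reproducible comparison. TalkVerse aims to lower the barrier to research by providing a large-scale, high-quality open dataset and a reproducible baseline.
Result: The 5B model trained on TalkVerse delivers lip-sync and visual quality comparable to the 14B Wan-S2V model at 10× lower inference cost, and achieves minute-long generation with low drift.
Insight: The contributions are: 1) TalkVerse, a large-scale open dataset built with a transparent, strictly filtered pipeline and comprehensive annotations; 2) an efficient baseline architecture that combines a high-downsampling-ratio video VAE with a sliding window mechanism to generate long videos at low cost; 3) an MLLM director module that improves narrative coherence in long videos; 4) zero-shot video dubbing via controlled latent noise injection. The dataset, training recipes, and model checkpoints are open-sourced.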
Abstract: We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation designed to enable fair, reproducible comparison across methods. While current state-of-the-art systems rely on closed data or compute-heavy models, TalkVerse offers 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. These are curated from over 60k hours of video via a transparent pipeline that includes scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and comprehensive annotations including 2D skeletons and structured visual/audio-style captions. Leveraging TalkVerse, we present a reproducible 5B DiT baseline built on Wan2.2-5B. By utilizing a video VAE with a high downsampling ratio and a sliding window mechanism with motion-frame context, our model achieves minute-long generation with low drift. It delivers comparable lip-sync and visual quality to the 14B Wan-S2V model but with 10$\times$ lower inference cost. To enhance storytelling in long videos, we integrate an MLLM director to rewrite prompts based on audio and visual cues. Furthermore, our model supports zero-shot video dubbing via controlled latent noise injection. We open-source the dataset, training recipes, and 5B checkpoints to lower barriers for research in audio-driven human video generation. Project Page: https://zhenzhiwang.github.io/talkverse/
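A sliding window with motion-frame context can be pictured as chunked generation where each chunk also sees the last few frames of the previous one. The sketch below only computes the frame index ranges; the window size, context length, and the function name `sliding_windows` are assumptions for illustration, since the abstract does not give the actual schedule.

```python
def sliding_windows(total_frames, window, context):
    """Yield (context_start, start, end) index triples covering total_frames.

    Each window generates frames [start, end) and is conditioned on the
    frames [context_start, start) that were generated just before it.
    """
    if window <= 0 or context < 0:
        raise ValueError("window must be positive and context non-negative")
    start = 0
    while start < total_frames:
        end = min(start + window, total_frames)
        yield max(0, start - context), start, end
        start = end

windows = list(sliding_windows(total_frames=10, window=4, context=2))
# windows == [(0, 0, 4), (2, 4, 8), (6, 8, 10)]
```

The first window has no context, and every later window reuses the two preceding frames, which is one simple way to keep motion continuous across chunk boundaries.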
[24] Puzzle Curriculum GRPO for Vision-Centric Reasoning cs.CVPDF
Ahmadreza Jeddi, Hakki Can Karaimer, Hue Nguyen, Zhongling Wang, Ke Zhao
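TL;DR: This paper proposes Puzzle Curriculum GRPO (PC-GRPO), a supervision-free reinforcement learning method for strengthening visual reasoning in vision-language models (VLMs). It generates verifiable rewards from three self-supervised puzzle environments (PatchFit, Rotation, and Jigsaw) and introduces a difficulty-aware curriculum that dynamically weights samples, addressing the sparse and flat rewards of existing GRPO methods and the logical inconsistency between reasoning chains and final answers.
Details
Motivation: Outcome-supervised methods such as GRPO have key problems in VLM reasoning: reliance on costly and noisy annotations or external verifiers, flat and sparse reward schemes, and logical inconsistency between a chain's reasoning and its final answer. The paper aims to develop a scalable and interpretable RL post-training method that needs neither annotations nor external verifiers.
Result: On Qwen-7B and Qwen-3B backbones and across multiple benchmarks, PC-GRPO improves reasoning quality, training stability, and end-task accuracy. Reasoning-Answer Consistency (RAC) correlates with downstream accuracy; the curriculum delays the decline in RAC, and consistency-enforcing reward schemes boost it further.
Insight: The novelty lies in replacing human annotations with verifiable rewards from self-supervised puzzle tasks and in a difficulty-aware curriculum that dynamically reweights training samples, which mitigates sparse and flat rewards. Monitoring and strengthening reasoning-answer consistency further improves interpretability and performance. Together these offer a practical path to scalable, verifiable RL post-training for VLMs.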
Abstract: Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain’s reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards) and Jigsaw (with graded partial credit mitigating reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.
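The "vanishing group-relative advantages" mentioned in the abstract can be seen in a few lines. The sketch below uses the group-standardized advantage commonly associated with GRPO and a made-up weight `4 * p * (1 - p)` that peaks at medium difficulty; the weight formula and both function names are assumptions, not the paper's definitions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def difficulty_weight(success_rate):
    """Hypothetical weight that peaks at medium difficulty (success rate 0.5)."""
    return 4.0 * success_rate * (1.0 - success_rate)

binary = [1.0, 1.0, 1.0, 1.0]      # every response solved the puzzle
graded = [1.0, 0.5, 0.25, 0.75]    # partial credit, as in a Jigsaw-style reward

flat = group_relative_advantages(binary)     # all zeros: no learning signal
spread = group_relative_advantages(graded)   # non-zero: responses are ranked
```

When every response in a group earns the same reward, the advantages are all zero and the sample contributes no gradient, which is why down-weighting puzzles that are always solved or never solved, and grading partial solutions, are plausible remedies.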
[25] Adaptive Multimodal Person Recognition: A Robust Framework for Handling Missing Modalities cs.CV | cs.SD | eess.AS | eess.IVPDF
Aref Farhadipour, Teodora Vukovic, Volker Dellwo, Petr Motlicek, Srikanth Madikeri
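TL;DR: This paper proposes a robust Trimodal person recognition framework that integrates voice, face, and gesture modalities and handles missing or degraded modalities in real-world conditions. It uses multi-task learning to process each modality independently, cross-attention and gated fusion mechanisms for cross-modal interaction, and a confidence-weighted fusion strategy that adapts dynamically to missing or low-quality data. Experiments show 99.18% Top-1 accuracy on CANDOR with all three modalities and 99.92% accuracy on VoxCeleb1 in Bimodal mode, with high performance retained when modalities are missing.
Details
Motivation: Real-world person recognition systems lose performance when audio, visual, or behavioral modalities are missing or degraded; the paper aims to improve robustness under such non-ideal conditions.
Result: On CANDOR, which the paper benchmarks for the first time, the Trimodal system reaches 99.18% Top-1 accuracy, outperforming conventional Unimodal and late-fusion approaches. On VoxCeleb1 it reaches 99.92% accuracy in Bimodal mode, and it maintains high accuracy when modalities are missing.
Insight: The contributions are the combination of multi-task learning with cross-attention and gated fusion, and a dynamic confidence-weighted fusion strategy; both may transfer to other multimodal tasks that need to tolerate missing data.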
Abstract: Person recognition systems often rely on audio, visual, or behavioral cues, but real-world conditions frequently result in missing or degraded modalities. To address this challenge, we propose a Trimodal person identification framework that integrates voice, face, and gesture modalities, while remaining robust to modality loss. Our approach leverages multi-task learning to process each modality independently, followed by a cross-attention and gated fusion mechanisms to facilitate interaction across modalities. Moreover, a confidence-weighted fusion strategy dynamically adapts to missing and low-quality data, ensuring optimal classification even in Unimodal or Bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset, which we benchmark for the first time. Our results demonstrate that the proposed Trimodal system achieves 99.18% Top-1 accuracy on person identification tasks, outperforming conventional Unimodal and late-fusion approaches. In addition, we evaluate our model on the VoxCeleb1 dataset as a benchmark and reach 99.92% accuracy in Bimodal mode. Moreover, we show that our system maintains high accuracy even when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.
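Confidence-weighted fusion that tolerates missing modalities can be sketched as a renormalized weighted average over whichever modalities are present. The scores, confidences, and the function name `fuse_scores` below are made up for illustration; how the paper estimates confidence is not described in the abstract.

```python
def fuse_scores(scores, confidences):
    """Confidence-weighted average of per-modality class scores.

    scores: dict modality -> list of class scores, or None if missing.
    confidences: dict modality -> non-negative confidence.
    """
    available = {m: s for m, s in scores.items()
                 if s is not None and confidences.get(m, 0.0) > 0.0}
    if not available:
        raise ValueError("no usable modality")
    total = sum(confidences[m] for m in available)
    n_classes = len(next(iter(available.values())))
    fused = [0.0] * n_classes
    for m, s in available.items():
        w = confidences[m] / total  # renormalize over present modalities
        for i, value in enumerate(s):
            fused[i] += w * value
    return fused

# Face is missing; gesture is present but judged unreliable.
scores = {"voice": [0.7, 0.3], "face": None, "gesture": [0.4, 0.6]}
confidences = {"voice": 0.9, "face": 0.8, "gesture": 0.1}
fused = fuse_scores(scores, confidences)
```

Because the weights are renormalized over the modalities that are actually available, the same function covers Trimodal, Bimodal, and Unimodal inputs without special cases.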
[26] Beyond Proximity: A Keypoint-Trajectory Framework for Classifying Affiliative and Agonistic Social Networks in Dairy Cattle cs.CV | cs.AIPDF
Sibi Parivendan, Kashfia Sailunaz, Suresh Neethirajan
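TL;DR: This paper proposes a computer vision framework based on pose keypoint trajectories that distinguishes affiliative from agonistic social behaviors in dairy cattle, going beyond conventional methods based on static distance thresholds. The framework integrates object detection, individual identification, multi-object tracking, anatomical keypoint estimation, and an SVM classifier, and classifies the valence of social interactions automatically from pose information alone on data from a commercial dairy barn.
Details
Motivation: Existing precision livestock farming approaches usually infer interactions from static proximity thresholds, which cannot distinguish affiliative from agonistic behaviors in complex barn environments and limit the interpretability of automated social network analysis. The paper aims to develop an interaction classification framework that goes beyond simple distance heuristics by modeling the spatiotemporal geometry of pose.
Result: On annotated interaction clips from a commercial dairy barn, the pose-only SVM classifier reaches 77.51% accuracy in distinguishing affiliative and agonistic behaviors. Compared with a proximity-only baseline, the framework shows substantial gains in behavioral discrimination, particularly for affiliative interactions. Component results: YOLOv11 object detection mAP@0.50 of 96.24%, individual identification accuracy of 98.24%, and ByteTrack multi-object tracking accuracy of 81.96%.
Insight: The main novelty is to move away from pixel-level appearance and simple distance measures and instead encode interaction-specific motion signatures from anatomical keypoint trajectories to distinguish interaction valence. This gives an end-to-end, proof-of-concept pipeline for building vision-based, interaction-aware social networks, with near-real-time performance on commodity hardware.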
Abstract: Precision livestock farming requires objective assessment of social behavior to support herd welfare monitoring, yet most existing approaches infer interactions using static proximity thresholds that cannot distinguish affiliative from agonistic behaviors in complex barn environments. This limitation constrains the interpretability of automated social network analysis in commercial settings. We present a pose-based computational framework for interaction classification that moves beyond proximity heuristics by modeling the spatiotemporal geometry of anatomical keypoints. Rather than relying on pixel-level appearance or simple distance measures, the proposed method encodes interaction-specific motion signatures from keypoint trajectories, enabling differentiation of social interaction valence. The framework is implemented as an end-to-end computer vision pipeline integrating YOLOv11 for object detection (mAP@0.50: 96.24%), supervised individual identification (98.24% accuracy), ByteTrack for multi-object tracking (81.96% accuracy), ZebraPose for 27-point anatomical keypoint estimation, and a support vector machine classifier trained on pose-derived distance dynamics. On annotated interaction clips collected from a commercial dairy barn, the classifier achieved 77.51% accuracy in distinguishing affiliative and agonistic behaviors using pose information alone. Comparative evaluation against a proximity-only baseline shows substantial gains in behavioral discrimination, particularly for affiliative interactions. The results establish a proof-of-concept for automated, vision-based inference of social interactions suitable for constructing interaction-aware social networks, with near-real-time performance on commodity hardware.
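The difference between a static proximity threshold and "pose-derived distance dynamics" can be illustrated by summarizing how the distance between two keypoints changes over a clip. The feature set, the keypoint choice, and the function names below are assumptions for illustration; the abstract does not list the features fed to the SVM.

```python
import math
import statistics

def distance_series(track_a, track_b):
    """Per-frame Euclidean distance between one keypoint on each animal."""
    return [math.dist(a, b) for a, b in zip(track_a, track_b)]

def distance_dynamics(track_a, track_b):
    """Summary features of how the inter-keypoint distance evolves over a clip."""
    d = distance_series(track_a, track_b)
    speed = [d[i + 1] - d[i] for i in range(len(d) - 1)]
    return {
        "mean_distance": statistics.fmean(d),
        "min_distance": min(d),
        "mean_closing_speed": statistics.fmean(speed),  # negative = approaching
        "max_abs_speed": max(abs(s) for s in speed),
    }

# Hypothetical head keypoints (x, y) over four frames for two cows.
cow_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
cow_b = [(10.0, 0.0), (9.0, 0.0), (8.0, 0.0), (7.0, 0.0)]
features = distance_dynamics(cow_a, cow_b)
```

A threshold on `min_distance` alone would treat a slow approach and a fast charge identically; the speed features are the kind of signal a classifier could use to separate them.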
[27] Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation cs.CV | cs.AIPDF
Huaying Zhang, Atsushi Hashimoto, Tosho Hirasawa
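TL;DR: This paper proposes a protocol for evaluating how well video question generation (VQG) models elicit unseen knowledge from human experts, and builds a new dataset, EgoExoAsk, for training and benchmarking.
Details
Motivation: The paper addresses how to evaluate question-generation models quantitatively, specifically the quality of generated questions in eliciting unseen knowledge from experts, rather than the conventional evaluation based on the ability to answer questions.
Result: Experiments show that the proposed metric aligns reasonably with question-generation settings: models with access to richer context score better, which supports the validity of the protocol. The benchmark is built on the validation set with Ego-Exo4D video segments.
Insight: The novelty is shifting VQG evaluation from question-answering ability to the quality of generated questions, in particular their ability to elicit expert knowledge, and making this measurable through a retrieval-based protocol that simulates question-answering communication with experts and through the EgoExoAsk dataset.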
Abstract: Skilled human interviewers can extract valuable information from experts. This raises a fundamental question: what makes some questions more effective than others? To address this, a quantitative evaluation of question-generation models is essential. Video question generation (VQG) is a topic for video question answering (VideoQA), where questions are generated for given answers. Their evaluation typically focuses on the ability to answer questions, rather than the quality of generated questions. In contrast, we focus on the question quality in eliciting unseen knowledge from human experts. For a continuous improvement of VQG models, we propose a protocol that evaluates the ability by simulating question-answering communication with experts using a question-to-answer retrieval. We obtain the retriever by constructing a novel dataset, EgoExoAsk, which comprises 27,666 QA pairs generated from Ego-Exo4D’s expert commentary annotation. The EgoExoAsk training set is used to obtain the retriever, and the benchmark is constructed on the validation set with Ego-Exo4D video segments. Experimental results demonstrate our metric reasonably aligns with question generation settings: models accessing richer context are evaluated better, supporting that our protocol works as intended. The EgoExoAsk dataset is available in https://github.com/omron-sinicx/VQG4ExpertKnowledge .
[28] Model Agnostic Preference Optimization for Medical Image Segmentation cs.CVPDF
Yunseong Nam, Jiwon Jang, Dongkyu Won, Sang Hyun Park, Soopil Kim
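TL;DR: This paper proposes MAPO, a model-agnostic preference optimization framework for medical image segmentation. It uses Dropout-driven stochastic segmentation hypotheses to construct preference-consistent gradients without direct ground-truth supervision, and supports 2D/3D CNN and Transformer-based architectures.
Details
Motivation: Existing preference optimization methods for medical image segmentation are model-specific and rely on low-diversity prediction sampling; the paper aims to provide a scalable supervision paradigm that does not depend on a particular model.
Result: Comprehensive evaluations across multiple medical datasets show that, compared with conventional supervised training, MAPO consistently improves boundary adherence, reduces overfitting, and yields more stable optimization dynamics.
Insight: The novelty is a fully model-agnostic preference optimization framework that builds its training signal from diverse segmentation hypotheses generated with Dropout, avoiding direct dependence on ground-truth labels and improving generality and robustness.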
Abstract: Preference optimization offers a scalable supervision paradigm based on relative preference signals, yet prior attempts in medical image segmentation remain model-specific and rely on low-diversity prediction sampling. In this paper, we propose MAPO (Model-Agnostic Preference Optimization), a training framework that utilizes Dropout-driven stochastic segmentation hypotheses to construct preference-consistent gradients without direct ground-truth supervision. MAPO is fully architecture- and dimensionality-agnostic, supporting 2D/3D CNN and Transformer-based segmentation pipelines. Comprehensive evaluations across diverse medical datasets reveal that MAPO consistently enhances boundary adherence, reduces overfitting, and yields more stable optimization dynamics compared to conventional supervised training.
[29] MVGSR: Multi-View Consistent 3D Gaussian Super-Resolution via Epipolar Guidance cs.CVPDF
Kaizhe Zhang, Shinan Chen, Qian Zhao, Weizhan Zhang, Caixia Yan
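TL;DR: This paper proposes MVGSR, a multi-view consistent super-resolution framework for 3D Gaussian Splatting (3DGS). It targets the weaknesses of existing single-image and video-based 3DGS super-resolution methods in cross-view consistency and information fusion. By introducing camera-pose-based auxiliary view selection and an epipolar-constrained multi-view attention mechanism, it integrates information from arbitrarily organized multi-view datasets to produce high-resolution 3DGS renderings with high-frequency details and enhanced consistency.
Details
Motivation: Existing 3DGS super-resolution methods have limitations: methods built on single-image SR networks lack cross-view consistency and cannot fuse complementary information across views, while video-based SR methods require strictly sequential frames and are hard to apply to unstructured multi-view datasets.
Result: Extensive experiments on object-centric and scene-level 3DGS super-resolution benchmarks show that the method achieves state-of-the-art (SOTA) performance.
Insight: The main contributions are: 1) an auxiliary view selection method based on camera poses, which lets the framework handle arbitrarily organized multi-view datasets without temporal continuity or data reordering; 2) the first use of an epipolar-constrained multi-view attention mechanism in 3DGS super-resolution, serving as the core of the network and selectively aggregating consistent information from auxiliary views to improve the geometric consistency and detail fidelity of 3DGS representations.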
Abstract: Scenes reconstructed by 3D Gaussian Splatting (3DGS) trained on low-resolution (LR) images are unsuitable for high-resolution (HR) rendering. Consequently, a 3DGS super-resolution (SR) method is needed to bridge LR inputs and HR rendering. Early 3DGS SR methods rely on single-image SR networks, which lack cross-view consistency and fail to fuse complementary information across views. More recent video-based SR approaches attempt to address this limitation but require strictly sequential frames, limiting their applicability to unstructured multi-view datasets. In this work, we introduce Multi-View Consistent 3D Gaussian Splatting Super-Resolution (MVGSR), a framework that focuses on integrating multi-view information for 3DGS rendering with high-frequency details and enhanced consistency. We first propose an Auxiliary View Selection Method based on camera poses, making our method adaptable for arbitrarily organized multi-view datasets without the need of temporal continuity or data reordering. Furthermore, we introduce, for the first time, an epipolar-constrained multi-view attention mechanism into 3DGS SR, which serves as the core of our proposed multi-view SR network. This design enables the model to selectively aggregate consistent information from auxiliary views, enhancing the geometric consistency and detail fidelity of 3DGS representations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both object-centric and scene-level 3DGS SR benchmarks.
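The epipolar constraint behind such an attention mechanism is standard two-view geometry: for a relative pose with X2 = R X1 + t and normalized image coordinates, matching points satisfy x2ᵀ E x1 = 0 with E = [t]× R. The sketch below checks this on a toy pose; how MVGSR turns the constraint into an attention mask is not described in the abstract, and all function names here are illustrative.

```python
def cross_matrix(t):
    """Skew-symmetric matrix [t]x such that [t]x v equals the cross product t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def essential_matrix(rotation, translation):
    """E = [t]x R for a relative pose with X2 = R X1 + t."""
    return matmul(cross_matrix(translation), rotation)

def epipolar_residual(essential, x1, x2):
    """x2^T E x1; zero when x1 and x2 are projections of the same 3D point."""
    ex1 = matvec(essential, x1)
    return sum(x2[i] * ex1[i] for i in range(3))

# Toy pose: identity rotation, second camera shifted along the x axis.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [-1.0, 0.0, 0.0]
E = essential_matrix(R, t)
point = [0.5, 0.2, 4.0]                      # 3D point in the camera-1 frame
x1 = [point[0] / point[2], point[1] / point[2], 1.0]
p2 = [point[i] + t[i] for i in range(3)]     # same point in the camera-2 frame
x2 = [p2[0] / p2[2], p2[1] / p2[2], 1.0]
on_line = epipolar_residual(E, x1, x2)                          # true match
off_line = epipolar_residual(E, x1, [x2[0], x2[1] + 0.1, 1.0])  # shifted off the line
```

An attention layer could, for example, restrict each query pixel to auxiliary-view locations whose residual is close to zero, so that only geometrically consistent features are aggregated.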
[30] Tracking spatial temporal details in ultrasound long video via wavelet analysis and memory bank cs.CV | cs.AIPDF
Chenxiao Zhang, Runshi Zhang, Junchen Wang
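TL;DR: This paper proposes a memory bank-based wavelet filtering and fusion network (MWNet) for high-fidelity segmentation of lesion areas and target organs in long ultrasound videos. Using an encoder-decoder structure with wavelet analysis and a memory bank, the method extracts fine-grained spatial features, fuses high-frequency information, and tracks objects in long videos.
Details
Motivation: The low contrast and noisy backgrounds of ultrasound videos cause missegmentation of organ boundaries and loss of small objects, and object tracking in long videos remains difficult.
Result: Benchmark tests on four ultrasound video datasets (two thyroid nodule datasets, a thyroid gland dataset, and a heart dataset) show marked improvements in segmentation metrics over state-of-the-art methods, with more accurate segmentation of small thyroid nodules in particular.
Insight: The contributions are: memory-based wavelet convolution that captures category and detail information simultaneously; cascaded wavelet compression that fuses multiscale frequency-domain features; a long short-term memory bank with cross-attention and memory compression for tracking in long videos; and a high-frequency-aware feature fusion module that uses adaptive wavelet filters to exploit boundary-sensitive details.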
Abstract: Medical ultrasound videos are widely used for medical inspections, disease diagnosis and surgical planning. High-fidelity lesion area and target organ segmentation constitutes a key component of the computer-assisted surgery workflow. The low contrast levels and noisy backgrounds of ultrasound videos cause missegmentation of organ boundary, which may lead to small object losses and increase boundary segmentation errors. Object tracking in long videos also remains a significant research challenge. To overcome these challenges, we propose a memory bank-based wavelet filtering and fusion network, which adopts an encoder-decoder structure to effectively extract fine-grained detailed spatial features and integrate high-frequency (HF) information. Specifically, memory-based wavelet convolution is presented to simultaneously capture category, detailed information and utilize adjacent information in the encoder. Cascaded wavelet compression is used to fuse multiscale frequency-domain features and expand the receptive field within each convolutional layer. A long short-term memory bank using cross-attention and memory compression mechanisms is designed to track objects in long video. To fully utilize the boundary-sensitive HF details of feature maps, an HF-aware feature fusion module is designed via adaptive wavelet filters in the decoder. In extensive benchmark tests conducted on four ultrasound video datasets (two thyroid nodule, the thyroid gland, the heart datasets) compared with the state-of-the-art methods, our method demonstrates marked improvements in segmentation metrics. In particular, our method can more accurately segment small thyroid nodules, demonstrating its effectiveness for cases involving small ultrasound objects in long video. The code is available at https://github.com/XiAooZ/MWNet.
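Why wavelet high-frequency bands are "boundary-sensitive" can be shown with a one-level Haar transform on a single row of intensities. This is a generic 1D Haar example, not the paper's wavelet convolution or adaptive filters; the function names and the toy row are assumptions.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length list."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high

def haar_inverse(low, high):
    """Invert haar_step, recovering the original samples."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) / s, (a - d) / s])
    return out

# One row of pixel intensities with a sharp edge between index 2 and index 3.
row = [10.0, 10.0, 10.0, 50.0, 50.0, 50.0]
low, high = haar_step(row)
restored = haar_inverse(low, high)
```

The high band is zero wherever the row is flat and non-zero only for the pair that straddles the edge, while the low band is a half-resolution summary; the inverse shows that no information is lost by the split.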
[31] PMMD: A pose-guided multi-view multi-modal diffusion for person generation cs.CV | cs.AIPDF
Ziyu Shang, Haoran Liu, Rongchao Zhang, Zhiqian Wei, Tongtong Feng
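TL;DR: This paper proposes PMMD, a diffusion framework for generating photorealistic person images with controllable pose and appearance. The model is conditioned on multi-view reference images, pose maps, and text prompts, and addresses the occlusion, garment style drift, and pose misalignment problems of existing methods.
Details
Motivation: Existing methods for generating consistent, controllable person images often suffer from occlusions, garment style drift, and pose misalignment, which limits their usefulness in virtual try-on, image editing, and digital human creation.
Result: Experiments on the DeepFashion MultiModal dataset show that PMMD outperforms representative baselines in consistency, detail preservation, and controllability.
Insight: The contributions are: a multimodal encoder that jointly models visual views, pose features, and semantic descriptions to reduce cross-modal discrepancy and improve identity fidelity; a ResCVA module that enhances local detail while preserving global structure; and a cross-modal fusion module that integrates image semantics with text throughout the denoising pipeline.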
Abstract: Generating consistent human images with controllable pose and appearance is essential for applications in virtual try on, image editing, and digital human creation. Current methods often suffer from occlusions, garment style drift, and pose misalignment. We propose Pose-guided Multi-view Multimodal Diffusion (PMMD), a diffusion framework that synthesizes photorealistic person images conditioned on multi-view references, pose maps, and text prompts. A multimodal encoder jointly models visual views, pose features, and semantic descriptions, which reduces cross modal discrepancy and improves identity fidelity. We further design a ResCVA module to enhance local detail while preserving global structure, and a cross modal fusion module that integrates image semantics with text throughout the denoising pipeline. Experiments on the DeepFashion MultiModal dataset show that PMMD outperforms representative baselines in consistency, detail preservation, and controllability. Project page and code are available at https://github.com/ZANMANGLOOPYE/PMMD.
[32] Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets cs.CVPDF
Jialong Zuo, Haoyou Deng, Hanyu Zhou, Jiaxin Zhu, Yicheng Zhang
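TL;DR: This study comprehensively evaluates the zero-shot performance of Nano Banana Pro on low-level vision, covering 14 tasks and 40 datasets. It finds that the model excels in subjective visual quality but lags behind specialist models on traditional quantitative metrics.
Details
Motivation: The paper explores the potential of commercial text-to-image generation models such as Nano Banana Pro as generalist solvers for traditional low-level vision tasks, which are usually handled by specialist models; this capability remains underexplored.
Result: In the zero-shot setting, Nano Banana Pro surpasses specialist models in subjective visual quality and produces plausible high-frequency details, but it performs worse on reference-based quantitative metrics (e.g., PSNR, SSIM) and does not reach SOTA.
Insight: The novelty is the first large-scale, cross-task zero-shot evaluation of a generative model on low-level vision. It reveals a trade-off between subjective quality and pixel-level fidelity in generative models, and questions whether traditional metrics suit generative approaches to low-level vision.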
Abstract: The rapid evolution of text-to-image generation models has revolutionized visual content creation. While commercial products like Nano Banana Pro have garnered significant attention, their potential as generalist solvers for traditional low-level vision challenges remains largely underexplored. In this study, we investigate the critical question: Is Nano Banana Pro a Low-Level Vision All-Rounder? We conducted a comprehensive zero-shot evaluation across 14 distinct low-level tasks spanning 40 diverse datasets. By utilizing simple textual prompts without fine-tuning, we benchmarked Nano Banana Pro against state-of-the-art specialist models. Our extensive analysis reveals a distinct performance dichotomy: while Nano Banana Pro demonstrates superior subjective visual quality, often hallucinating plausible high-frequency details that surpass specialist models, it lags behind in traditional reference-based quantitative metrics. We attribute this discrepancy to the inherent stochasticity of generative models, which struggle to maintain the strict pixel-level consistency required by conventional metrics. This report identifies Nano Banana Pro as a capable zero-shot contender for low-level vision tasks, while highlighting that achieving the high fidelity of domain specialists remains a significant hurdle.
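The gap between subjective quality and reference-based metrics can be made concrete with PSNR, used here as one example of such a metric. The pixel values below are invented toy data, not results from the report; they only show how plausible but non-matching detail is penalized.

```python
import math

def psnr(reference, output, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-sized pixel lists."""
    if len(reference) != len(output) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

reference = [100, 100, 200, 200]
smooth = [110, 110, 190, 190]     # slightly off everywhere, no invented detail
detailed = [100, 140, 160, 200]   # crisp-looking detail the reference lacks

psnr_smooth = psnr(reference, smooth)
psnr_detailed = psnr(reference, detailed)
```

The output with invented detail scores lower even though two of its four pixels are exact, because squared error punishes large local deviations; a generative model that hallucinates high-frequency texture runs into the same effect.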
[33] 3DProxyImg: Controllable 3D-Aware Animation Synthesis from Single Image via 2D-3D Aligned Proxy Embedding cs.CVPDF
Yupeng Zhu, Xiongzhen Zhang, Ye Chen, Bingbing Ni
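TL;DR: This paper proposes 3DProxyImg, a lightweight 3D animation framework that addresses the trade-off between rendering quality and 3D control in single-image 3D animation generation. Its core is a 2D-3D aligned proxy representation that decouples geometric control from appearance synthesis, which preserves high-quality appearance while enabling 3D-aware motion control and interaction comparable to classical pipelines and supporting coherent background animation.
Details
Motivation: Traditional 3D animation pipelines are labor-intensive, expertise-demanding, and computationally expensive. Existing AIGC approaches either inherit the heavy costs of full 3D pipelines or rely on video-synthesis paradigms that sacrifice 3D controllability and interactivity. The paper focuses on single-image 3D animation generation and aims to overcome the fundamental trade-off between rendering quality and 3D control.
Result: Extensive experiments show that the method generates animation efficiently on low-power platforms and outperforms video-based 3D animation generation in identity preservation, geometric and textural consistency, and the level of precise, interactive control offered to users.
Insight: The main novelty is a 2D-3D aligned proxy representation that uses a coarse 3D estimate as a structural carrier and delegates high-fidelity appearance and view synthesis to learned image-space generative priors. This formulation enables 3D control without accurate geometry or expensive optimization, extends naturally to background animation, and balances controllability, quality, and efficiency.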
Abstract: 3D animation is central to modern visual media, yet traditional production pipelines remain labor-intensive, expertise-demanding, and computationally expensive. Recent AIGC-based approaches partially automate asset creation and rigging, but they either inherit the heavy costs of full 3D pipelines or rely on video-synthesis paradigms that sacrifice 3D controllability and interactivity. We focus on single-image 3D animation generation and argue that progress is fundamentally constrained by a trade-off between rendering quality and 3D control. To address this limitation, we propose a lightweight 3D animation framework that decouples geometric control from appearance synthesis. The core idea is a 2D-3D aligned proxy representation that uses a coarse 3D estimate as a structural carrier, while delegating high-fidelity appearance and view synthesis to learned image-space generative priors. This proxy formulation enables 3D-aware motion control and interaction comparable to classical pipelines, without requiring accurate geometry or expensive optimization, and naturally extends to coherent background animation. Extensive experiments demonstrate that our method achieves efficient animation generation on low-power platforms and outperforms video-based 3D animation generation in identity preservation, geometric and textural consistency, and the level of precise, interactive control it offers to users.
[34] Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning cs.CVPDF
Mengshi Qi, Yeteng Wu, Xianlin Zhang, Huadong Ma
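TL;DR: This paper proposes a new task, Human Action Form Assessment (AFA), which evaluates whether an action is performed to standard and provides explainable feedback. The authors build CoT-AFA, a dataset of fitness and martial arts videos with multi-level annotations, and introduce a Chain-of-Thought explanation paradigm that supplies a complete reasoning process. They also propose a framework called Explainable Fitness Assessor, which fuses visual and semantic information through parallel processing streams and a dynamic gating mechanism to strengthen its analysis.
Details
Motivation: Existing video understanding methods focus mainly on what the action is and where it occurs, so they cannot assess how standard an action is or provide explainable feedback. In addition, existing datasets lack labels for the degree of action standardization, and action quality assessment datasets lack explainability and detailed feedback.
Result: Experiments show improvements in explanation generation (e.g., +16.0% in CIDEr), action classification (+2.7% in accuracy), and quality assessment (+2.1% in accuracy), indicating the potential of the CoT-AFA dataset for future research.
Insight: The contributions are the definition of the new AFA task, the multimodal CoT-AFA dataset with its Chain-of-Thought explanation paradigm, and an explainable assessment framework that fuses visual and semantic information. The Chain-of-Thought reasoning process and the dynamic fusion mechanism offer a new direction for explainable action analysis.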
Abstract: Evaluating whether a human action is standard or not and providing reasonable feedback to improve action standardization is crucial but challenging in real-world scenarios. However, current video understanding methods are mainly concerned with what and where the action is, which does not meet these requirements. Meanwhile, most of the existing datasets lack labels indicating the degree of action standardization, and the action quality assessment datasets lack explainability and detailed feedback. Therefore, we define a new Human Action Form Assessment (AFA) task, and introduce a new diverse dataset CoT-AFA, which contains a large number of fitness and martial arts videos with multi-level annotations for comprehensive video analysis. We enrich the CoT-AFA dataset with a novel Chain-of-Thought explanation paradigm. Instead of offering isolated feedback, our explanations provide a complete reasoning process, from identifying an action step to analyzing its outcome and proposing a concrete solution. Furthermore, we propose a framework named Explainable Fitness Assessor, which can not only judge an action but also explain why and provide a solution. This framework employs two parallel processing streams and a dynamic gating mechanism to fuse visual and semantic information, thereby boosting its analytical capabilities. The experimental results demonstrate that our method has achieved improvements in explanation generation (e.g., +16.0% in CIDEr), action classification (+2.7% in accuracy) and quality assessment (+2.1% in accuracy), revealing great potential of CoT-AFA for future studies. Our dataset and source code are available at https://github.com/MICLAB-BUPT/EFA.
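The abstract mentions a dynamic gating mechanism that fuses visual and semantic information but does not give its formula. As a rough illustration only, the sketch below shows a generic sigmoid-gated fusion of two feature vectors. The function name `gated_fusion`, the scalar gate, and the weight vectors are all hypothetical and are not taken from the paper or its repository.

```python
import math

def gated_fusion(visual, semantic, w_visual, w_semantic, bias=0.0):
    """Generic gated fusion (illustrative assumption, not the paper's module).

    A scalar gate in (0, 1) is computed from both inputs, then used to blend
    the visual and semantic feature vectors element-wise.
    """
    logit = bias
    logit += sum(w * v for w, v in zip(w_visual, visual))
    logit += sum(w * s for w, s in zip(w_semantic, semantic))
    gate = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    fused = [gate * v + (1.0 - gate) * s for v, s in zip(visual, semantic)]
    return gate, fused

# With all-zero gate weights the gate is 0.5, so the result is a plain average.
gate, fused = gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0])
```

In a trained model the gate weights would be learned, so the blend could lean toward whichever stream is more informative for a given clip.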
[35] EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence cs.CVPDF
Jiaxu Wan, Xu Wang, Mengwei Xie, Hang Zhang, Mu Xu
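TL;DR: EagleVision is a dual-stage framework for spatial intelligence that achieves progressive spatial cognition through macro perception and micro verification. It uses a semantics-perspective-fusion determinantal point process (SPF-DPP) to select keyframes from long videos under a fixed token budget, and formalizes spatial Chain-of-Thought as BEV-grounded pose querying trained with reinforcement learning.
Details
Motivation: Existing spatial intelligence methods suffer from weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Spatial Chain-of-Thought also faces three key challenges: building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning.
Result: On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision-language models, demonstrating strong and generalizable spatial understanding.
Insight: The contributions are: 1) a dual-stage framework for progressive spatial cognition; 2) SPF-DPP for efficient keyframe selection; 3) formalizing spatial Chain-of-Thought as BEV-grounded pose querying, trained by reinforcement learning with a spatial grounding reward, which improves interpretability and consistency.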
Abstract: Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for “thinking with images” (e.g., ChatGPT-o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present EagleVision, a dual-stage framework for progressive spatial cognition through macro perception and micro verification. In the macro perception stage, EagleVision employs a semantics-perspective-fusion determinantal point process (SPF-DPP) to select a compact set of geometry- and semantics-aware keyframes from long videos under a fixed token budget. In the micro verification stage, we formalize spatial CoT as BEV-grounded pose querying: the agent iteratively predicts poses on a BEV plane, retrieves the nearest real frames, and is trained purely by reinforcement learning with a spatial grounding reward that scores the consistency between predicted poses and observed views. On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision-language models, demonstrating strong and generalizable spatial understanding.
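The micro verification stage is described as predicting a pose on a BEV plane and retrieving the nearest real frames. A minimal sketch of such a retrieval step is given below, assuming each frame carries a known BEV pose `(x, y, yaw)`. The cost function (position distance plus weighted heading difference), the `yaw_weight` parameter, and the function name are assumptions for illustration; the abstract does not specify how "nearest" is measured.

```python
import math

def nearest_frame(query, frames, yaw_weight=1.0):
    """Return the id of the frame whose BEV pose is closest to the query pose.

    query:  (x, y, yaw) with yaw in radians
    frames: list of (frame_id, x, y, yaw)
    The combined cost below is a hypothetical choice, not the paper's metric.
    """
    qx, qy, qyaw = query

    def cost(frame):
        _, x, y, yaw = frame
        d_pos = math.hypot(x - qx, y - qy)
        # Wrap the heading difference into [-pi, pi] before taking its magnitude.
        d_yaw = abs(math.atan2(math.sin(yaw - qyaw), math.cos(yaw - qyaw)))
        return d_pos + yaw_weight * d_yaw

    return min(frames, key=cost)[0]

frames = [
    (0, 0.0, 0.0, 0.0),
    (1, 2.0, 0.0, math.pi / 2),  # same spot as the query below, but facing sideways
    (2, 2.0, 0.1, 0.0),          # slightly offset, same heading
]
best = nearest_frame((2.0, 0.0, 0.0), frames)  # selects frame 2
```

Including the heading term matters because two frames taken from the same position can show entirely different content.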
[36] Cross-modal ultra-scale learning with tri-modalities of renal biopsy images for glomerular multi-disease auxiliary diagnosis cs.CVPDF
Kaixing Long, Danyi Weng, Yun Mi, Zhentai Zhang, Yanmeng Lu
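TL;DR: This paper proposes a cross-modal ultra-scale learning network (CMUS-Net) for auxiliary diagnosis of multiple glomerular diseases from three modalities of renal biopsy images (optical microscopy, immunofluorescence microscopy, and transmission electron microscopy). Using a sparse multi-instance learning module and a cross-modal scale attention module, it addresses the feature-fusion challenge caused by the large scale gap between nanoscale and microscale images, improving classification accuracy.
Details
Motivation: Existing multi-modal and multi-scale models struggle to fuse nanoscale (transmission electron microscopy) and microscale (optical microscopy, immunofluorescence microscopy) image features effectively, which limits the accuracy of automatic glomerular multi-disease classification.
Result: On an in-house dataset, CMUS-Net achieves an ACC of 95.37±2.41%, an AUC of 99.05±0.53%, and an F1-score of 95.32±2.41% when classifying IgA nephropathy, membranous nephropathy, and lupus nephritis. It outperforms other well-known multi-modal or multi-scale methods and shows generalization capability in staging membranous nephropathy.
Insight: The contributions are: 1) using ultrastructural information to bridge the gap between nanometer and micrometer scales; 2) a sparse multi-instance learning module that aggregates features from transmission electron microscopy images; 3) a cross-modal scale attention module that promotes feature interaction to enhance pathological semantic information. According to the authors, this is the first automatic classification of multiple glomerular diseases based on images from three modalities and two scales.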
Abstract: Constructing a multi-modal automatic classification model based on three types of renal biopsy images can assist pathologists in glomerular multi-disease identification. However, the substantial scale difference between transmission electron microscopy (TEM) image features at the nanoscale and optical microscopy (OM) or immunofluorescence microscopy (IM) images at the microscale poses a challenge for existing multi-modal and multi-scale models in achieving effective feature fusion and improving classification accuracy. To address this issue, we propose a cross-modal ultra-scale learning network (CMUS-Net) for the auxiliary diagnosis of multiple glomerular diseases. CMUS-Net utilizes multiple ultrastructural information to bridge the scale difference between nanometer and micrometer images. Specifically, we introduce a sparse multi-instance learning module to aggregate features from TEM images. Furthermore, we design a cross-modal scale attention module to facilitate feature interaction, enhancing pathological semantic information. Finally, multiple loss functions are combined, allowing the model to weigh the importance among different modalities and achieve precise classification of glomerular diseases. Our method follows the conventional process of renal biopsy pathology diagnosis and, for the first time, performs automatic classification of multiple glomerular diseases including IgA nephropathy (IgAN), membranous nephropathy (MN), and lupus nephritis (LN) based on images from three modalities and two scales. On an in-house dataset, CMUS-Net achieves an ACC of 95.37+/-2.41%, an AUC of 99.05+/-0.53%, and an F1-score of 95.32+/-2.41%. Extensive experiments demonstrate that CMUS-Net outperforms other well-known multi-modal or multi-scale methods and show its generalization capability in staging MN. Code is available at https://github.com/SMU-GL-Group/MultiModal_lkx/tree/main.
[37] Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification cs.CV | cs.AIPDF
Yupeng Zhang, Adam G. Dunn, Usman Naseem, Jinman Kim
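TL;DR: This paper proposes Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that targets intersectional bias in medical vision-language models (VLMs) for disease classification, where models are systematically less confident when diagnosing marginalised patient subgroups (e.g., specific combinations of age, gender, and race). Without requiring sensitive demographic data at clinical inference time, the method standardises diagnostic certainty across intersectional subgroups, narrowing the performance gap between subgroups while also improving overall diagnostic accuracy.
Details
Motivation: Medical AI systems, particularly multimodal vision-language models, often exhibit intersectional bias due to demographically skewed data and divergent distributions of diagnostic certainty. This lowers diagnostic confidence for marginalised patient subgroups and raises the risk of inaccurate and missed diagnoses. Existing fairness interventions frequently fail to close these gaps, or sacrifice overall diagnostic performance to achieve statistical parity among subgroups.
Result: On the skin lesion dataset (HAM10000, 10,015 images), the method reduces the intersectional missed diagnosis gap (difference in True Positive Rate, ΔTPR) from 0.50 to 0.26 while raising the overall AUC from 0.94 to 0.97. On the glaucoma screening dataset (Harvard-FairVLMed, 10,000 fundus images), ΔTPR drops from 0.41 to 0.31 and AUC rises from 0.71 to 0.72. The method is also evaluated on an external validation set (BCN20000, 12,000 images).
Insight: The core contribution is the CMAC-MMD framework, which uses cross-modal alignment to standardise the distribution of diagnostic certainty across intersectional subgroups, mitigating intersectional bias without sacrificing (and in fact improving) overall performance. Its key advantage is that no sensitive demographic data is needed at clinical inference time, which lowers privacy risk and provides a scalable framework for high-stakes clinical decision support systems that are both accurate and fair.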
Abstract: Medical artificial intelligence (AI) systems, particularly multimodal vision-language models (VLM), often exhibit intersectional biases where models are systematically less confident in diagnosing marginalised patient subgroups. Such bias can lead to higher rates of inaccurate and missed diagnoses due to demographically skewed data and divergent distributions of diagnostic certainty. Current fairness interventions frequently fail to address these gaps or compromise overall diagnostic performance to achieve statistical parity among the subgroups. In this study, we developed Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that standardises diagnostic certainty across intersectional patient subgroups. Unlike traditional debiasing methods, this approach equalises the model’s decision confidence without requiring sensitive demographic data during clinical inference. We evaluated this approach using 10,015 skin lesion images (HAM10000) with external validation on 12,000 images (BCN20000), and 10,000 fundus images for glaucoma detection (Harvard-FairVLMed), stratifying performance by intersectional age, gender, and race attributes. In the dermatology cohort, the proposed method reduced the overall intersectional missed diagnosis gap (difference in True Positive Rate, $Δ$TPR) from 0.50 to 0.26 while improving the overall Area Under the Curve (AUC) from 0.94 to 0.97 compared to standard training. Similarly, for glaucoma screening, the method reduced $Δ$TPR from 0.41 to 0.31, achieving a better AUC of 0.72 (vs. 0.71 baseline). This establishes a scalable framework for developing high-stakes clinical decision support systems that are both accurate and can perform equitably across diverse patient subgroups, ensuring reliable performance without increasing privacy risks.
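The reported fairness metric, ΔTPR, is described as a difference in True Positive Rate across intersectional subgroups. A minimal sketch of one way to compute such a gap is shown below, assuming it is the maximum minus the minimum per-subgroup TPR; that aggregation, the function name `tpr_gap`, and the toy subgroup labels are assumptions, since the abstract does not give the exact definition.

```python
from collections import defaultdict

def tpr_gap(records):
    """Compute per-subgroup TPR and the max-min gap (assumed definition).

    records: iterable of (subgroup, y_true, y_pred) with binary labels,
             where subgroup is any hashable key such as (age, gender, race).
    Subgroups without positive cases are skipped, since TPR is undefined there.
    """
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 1:
            positives[group] += 1
            if y_pred == 1:
                true_pos[group] += 1
    tpr = {g: true_pos[g] / positives[g] for g in positives}
    return max(tpr.values()) - min(tpr.values()), tpr

# Toy data: two intersectional subgroups with four positive cases each.
records = [
    (("<40", "F", "A"), 1, 1), (("<40", "F", "A"), 1, 1),
    (("<40", "F", "A"), 1, 0), (("<40", "F", "A"), 1, 1),
    ((">=60", "M", "B"), 1, 1), ((">=60", "M", "B"), 1, 0),
    ((">=60", "M", "B"), 1, 0), ((">=60", "M", "B"), 1, 0),
    ((">=60", "M", "B"), 0, 0),  # negative case, ignored by TPR
]
gap, per_group = tpr_gap(records)  # TPRs are 0.75 and 0.25, so the gap is 0.5
```

A smaller gap means missed diagnoses are spread more evenly across subgroups, which is the direction of the improvement the abstract reports.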
[38] Assessing the Visual Enumeration Abilities of Specialized Counting Architectures and Vision-Language Models cs.CV | cs.LGPDF
Kuinan Hou, Jing Mi, Marco Zorzi, Lamberto Ballan, Alberto Testolin
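TL;DR: This study systematically compares specialized counting architectures with vision-language models (VLMs) on counting items in visual scenes. Most VLMs can approximately enumerate objects, matching or even surpassing specialized architectures, and accuracy improves significantly when they generate intermediate representations (such as locations and verbal labels). However, no model counts reliably in complex scenes.
Details
Motivation: Object counting in visual scenes remains a fundamental challenge. The study explores whether domain-general, large-scale multimodal vision-language models can serve as a flexible alternative to traditional specialized counting architectures for open-set object counting.
Result: On two popular counting datasets and a newly built benchmark with finer-grained control over visual properties, VLMs match or surpass state-of-the-art specialized counting architectures, but no model counts reliably in complex scenes.
Insight: The novelty lies in systematically evaluating the potential of VLMs for counting and in finding that prompting them to generate intermediate representations (such as object locations and verbal labels) significantly improves enumeration accuracy. This suggests a route to open-set counting using the general capabilities of VLMs, while reliability in complex scenes remains a key challenge for future work.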
Abstract: Counting the number of items in a visual scene remains a fundamental yet challenging task in computer vision. Traditional approaches to solving this problem rely on domain-specific counting architectures, which are trained using datasets annotated with a predefined set of object categories. However, recent progress in creating large-scale multimodal vision-language models (VLMs) suggests that these domain-general architectures may offer a flexible alternative for open-set object counting. In this study, we therefore systematically compare the performance of state-of-the-art specialized counting architectures against VLMs on two popular counting datasets, as well as on a novel benchmark specifically created to have a finer-grained control over the visual properties of test images. Our findings show that most VLMs can approximately enumerate the number of items in a visual scene, matching or even surpassing the performance of specialized computer vision architectures. Notably, enumeration accuracy significantly improves when VLMs are prompted to generate intermediate representations (i.e., locations and verbal labels) of each object to be counted. Nevertheless, none of the models can reliably count the number of objects in complex visual scenes, showing that further research is still needed to create AI systems that can reliably deploy counting procedures in realistic environments.
[39] MMMamba: A Versatile Cross-Modal In Context Fusion Framework for Pan-Sharpening and Zero-Shot Image Enhancement cs.CVPDF
Yingying Wang, Xuanhua He, Chen Wu, Jialing Huang, Suiyun Zhang
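TL;DR: This paper proposes MMMamba, a cross-modal in-context fusion framework built on the Mamba architecture for pan-sharpening, which also supports zero-shot image super-resolution. A multimodal interleaved scanning mechanism enables efficient cross-modal information exchange while keeping computational complexity linear.
Details
Motivation: Traditional CNN-based methods rely on fixed convolutional operators and adapt poorly, while cross-attention mechanisms are computationally inefficient and may dilute fine-grained correspondences. The paper aims to exploit the complementary information between panchromatic and multispectral images more effectively.
Result: Extensive experiments across multiple tasks and benchmarks show superior performance compared with existing state-of-the-art techniques.
Insight: The novelty lies in combining the Mamba architecture with cross-modal in-context conditioning and in designing a multimodal interleaved scanning mechanism, which achieves efficient cross-modal fusion at linear complexity and offers a new efficient architecture for image fusion tasks.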
Abstract: Pan-sharpening aims to generate high-resolution multispectral (HRMS) images by integrating a high-resolution panchromatic (PAN) image with its corresponding low-resolution multispectral (MS) image. To achieve effective fusion, it is crucial to fully exploit the complementary information between the two modalities. Traditional CNN-based methods typically rely on channel-wise concatenation with fixed convolutional operators, which limits their adaptability to diverse spatial and spectral variations. While cross-attention mechanisms enable global interactions, they are computationally inefficient and may dilute fine-grained correspondences, making it difficult to capture complex semantic relationships. Recent advances in the Multimodal Diffusion Transformer (MMDiT) architecture have demonstrated impressive success in image generation and editing tasks. Unlike cross-attention, MMDiT employs in-context conditioning to facilitate more direct and efficient cross-modal information exchange. In this paper, we propose MMMamba, a cross-modal in-context fusion framework for pan-sharpening, with the flexibility to support image super-resolution in a zero-shot manner. Built upon the Mamba architecture, our design ensures linear computational complexity while maintaining strong cross-modal interaction capacity. Furthermore, we introduce a novel multimodal interleaved (MI) scanning mechanism that facilitates effective information exchange between the PAN and MS modalities. Extensive experiments demonstrate the superior performance of our method compared to existing state-of-the-art (SOTA) techniques across multiple tasks and benchmarks.
[40] SynthSeg-Agents: Multi-Agent Synthetic Data Generation for Zero-Shot Weakly Supervised Semantic Segmentation cs.CVPDF
Wangyu Wu, Zhenhong Chen, Xiaowei Huang, Fei Ma, Jimin Xiao
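TL;DR: This paper proposes SynthSeg-Agents, a multi-agent framework for Zero-Shot Weakly Supervised Semantic Segmentation (ZSWSSS). The framework requires no real images at all: LLM-driven agents, namely a Self-Refine Prompt Agent and an Image Generation Agent, generate the synthetic training data, and the approach achieves competitive performance on PASCAL VOC 2012 and COCO 2014.
Details
Motivation: Weakly Supervised Semantic Segmentation (WSSS) still depends on real-world training samples. The paper introduces ZSWSSS as a new direction, aiming to train models entirely on generated synthetic data to reduce cost and improve scalability.
Result: Experiments on PASCAL VOC 2012 and COCO 2014 show that SynthSeg-Agents achieves competitive performance without using real training images, validating its effectiveness.
Insight: The contributions are: the new ZSWSSS task direction; a multi-agent framework (Self-Refine Prompt Agent and Image Generation Agent) that combines LLMs, CLIP-based similarity, and nearest-neighbor diversity filtering to generate diverse, semantically rich synthetic data; and the use of a frozen CLIP scoring model and a ViT-based classifier to improve data quality. The work shows the potential of LLM-driven agents for low-cost, scalable semantic segmentation.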
Abstract: Weakly Supervised Semantic Segmentation (WSSS) with image level labels aims to produce pixel level predictions without requiring dense annotations. While recent approaches have leveraged generative models to augment existing data, they remain dependent on real world training samples. In this paper, we introduce a novel direction, Zero Shot Weakly Supervised Semantic Segmentation (ZSWSSS), and propose SynthSeg Agents, a multi agent framework driven by Large Language Models (LLMs) to generate synthetic training data entirely without real images. SynthSeg Agents comprises two key modules, a Self Refine Prompt Agent and an Image Generation Agent. The Self Refine Prompt Agent autonomously crafts diverse and semantically rich image prompts via iterative refinement, memory mechanisms, and prompt space exploration, guided by CLIP based similarity and nearest neighbor diversity filtering. These prompts are then passed to the Image Generation Agent, which leverages Vision Language Models (VLMs) to synthesize candidate images. A frozen CLIP scoring model is employed to select high quality samples, and a ViT based classifier is further trained to relabel the entire synthetic dataset with improved semantic precision. Our framework produces high quality training data without any real image supervision. Experiments on PASCAL VOC 2012 and COCO 2014 show that SynthSeg Agents achieves competitive performance without using real training images. This highlights the potential of LLM driven agents in enabling cost efficient and scalable semantic segmentation.
[41] Vision-based module for accurately reading linear scales in a laboratory cs.CV | cs.AIPDF
Parvesh Saini, Soumyadipta Maiti, Beena Rai
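TL;DR: This paper proposes a vision-based module for accurately reading linear scales (such as those on syringes and measuring cylinders) in a laboratory environment. The method corrects randomly oriented syringes with image transformations, narrows the region of interest to the part containing the scale, extracts features such as the major markers, their corresponding digits, and the level indicator location, and then computes the reading. System readings were compared with human readings and showed close agreement.
Details
Motivation: Current vision models perform well on tasks such as object detection and image classification, but models that can take accurate quantitative measurements from an image as a human does (for example, reading an instrument scale) are still rare. For robots to operate fully autonomously in a laboratory, they need the basic ability to read measurements from instruments.
Result: Comparing system readings against human readings of the same instances showed an accurate correspondence, indicating that the method can read linear scales accurately.
Insight: The novelty lies in mimicking how a human reads a scale: first correcting orientation and localizing the region, then extracting key features (markers, digits, level) to compute the reading. The authors designed this step-by-step, feature-driven pipeline to make the system efficient and robust for this specific task.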
Abstract: Capabilities and the number of vision-based models are increasing rapidly, and these vision models are now able to do more tasks like object detection, image classification, instance segmentation etc. with great accuracy. But models which can take accurate quantitative measurements from an image, as a human can do by just looking at it, are rare. For a robot to work with complete autonomy in a laboratory environment, it needs to have some basic skills like navigation, handling objects, preparing samples etc. to match human-like capabilities in an unstructured environment. Another important capability is to read measurements from instruments and apparatus. Here, we tried to mimic a human-inspired approach to read measurements from a linear scale. As a test case we have picked reading the level from a syringe and a measuring cylinder. For a randomly oriented syringe we carry out transformations to correct the orientation. To make the system efficient and robust, the area of interest is reduced to just the part of the image containing the linear scale. After that, a series of features were extracted, like the major markers, the corresponding digits, and the level indicator location, from which the final reading was calculated. Readings obtained using this system were also compared against human-read values of the same instances and an accurate correspondence was observed.
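The last step of the pipeline computes a reading from the major markers, their digits, and the level indicator location. The abstract does not say how; one plausible approach is linear interpolation between the two markers that bracket the level, sketched below. The function name `read_scale`, the pixel-coordinate convention, and the extrapolation rule for levels outside the marked range are assumptions.

```python
def read_scale(markers, level_px):
    """Interpolate a scale reading from detected major markers.

    markers:  list of (pixel_position, value) pairs along the scale axis,
              with distinct pixel positions (e.g. from marker and digit detection)
    level_px: pixel position of the level indicator on the same axis
    Linear interpolation is an illustrative assumption, not the paper's stated method.
    """
    pts = sorted(markers)
    if len(pts) < 2:
        raise ValueError("need at least two major markers")
    for (p0, v0), (p1, v1) in zip(pts, pts[1:]):
        if p0 <= level_px <= p1:
            break  # found the bracketing pair
    else:
        # Level lies outside the detected markers: extrapolate from the nearest pair.
        (p0, v0), (p1, v1) = (pts[0], pts[1]) if level_px < pts[0][0] else (pts[-2], pts[-1])
    return v0 + (level_px - p0) * (v1 - v0) / (p1 - p0)

# Three major markers at 100, 200, 300 px labelled 0, 1, 2 (e.g. mL).
markers = [(100, 0.0), (200, 1.0), (300, 2.0)]
reading = read_scale(markers, 250)  # halfway between the 1 and 2 markers: 1.5
```

Sorting by pixel position means the same code handles scales whose values decrease along the image axis, as on an inverted syringe.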
[42] Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics cs.CVPDF
Junjie Chen, Fei Wang, Zhihao Huang, Qing Zhou, Kun Li
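TL;DR: This paper proposes TIMAR (Turn-level Interleaved Masked AutoRegression), a framework for generating 3D conversational head dynamics. It models dialogue as interleaved audio-visual contexts, accumulates conversational history through turn-level causal attention, and uses a lightweight diffusion head to predict continuous, coordinated 3D head motion.
Details
Motivation: Existing methods often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, which hinders temporal coherence across turns. The paper addresses causality and coherence in modeling the bidirectional dynamics of 3D conversation.
Result: On the DualTalk benchmark, TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set and achieves similar gains on out-of-distribution data.
Insight: The novelty lies in the causal TIMAR framework, which fuses multimodal information within each turn, applies turn-level causal attention to model conversational history, and uses a lightweight diffusion head to generate continuous and expressive 3D head dynamics.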
Abstract: Human conversation involves continuous exchanges of speech and nonverbal cues such as head nods, gaze shifts, and facial expressions that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, hindering temporal coherence across turns. We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that captures both coordination and expressive variability. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set, and achieves similar gains on out-of-distribution data. The source code will be released in the GitHub repository https://github.com/CoderChen01/towards-seamleass-interaction.
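Turn-level causal attention can be pictured as a block-wise mask over the token sequence. The sketch below builds one such mask under the assumption that a token may attend to every token in its own turn (full fusion within the turn) and in all earlier turns, but never to later turns. This interpretation, and the function name `turn_causal_mask`, are illustrative assumptions rather than details confirmed by the abstract.

```python
def turn_causal_mask(turn_ids):
    """Build a block-causal attention mask from per-token turn indices.

    mask[i][j] is True when token i may attend to token j, i.e. when
    token j belongs to the same turn as token i or to an earlier one.
    """
    return [[tj <= ti for tj in turn_ids] for ti in turn_ids]

# Six tokens spread over three conversational turns.
turn_ids = [0, 0, 1, 1, 1, 2]
mask = turn_causal_mask(turn_ids)
# mask[0] -> [True, True, False, False, False, False]
# mask[5] -> all True (the last turn sees the whole history)
```

Compared with an ordinary token-level causal mask, this keeps attention bidirectional inside a turn while still preventing any leakage from future turns.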
[43] Emotion Recognition in Signers cs.CV | cs.AI | cs.CLPDF
Kotaro Funakoshi, Yaoxiong Zhu
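TL;DR: This paper proposes a cross-lingual approach to emotion recognition in signers, combining the Japanese Sign Language dataset eJSL with the British Sign Language dataset BOBSL to address the overlap between grammatical and affective facial expressions and the scarcity of training data. It shows that using textual emotion recognition in spoken language to mitigate data scarcity, selecting key temporal segments, and incorporating hand motion all improve recognition, and it establishes a baseline stronger than spoken-language LLMs.
Details
Motivation: Emotion recognition in signers faces a theoretical challenge, the overlap between grammatical and affective facial expressions, and a practical challenge, the scarcity of training data.
Result: The approach is validated on the eJSL and BOBSL datasets, establishing a stronger baseline than spoken-language LLMs and improving emotion recognition performance for sign language.
Insight: The contributions are the cross-lingual use of spoken-language text to mitigate data scarcity, the finding that temporal segment selection matters, and the incorporation of hand motion. Together these point to data augmentation and multimodal fusion as useful directions for emotion recognition in signers.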
Abstract: Recognition of signers’ emotions suffers from one theoretical challenge and one practical challenge, namely, the overlap between grammatical and affective facial expressions and the scarcity of data for model training. This paper addresses these two challenges in a cross-lingual setting using our eJSL dataset, a new benchmark dataset for emotion recognition in Japanese Sign Language signers, and BOBSL, a large British Sign Language dataset with subtitles. In eJSL, two signers expressed 78 distinct utterances with each of seven different emotional states, resulting in 1,092 video clips. We empirically demonstrate that 1) textual emotion recognition in spoken language mitigates data scarcity in sign language, 2) temporal segment selection has a significant impact, and 3) incorporating hand motion enhances emotion recognition in signers. Finally we establish a stronger baseline than spoken language LLMs.
[44] See It Before You Grab It: Deep Learning-based Action Anticipation in Basketball cs.CVPDF
Arnau Barrera Roy, Albert Clapés Sintes
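TL;DR: This paper introduces the task of action anticipation in basketball videos, focusing on predicting which team will secure the rebound after a shot attempt. The authors build a new dataset of 100,000 video clips, over 300 hours of footage, and more than 2,000 manually annotated rebound events, and benchmark state-of-the-art deep learning action anticipation methods on it. They also explore two complementary tasks, rebound classification and rebound spotting, showing the dataset's broad potential for basketball video understanding.
Details
Motivation: Computer vision is widely used in sports analytics for tasks such as player tracking and action localization, but anticipating outcomes before they occur in sports videos (such as which team gets a rebound) has received little attention. The study fills this gap by predicting rebound possession after a shot, supporting decision-making in real-time automated broadcasting and post-game analysis.
Result: The paper reports comprehensive baseline results on the new dataset using state-of-the-art action anticipation methods, which the authors describe as the first application of deep learning to basketball rebound prediction. The experiments show that anticipating rebound possession is feasible but inherently challenging, offering insights into predictive modeling for dynamic multi-agent sports scenarios.
Insight: The contributions are: 1) the first definition of action anticipation in basketball broadcast videos (predicting rebound possession); 2) the first large-scale, manually annotated dataset for basketball rebound prediction; 3) the first application of deep learning action anticipation methods to this problem, along with related complementary tasks. The work opens a new predictive task direction for sports video analysis, and its dataset and methods may inform temporal prediction in other dynamic team sports.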
Abstract: Computer vision and video understanding have transformed sports analytics by enabling large-scale, automated analysis of game dynamics from broadcast footage. Despite significant advances in player and ball tracking, pose estimation, action localization, and automatic foul recognition, anticipating actions before they occur in sports videos has received comparatively little attention. This work introduces the task of action anticipation in basketball broadcast videos, focusing on predicting which team will gain possession of the ball following a shot attempt. To benchmark this task, a new self-curated dataset comprising 100,000 basketball video clips, over 300 hours of footage, and more than 2,000 manually annotated rebound events is presented. Comprehensive baseline results are reported using state-of-the-art action anticipation methods, representing the first application of deep learning techniques to basketball rebound prediction. Additionally, two complementary tasks, rebound classification and rebound spotting, are explored, demonstrating that this dataset supports a wide range of video understanding applications in basketball, for which no comparable datasets currently exist. Experimental results highlight both the feasibility and inherent challenges of anticipating rebounds, providing valuable insights into predictive modeling for dynamic multi-agent sports scenarios. By forecasting team possession before rebounds occur, this work enables applications in real-time automated broadcasting and post-game analysis tools to support decision-making.
[45] VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression? cs.CV | cs.AI | cs.CLPDF
Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni
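TL;DR: This paper introduces VTCBench, the first benchmark for evaluating the long-context understanding of vision-language models under vision-text compression (VTC). It covers three subtasks (retrieval, reasoning, and memory) and finds that although mainstream VLMs decode text well, their long-context understanding of VTC-compressed information is generally poor.
Details
Motivation: Vision-text compression (VTC) can substantially reduce the token count of long texts, but its effect on the core long-context capabilities of vision-language models has not been studied in depth, so a dedicated benchmark is needed for systematic evaluation.
Result: A comprehensive evaluation of leading open-source and proprietary models on VTCBench shows that most VLMs exhibit surprisingly poor long-context understanding of VTC-compressed information and fail to capture long associations or dependencies in the context.
Insight: The novelty lies in building the first long-context understanding benchmark for the VTC setting, exposing a capability gap in VLMs when information is compressed this way and providing a foundation for designing more efficient and scalable VLMs. The multi-dimensional task design (retrieval, reasoning, memory) and VTCBench-Wild, which simulates diverse input scenarios, are also useful reference points.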
Abstract: The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model’s ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the context. This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
[46] Step-GUI Technical Report cs.CVPDF
Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng
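TL;DR: This paper proposes a self-evolving training pipeline for graphical user interface (GUI) automation, in which a Calibrated Step Reward System converts model-generated trajectories into reliable training signals, and builds the Step-GUI model family on top of it. The paper also designs the GUI-MCP protocol for standardized cross-device interfaces with privacy protection, and releases AndroidDaily, a benchmark grounded in real-world mobile usage patterns.
Details
Motivation: For multimodal large language models in GUI automation, high-quality training data is costly to acquire and annotation reliability is low. Practical deployment additionally requires standardized interfaces across devices and protection of user privacy.
Result: Step-GUI models (the 8B version in particular) reach state-of-the-art results on several benchmarks: AndroidWorld (80.2%), OSWorld (48.5%), and ScreenShot-Pro (62.6%). On the newly proposed AndroidDaily benchmark the 8B model scores 89.91% on static actions and 52.50% on end-to-end tasks. Training data annotation accuracy exceeds 90% at 10-100x lower cost.
Insight: The contributions are: 1) a self-evolving training pipeline based on the Calibrated Step Reward System that produces highly reliable training data at low cost; 2) GUI-MCP, the first hierarchical Model Context Protocol for GUI automation, which combines low-level atomic operations with high-level task delegation and supports high-privacy execution where sensitive data is processed on-device; 3) the AndroidDaily benchmark, built from real everyday usage patterns for evaluation closer to practical use.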
Abstract: Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.
[47] Evaluation of deep learning architectures for wildlife object detection: A comparative study of ResNet and Inception cs.CVPDF
Malach Obisa Amonga, Benard Osero, Edna Too
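TL;DR: This study compares two deep learning architectures, ResNet-101 and Inception v3, on wildlife object detection. After training and evaluation on a wildlife image dataset, Inception v3 is slightly ahead of ResNet-101 in classification accuracy and mean Average Precision. Both handle detection under complex conditions effectively, but both struggle with visually similar species, poor lighting, and occlusion.
Details
Motivation: Wildlife object detection is vital for biodiversity conservation, ecological monitoring, and habitat protection, but is often challenged by environmental variability, visual similarity among species, and intra-class diversity. The study evaluates how effective ResNet-101 and Inception v3 are under these conditions.
Result: On the wildlife image dataset, ResNet-101 achieves a classification accuracy of 94% and a mean Average Precision of 0.91, while Inception v3 reaches 95% and 0.92. Both perform well, with Inception v3 slightly better.
Insight: The contribution is a systematic comparison of ResNet-101 and Inception v3 on wildlife detection, which suggests that the multi-scale feature extraction Inception v3 obtains from parallel convolutions may be an advantage under complex conditions. The results give conservation-focused computer vision applications a basis for model selection and underline how architecture design affects robustness to environmental variability.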
Abstract: Wildlife object detection plays a vital role in biodiversity conservation, ecological monitoring, and habitat protection. However, this task is often challenged by environmental variability, visual similarities among species, and intra-class diversity. This study investigates the effectiveness of two individual deep learning architectures ResNet-101 and Inception v3 for wildlife object detection under such complex conditions. The models were trained and evaluated on a wildlife image dataset using a standardized preprocessing approach, which included resizing images to a maximum dimension of 800 pixels, converting them to RGB format, and transforming them into PyTorch tensors. A ratio of 70:30 training and validation split was used for model development. The ResNet-101 model achieved a classification accuracy of 94% and a mean Average Precision (mAP) of 0.91, showing strong performance in extracting deep hierarchical features. The Inception v3 model performed slightly better, attaining a classification accuracy of 95% and a mAP of 0.92, attributed to its efficient multi-scale feature extraction through parallel convolutions. Despite the strong results, both models exhibited challenges when detecting species with similar visual characteristics or those captured under poor lighting and occlusion. Nonetheless, the findings confirm that both ResNet-101 and Inception v3 are effective models for wildlife object detection tasks and provide a reliable foundation for conservation-focused computer vision applications.
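The preprocessing described above (resize to a maximum dimension of 800 pixels, then a 70:30 train/validation split) can be sketched without any imaging library. The snippet below only computes the target image size and splits a list of sample identifiers; preserving the aspect ratio, leaving small images unscaled, and shuffling with a fixed seed are assumptions the abstract does not confirm, and the function names are hypothetical.

```python
import random

def resized_shape(width, height, max_dim=800):
    """Target size so that the longer side is at most max_dim pixels.

    Assumptions: aspect ratio is preserved and smaller images are not upscaled.
    """
    longest = max(width, height)
    if longest <= max_dim:
        return width, height
    scale = max_dim / longest
    return max(1, round(width * scale)), max(1, round(height * scale))

def split_70_30(items, seed=0):
    """Shuffle deterministically and split into 70% training and 30% validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) * 7 // 10  # integer arithmetic avoids float rounding
    return items[:cut], items[cut:]

size = resized_shape(1600, 1200)     # (800, 600)
train, val = split_70_30(range(10))  # 7 training ids, 3 validation ids
```

The RGB conversion and tensor transformation mentioned in the abstract would follow the resize step in an actual PyTorch pipeline.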
[48] VAAS: Vision-Attention Anomaly Scoring for Image Manipulation Detection in Digital Forensics cs.CV | cs.MMPDF
Opeyemi Bamigbade, Mark Scanlon, John Sheppard
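TL;DR: This paper proposes VAAS (Vision-Attention Anomaly Scoring), a dual-module framework for image manipulation detection in digital forensics. It combines global attention-based anomaly estimation using Vision Transformers with patch-level self-consistency scoring derived from SegFormer embeddings, producing a continuous and interpretable anomaly score that reflects both the location and the degree of manipulation.
Details
Motivation: AI-driven image generation poses new challenges: existing methods struggle to detect visually consistent forgeries and lack an explicit measure of manipulation intensity, which limits their ability to quantify the severity of tampering.
Result: Evaluations on the DF2023 and CASIA v2.0 datasets show that VAAS achieves competitive F1 and IoU performance while improving visual explainability through attention-guided anomaly maps.
Insight: The novelty lies in combining a global attention mechanism with local self-consistency scoring to provide a continuous, interpretable anomaly score, bridging quantitative detection and human-understandable reasoning in support of transparent and reliable image integrity assessment.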
Abstract: Recent advances in AI-driven image generation have introduced new challenges for verifying the authenticity of digital evidence in forensic investigations. Modern generative models can produce visually consistent forgeries that evade traditional detectors based on pixel or compression artefacts. Most existing approaches also lack an explicit measure of anomaly intensity, which limits their ability to quantify the severity of manipulation. This paper introduces Vision-Attention Anomaly Scoring (VAAS), a novel dual-module framework that integrates global attention-based anomaly estimation using Vision Transformers (ViT) with patch-level self-consistency scoring derived from SegFormer embeddings. The hybrid formulation provides a continuous and interpretable anomaly score that reflects both the location and degree of manipulation. Evaluations on the DF2023 and CASIA v2.0 datasets demonstrate that VAAS achieves competitive F1 and IoU performance, while enhancing visual explainability through attention-guided anomaly maps. The framework bridges quantitative detection with human-understandable reasoning, supporting transparent and reliable image integrity assessment. The source code for all experiments and corresponding materials for reproducing the results are available open source.
[49] DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations cs.CVPDF
Yuxiang Shi, Zhe Li, Yanwen Wang, Hao Zhu, Xun Cao
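TL;DR: DeX-Portrait is a novel portrait animation method that generates expressive animation driven by disentangled pose and expression signals. It represents pose as an explicit global transformation and expression as an implicit latent code, and it introduces a motion trainer, a dual-branch conditioning mechanism, and a progressive hybrid classifier-free guidance to achieve high-quality disentangled control.
Details
Motivation: Existing diffusion-based portrait animation methods cannot achieve high-fidelity disentangled control of head pose and facial expression. The work addresses this gap to support expression-only or pose-only editing and animation.
Result: Experiments show that the method outperforms state-of-the-art baselines in both animation quality and disentangled controllability.
Insight: The novelty lies in disentangling pose and expression through explicit and implicit representations, respectively, together with a dedicated dual-branch condition injection mechanism and a progressive hybrid classifier-free guidance strategy. Together these give better identity consistency and independent, precise control over the two motion dimensions.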
Abstract: Portrait animation from a single source image and a driving video is a long-standing problem. Recent approaches tend to adopt diffusion-based image/video generation models for realistic and expressive animation. However, none of these diffusion models realizes high-fidelity disentangled control between the head pose and facial expression, hindering applications like expression-only or pose-only editing and animation. To address this, we propose DeX-Portrait, a novel approach capable of generating expressive portrait animation driven by disentangled pose and expression signals. Specifically, we represent the pose as an explicit global transformation and the expression as an implicit latent code. First, we design a powerful motion trainer to learn both pose and expression encoders for extracting precise and decomposed driving signals. Then we propose to inject the pose transformation into the diffusion model through a dual-branch conditioning mechanism, and the expression latent through cross attention. Finally, we design a progressive hybrid classifier-free guidance for more faithful identity consistency. Experiments show that our method outperforms state-of-the-art baselines on both animation quality and disentangled controllability.
[50] EmoCaliber: Advancing Reliable Visual Emotion Comprehension via Confidence Verbalization and Calibration cs.CVPDF
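Daiqing Wu, Dongbao Yang, Can Ma, Yu Zhou
TL;DR: This paper proposes EmoCaliber, a confidence-aware multimodal large language model (MLLM) for Visual Emotion Comprehension (VEC). Existing MLLMs treat VEC as a deterministic task and ignore the subjectivity of emotion perception. EmoCaliber uses a three-stage training framework (structured reasoning, confidence verbalization, and calibration) so that the model outputs emotion predictions together with its own confidence, which improves system reliability.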
Details
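Motivation: Existing MLLM-based VEC paradigms typically treat the task as deterministic, requiring a single, definitive emotion label for each image. This fails to account for the inherent subjectivity of emotion perception and overlooks alternative interpretations that may be equally plausible to different viewers, which limits reliability.
Result: In fair and comprehensive evaluations on the unified benchmark VECBench, EmoCaliber shows overall superiority over existing methods in both emotion prediction and confidence estimation.
Insight: The core innovation is bringing confidence verbalization and calibration to the VEC task, so that an MLLM can express uncertainty about its predictions. This offers a feasible path to greater reliability on subjective tasks. The proposed three-stage progressive training framework (structured reasoning, confidence verbalization, confidence calibration) is the means to that end.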
Abstract: Visual Emotion Comprehension (VEC) aims to infer sentiment polarities or emotion categories from affective cues embedded in images. In recent years, Multimodal Large Language Models (MLLMs) have established a popular paradigm in VEC, leveraging their generalizability to unify VEC tasks defined under diverse emotion taxonomies. While this paradigm achieves notable success, it typically formulates VEC as a deterministic task, requiring the model to output a single, definitive emotion label for each image. Such a formulation insufficiently accounts for the inherent subjectivity of emotion perception, overlooking alternative interpretations that may be equally plausible to different viewers. To address this limitation, we propose equipping MLLMs with capabilities to verbalize their confidence in emotion predictions. This additional signal provides users with an estimate of both the plausibility of alternative interpretations and the MLLMs’ self-assessed competence, thereby enhancing reliability in practice. Building on this insight, we introduce a three-stage training framework that progressively endows the model with structured reasoning, teaches it to verbalize confidence, and calibrates its confidence expression, culminating in EmoCaliber, a confidence-aware MLLM for VEC. Through fair and comprehensive evaluations on the unified benchmark VECBench, EmoCaliber demonstrates overall superiority against existing methods in both emotion prediction and confidence estimation. These results validate the effectiveness of our approach and mark a feasible step toward more reliable VEC systems. Project page: https://github.com/wdqqdw/EmoCaliber.
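To make "calibrated confidence" concrete, the sketch below computes Expected Calibration Error (ECE), a common way to measure how well stated confidences match observed accuracy. The abstract does not say which calibration metric VECBench uses, so ECE is an assumption chosen for illustration, and the function name is hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted mean of |accuracy - mean confidence| over bins.

    confidences: floats in [0, 1], one verbalized confidence per prediction.
    correct: booleans, True where the prediction matched the label.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# A model that says "95% sure" but is right only half the time is
# overconfident; its ECE is the gap between 0.95 and 0.5.
overconfident = expected_calibration_error([0.95, 0.95], [True, False])
```

A lower value means the verbalized confidence tracks actual correctness more closely.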
[51] An Efficient and Effective Encoder Model for Vision and Language Tasks in the Remote Sensing Domain cs.CVPDF
João Daniel Silva, Joao Magalhaes, Devis Tuia, Bruno Martins
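TL;DR: This paper proposes GeoMELT, an efficient encoder model for multi-task vision-and-language learning in the remote sensing domain, covering image captioning and cross-modal retrieval in particular, with the aim of reducing the computational cost of large vision-language models.
Details
Motivation: Large vision-language models applied to remote sensing have many parameters and are expensive to train and run. The work aims to give most institutions a compact, parameter-efficient model that still handles multiple tasks.
Result: On established benchmarks, GeoMELT demonstrates both efficacy and efficiency, although the abstract gives no quantitative results and does not state whether it reaches SOTA.
Insight: The novelty lies in exploring an encoder architecture for multi-task learning that spans text generation from remote sensing images and cross-modal retrieval, yielding a compact model that is a practical option in resource-constrained settings.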
Abstract: The remote sensing community has recently seen the emergence of methods based on Large Vision and Language Models (LVLMs) that can address multiple tasks at the intersection of computer vision and natural language processing. To fully exploit the potential of such models, a significant focus has been given to the collection of large amounts of training data that cover multiple remote sensing-specific tasks, such as image captioning or visual question answering. However, the cost of using and training LVLMs is high, due to the large number of parameters. While multiple parameter-efficient adaptation techniques have been explored, the computational costs of training and inference with these models can remain prohibitive for most institutions. In this work, we explore the use of encoder-only architectures and propose a model that can effectively address multi-task learning while remaining compact in terms of the number of parameters. In particular, our model tackles combinations of tasks that are not typically explored in a unified model: the generation of text from remote sensing images and cross-modal retrieval. The results of our GeoMELT model - named from Multi-task Efficient Learning Transformer - in established benchmarks confirm the efficacy and efficiency of the proposed approach.
[52] BLANKET: Anonymizing Faces in Infant Video Recordings cs.CVPDF
Ditmar Hadera, Jan Cech, Miroslav Purkrabek, Matej Hoffmann
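TL;DR: This paper proposes BLANKET, a new method for anonymizing faces in infant video recordings while preserving key facial attributes. It generates a new face compatible with the original identity using a diffusion model, then blends it into the video frames through temporally consistent face swapping.
Details
Motivation: Ethical use of video data involving human subjects, infants in particular, requires robust anonymization methods that protect privacy.
Result: The method is evaluated on a dataset of short infant videos and compared with DeepPrivacy2. Metrics cover the level of de-identification, preservation of facial attributes, impact on a downstream task (human pose estimation), and the presence of artifacts. Both methods alter the identity, and BLANKET outperforms DeepPrivacy2 in all other respects.
Insight: The novelty lies in combining diffusion-based generation with temporally consistent face swapping, which preserves facial attributes and temporal coherence during anonymization and offers a stronger privacy solution for infant video data.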
Abstract: Ensuring the ethical use of video data involving human subjects, particularly infants, requires robust anonymization methods. We propose BLANKET (Baby-face Landmark-preserving ANonymization with Keypoint dEtection consisTency), a novel approach designed to anonymize infant faces in video recordings while preserving essential facial attributes. Our method comprises two stages. First, a new random face, compatible with the original identity, is generated via inpainting using a diffusion model. Second, the new identity is seamlessly incorporated into each video frame through temporally consistent face swapping with authentic expression transfer. The method is evaluated on a dataset of short video recordings of babies and is compared to the popular anonymization method, DeepPrivacy2. Key metrics assessed include the level of de-identification, preservation of facial attributes, impact on human pose estimation (as an example of a downstream task), and presence of artifacts. Both methods alter the identity, and our method outperforms DeepPrivacy2 in all other respects. The code is available as an easy-to-use anonymization demo at https://github.com/ctu-vras/blanket-infant-face-anonym.
[53] GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models cs.CVPDF
Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen
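TL;DR: This paper proposes the GRAN-TED paradigm for generating robust, aligned, and nuanced text embeddings for diffusion models. Its contributions are: (1) TED-6K, a text-only benchmark for efficiently assessing the representational quality of text encoders without costly end-to-end training; and (2) a novel two-stage training paradigm, developed with that evaluation framework, that fine-tunes a multimodal large language model and then applies layer-wise weighting to obtain a stronger text encoder.
Details
Motivation: Text encoder development for text-to-image and text-to-video diffusion models faces two challenges: there is no efficient evaluation framework that reliably predicts downstream generation performance, and pretrained language models are hard to adapt effectively to visual synthesis.
Result: Experiments show that the GRAN-TED encoder achieves state-of-the-art performance on TED-6K and yields demonstrable gains in text-to-image and text-to-video generation.
Insight: The main innovations are: (1) a lightweight, unified adapter and the text-only TED-6K benchmark for efficient, robust assessment of text encoder quality, with scores that correlate strongly with downstream generation performance; and (2) a two-stage training paradigm that first fine-tunes a multimodal large language model for better visual representation, then extracts more nuanced and potent text features through layer-wise weighting.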
Abstract: The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder’s representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder’s effectiveness in downstream generation tasks. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our code is available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.
[54] On the Effectiveness of Textual Prompting with Lightweight Fine-Tuning for SAM3 Remote Sensing Segmentation cs.CVPDF
Roni Blushtein-Livnon, Osher Rafaeli, David Ioffe, Amir Boger, Karen Sandberg Esquenazi
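TL;DR: This paper evaluates the SAM3 concept-driven framework on remote sensing image segmentation, combining textual prompting with lightweight fine-tuning. Hybrid prompting that combines semantic and geometric cues performs best across all targets and metrics, while text-only prompting performs worst, especially for irregularly shaped targets. Lightweight fine-tuning offers a practical performance-effort trade-off for geometrically regular and visually salient targets. Performance improves from zero-shot inference to fine-tuning, with diminishing returns as the supervision scale grows.
Details
Motivation: Remote sensing image segmentation is constrained by limited annotated data and by the gap between overhead imagery and the natural images used to train foundation models, which motivates effective adaptation under limited supervision.
Result: Experiments on four target types show that hybrid prompting performs best and text-only prompting performs worst, particularly on irregular targets. Lightweight fine-tuning offers a practical performance-effort trade-off for geometrically regular and visually salient targets. Performance improves from zero-shot inference to fine-tuning, with diminishing returns as supervision increases. A persistent gap between Precision and IoU indicates that under-segmentation and boundary inaccuracies remain the main error patterns.
Insight: The novelty lies in a systematic evaluation of textual prompting and lightweight fine-tuning strategies for SAM3 in remote sensing, showing the effectiveness of hybrid prompting and the limited semantic alignment of text-only prompting. Viewed objectively, the core insight is that a modest geometric annotation effort suffices for effective adaptation, which gives data-scarce remote sensing segmentation an efficient fine-tuning recipe.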
Abstract: Remote sensing (RS) image segmentation is constrained by the limited availability of annotated data and a gap between overhead imagery and natural images used to train foundational models. This motivates effective adaptation under limited supervision. The SAM3 concept-driven framework generates masks from textual prompts without requiring task-specific modifications, which may enable this adaptation. We evaluate SAM3 for RS imagery across four target types, comparing textual, geometric, and hybrid prompting strategies, under lightweight fine-tuning scales with increasing supervision, alongside zero-shot inference. Results show that combining semantic and geometric cues yields the highest performance across targets and metrics. Text-only prompting exhibits the lowest performance, with marked score gaps for irregularly shaped targets, reflecting limited semantic alignment between SAM3 textual representations and their overhead appearances. Nevertheless, textual prompting with light fine-tuning offers a practical performance-effort trade-off for geometrically regular and visually salient targets. Across targets, performance improves between zero-shot inference and fine-tuning, followed by diminishing returns as the supervision scale increases. In other words, a modest geometric annotation effort is sufficient for effective adaptation. A persistent gap between Precision and IoU further indicates that under-segmentation and boundary inaccuracies remain prevalent error patterns in RS tasks, particularly for irregular and less prevalent targets.
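The link between a Precision-IoU gap and under-segmentation can be seen in a small worked example. The sketch below uses the standard pixel-wise definitions of precision and IoU; the function name and the toy masks are hypothetical and are not taken from the paper.

```python
def precision_and_iou(pred, truth):
    """Pixel-wise precision and IoU for two binary masks (nested lists of 0/1)."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1          # predicted and truly foreground
            elif p and not t:
                fp += 1          # predicted foreground, actually background
            elif t and not p:
                fn += 1          # missed foreground pixel
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return precision, iou


# The target fills the whole 2x4 grid, but the prediction covers only
# its left half: every predicted pixel is correct, yet half is missed.
truth = [[1, 1, 1, 1],
         [1, 1, 1, 1]]
under_segmented = [[1, 1, 0, 0],
                   [1, 1, 0, 0]]
precision, iou = precision_and_iou(under_segmented, truth)
```

Precision ignores missed pixels (false negatives) while IoU penalizes them, so a mask that is correct but too small scores high on precision and low on IoU.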
[55] FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision cs.CVPDF
Tobias Kirschstein, Simon Giebenhain, Matthias Nießner
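TL;DR: FlexAvatar is a method for creating high-quality, complete 3D head avatars from a single image. It uses a transformer-based 3D portrait animation model with learnable data source tokens (bias sinks) to address the incomplete 3D reconstructions caused by monocular training. This enables unified training on monocular and multi-view data, so that inference combines the strong generalization of monocular data with the full 3D completeness of multi-view supervision.
Details
Motivation: Creating 3D head avatars from a single image suffers from limited multi-view data and from the tendency of monocular training to yield incomplete 3D reconstructions. The root cause is the entanglement between driving signal and target viewpoint when learning from monocular videos.
Result: Extensive evaluations on single-view, few-shot, and monocular avatar creation tasks verify the efficacy of FlexAvatar, which generates complete 3D head avatars with realistic facial animation, whereas many existing methods struggle with view extrapolation.
Insight: The innovations include a transformer-based 3D portrait animation model with learnable bias-sink tokens that enables unified training on monocular and multi-view data; a training procedure that yields a smooth latent avatar space, which facilitates identity interpolation and flexible fitting to an arbitrary number of input observations; and the combination of the generalization strength of monocular data with the completeness of multi-view supervision.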
Abstract: We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and flexible fitting to an arbitrary number of input observations. In extensive evaluations on single-view, few-shot, and monocular avatar creation tasks, we verify the efficacy of FlexAvatar. Many existing methods struggle with view extrapolation while FlexAvatar generates complete 3D head avatars with realistic facial animations. Website: https://tobias-kirschstein.github.io/flexavatar/
[56] Robust Multi-view Camera Calibration from Dense Matches cs.CVPDF
Johannes Hägerlind, Bao-Long Tran, Urs Waldmann, Per-Erik Forssén
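TL;DR: This paper proposes a robust multi-view camera calibration method. By analysing the key components of a structure-from-motion (SfM) pipeline, it improves correspondence sampling and the strategy for adding views incrementally, raising the accuracy and robustness of intrinsic and extrinsic estimation, especially for cameras with strong radial distortion.
Details
Motivation: For multi-view camera setups (such as fixed camera arrays used in animal behavior studies or surveillance footage analysis), existing SfM methods still face open challenges in accuracy and robustness, particularly under strong radial distortion.
Result: In quantitative evaluation, the proposed method substantially outperforms the baseline on cameras with strong radial distortion (79.9% vs. 40.4% for vanilla VGGT), and its correspondence subsampling is also demonstrated in a global SfM setting.
Insight: The novelty lies in a systematic study of how best to subsample correspondences from a dense matcher and of the selection criteria for adding views incrementally. These design choices improve the robustness of the SfM pipeline and apply to a wide range of camera setups.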
Abstract: Estimating camera intrinsics and extrinsics is a fundamental problem in computer vision, and while advances in structure-from-motion (SfM) have improved accuracy and robustness, open challenges remain. In this paper, we introduce a robust method for pose estimation and calibration. We consider a set of rigid cameras, each observing the scene from a different perspective, which is a typical camera setup in animal behavior studies and forensic analysis of surveillance footage. Specifically, we analyse the individual components in an SfM pipeline, and identify design choices that improve accuracy. Our main contributions are: (1) we investigate how to best subsample the predicted correspondences from a dense matcher to leverage them in the estimation process; (2) we investigate selection criteria for how to add the views incrementally. In a rigorous quantitative evaluation, we show the effectiveness of our changes, especially for cameras with strong radial distortion (79.9% ours vs. 40.4% vanilla VGGT). Finally, we demonstrate our correspondence subsampling in a global SfM setting where we initialize the poses using VGGT. The proposed pipeline generalizes across a wide range of camera setups, and could thus become a useful tool for animal behavior and forensic analysis.
[57] IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning cs.CV | cs.AIPDF
Yuanhang Li, Yiren Song, Junzhe Bai, Xinran Liang, Hu Yang
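TL;DR: IC-Effect is an instruction-guided framework for few-shot video effects editing built on a diffusion Transformer (DiT). It synthesizes complex effects such as flames, particles, and cartoon characters while strictly preserving spatial and temporal consistency. It uses the source video as contextual conditioning, adopts a two-stage training strategy (general editing adaptation followed by effect-specific learning), and introduces spatiotemporal sparse tokenization for efficiency. It is validated on a paired VFX dataset spanning 15 high-quality visual styles.
Details
Motivation: Video VFX editing is hard because effects must blend seamlessly with the background, the background must remain entirely unchanged, and effect patterns must be learned efficiently from limited paired data. Existing video editing models fail to meet these requirements, which motivates IC-Effect.
Result: Extensive experiments show that IC-Effect delivers high-quality, controllable, and temporally consistent effects editing and opens new possibilities for video creation. The abstract does not name specific benchmarks or give quantitative comparisons with SOTA models.
Insight: The innovations include using the source video as contextual conditioning for precise background preservation and natural effect injection; a two-stage training strategy (general editing adaptation and Effect-LoRA) for strong instruction following and robust effect modeling; spatiotemporal sparse tokenization that substantially reduces computation while keeping high fidelity; and the release of a paired VFX editing dataset covering 15 visual styles.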
Abstract: We propose IC-Effect, an instruction-guided, DiT-based framework for few-shot video VFX editing that synthesizes complex effects (e.g., flames, particles and cartoon characters) while strictly preserving spatial and temporal consistency. Video VFX editing is highly challenging because injected effects must blend seamlessly with the background, the background must remain entirely unchanged, and effect patterns must be learned efficiently from limited paired data. However, existing video editing models fail to satisfy these requirements. IC-Effect leverages the source video as clean contextual conditions, exploiting the contextual learning capability of DiT models to achieve precise background preservation and natural effect injection. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning via Effect-LoRA, ensures strong instruction following and robust effect modeling. To further improve efficiency, we introduce spatiotemporal sparse tokenization, enabling high fidelity with substantially reduced computation. We also release a paired VFX editing dataset spanning 15 high-quality visual styles. Extensive experiments show that IC-Effect delivers high-quality, controllable, and temporally consistent VFX editing, opening new possibilities for video creation.
[58] Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift cs.CVPDF
Jiacheng Cui, Bingkui Tong, Xinyue Bi, Xiaohan Zhao, Jiacheng Liu
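TL;DR: This paper proposes HALD, a new training paradigm that reintroduces hard labels as a calibration signal to mitigate the local semantic drift that soft labels suffer when only a limited number of image crops are used, improving generalization in dataset distillation and large-scale classification.
Details
Motivation: When only a limited number of crops per image are used, teacher-generated soft labels are prone to local semantic drift: a crop may visually resemble another class, so its soft embedding deviates from the true semantics of the original image. This introduces systematic errors and a distribution misalignment between training and testing.
Result: On ImageNet-1K, the method reaches 42.7% accuracy with only 285M of storage for soft labels, outperforming the prior state-of-the-art LPLD by 9.0%. It shows consistent generalization gains on dataset distillation and conventional large-scale classification benchmarks.
Insight: The novelty lies in revisiting, and analysing theoretically, the role of hard labels in mitigating local semantic drift. Hard labels serve as a content-agnostic anchor that calibrates the drift and are combined with soft labels into a hybrid supervision that restores alignment between visual content and semantic supervision. This gives soft-label-dominated training a complementary tool.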
Abstract: Soft labels generated by teacher models have become a dominant paradigm for knowledge transfer and recent large-scale dataset distillation such as SRe2L, RDED, LPLD, offering richer supervision than conventional hard labels. However, we observe that when only a limited number of crops per image are used, soft labels are prone to local semantic drift: a crop may visually resemble another class, causing its soft embedding to deviate from the ground-truth semantics of the original image. This mismatch between local visual content and global semantic meaning introduces systematic errors and distribution misalignment between training and testing. In this work, we revisit the overlooked role of hard labels and show that, when appropriately integrated, they provide a powerful content-agnostic anchor to calibrate semantic drift. We theoretically characterize the emergence of drift under few soft-label supervision and demonstrate that hybridizing soft and hard labels restores alignment between visual content and semantic supervision. Building on this insight, we propose a new training paradigm, Hard Label for Alleviating Local Semantic Drift (HALD), which leverages hard labels as intermediate corrective signals while retaining the fine-grained advantages of soft labels. Extensive experiments on dataset distillation and large-scale conventional classification benchmarks validate our approach, showing consistent improvements in generalization. On ImageNet-1K, we achieve 42.7% with only 285M storage for soft labels, outperforming prior state-of-the-art LPLD by 9.0%. Our findings re-establish the importance of hard labels as a complementary tool, and call for a rethinking of their role in soft-label-dominated training.
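The idea of anchoring a drifted soft label with a hard label can be illustrated with a simple blended cross-entropy. This is only a sketch under stated assumptions: the abstract says hard labels act as "intermediate corrective signals" but does not give the loss, so the fixed-weight blend below, the `hard_weight` parameter, and the function names are all hypothetical simplifications rather than the HALD objective.

```python
import math


def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))


def hybrid_loss(probs, soft_label, hard_class, hard_weight=0.5):
    """Blend soft-label and hard-label cross-entropy (illustrative only)."""
    one_hot = [1.0 if i == hard_class else 0.0 for i in range(len(probs))]
    soft_term = cross_entropy(soft_label, probs)
    hard_term = cross_entropy(one_hot, probs)
    return (1.0 - hard_weight) * soft_term + hard_weight * hard_term


# A crop from a class-0 image that looks like class 1 to the teacher:
drifted_soft = [0.2, 0.8]
follows_drift = [0.2, 0.8]   # prediction that copies the drifted soft label
follows_truth = [0.8, 0.2]   # prediction that matches the image's true class

# With soft labels alone (hard_weight=0) the drifted prediction has the
# lower loss; blending in the hard label reverses the preference.
soft_only = (hybrid_loss(follows_drift, drifted_soft, 0, 0.0),
             hybrid_loss(follows_truth, drifted_soft, 0, 0.0))
blended = (hybrid_loss(follows_drift, drifted_soft, 0, 0.5),
           hybrid_loss(follows_truth, drifted_soft, 0, 0.5))
```

The hard label depends only on the source image, not on what the crop looks like, which is what makes it a content-agnostic anchor.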
[59] Stylized Synthetic Augmentation further improves Corruption Robustness cs.CV | cs.LGPDF
Georg Siedel, Rojan Regmi, Abhirami Anand, Weijia Shao, Silvia Vock
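TL;DR: This paper proposes a training data augmentation method that combines synthetic image data with neural style transfer to improve the robustness of deep vision models to common image corruptions. Although style transfer degrades the quality of synthetic images as measured by FID, the stylized images are surprisingly beneficial for training. A systematic empirical analysis shows that stylization and synthetic data complement each other and work well alongside rule-based augmentation such as TrivialAugment. The method achieves state-of-the-art robustness on several small-scale image classification benchmarks.
Details
Motivation: Deep vision models are vulnerable to common image corruptions such as noise and blur. The work uses data augmentation to improve robustness.
Result: The method reaches robust accuracies of 93.54%, 74.9%, and 50.86% on CIFAR-10-C, CIFAR-100-C, and TinyImageNet-C, respectively, which is state-of-the-art (SOTA) performance.
Insight: The novelty lies in combining synthetic data with neural style transfer into a complementary augmentation strategy. Viewed objectively, the combination exploits the domain shift introduced by stylization and improves generalization even though image quality drops, and it is compatible with some rule-based augmentation methods but not others.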
Abstract: This paper proposes a training data augmentation pipeline that combines synthetic image data with neural style transfer in order to address the vulnerability of deep vision models to common corruptions. We show that although applying style transfer on synthetic images degrades their quality with respect to the common FID metric, these images are surprisingly beneficial for model training. We conduct a systematic empirical analysis of the effects of both augmentations and their key hyperparameters on the performance of image classifiers. Our results demonstrate that stylization and synthetic data complement each other well and can be combined with popular rule-based data augmentation techniques such as TrivialAugment, while not working with others. Our method achieves state-of-the-art corruption robustness on several small-scale image classification benchmarks, reaching 93.54%, 74.9% and 50.86% robust accuracy on CIFAR-10-C, CIFAR-100-C and TinyImageNet-C, respectively.
[60] Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning cs.CVPDF
Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng
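TL;DR: This paper proposes Skyra, a specialized multimodal large language model (MLLM) for detecting AI-generated videos. It identifies human-perceivable visual artifacts in videos and uses them as grounded evidence for both detection and explanation, going beyond conventional binary classification.
Details
Motivation: Misuse of AI-driven video generation has raised serious social concerns. Most existing detection methods are limited to binary classification and lack human-understandable explanations, so reliable and explainable AI-generated video detectors are needed.
Result: Extensive experiments, including on ViF-Bench (3K high-quality samples generated by more than ten state-of-the-art video generators), show that Skyra surpasses existing methods across multiple benchmarks.
Insight: The novelty lies in using visual artifacts as explainable, grounded evidence for detection, in constructing ViF-CoT-4K, the first large-scale AI-generated video artifact dataset with fine-grained human annotations, and in a two-stage training strategy that systematically improves spatio-temporal artifact perception, explanation capability, and detection accuracy.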
Abstract: The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model’s spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.
[61] VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression cs.CVPDF
Kyle Sargent, Ruiqi Gao, Philipp Henzler, Charles Herrmann, Aleksander Holynski
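TL;DR: This paper proposes VLIC (Vision-Language Models for Image Compression), a diffusion-based image compression system. It uses the zero-shot visual reasoning of vision-language models (VLMs) to emulate binary human judgments of image quality, replacing conventional perceptual loss functions so that compression aligns better with human perceptual preferences.
Details
Motivation: In image compression evaluation, distortion functions such as MSE are poorly aligned with human perceptual preferences, and existing methods rely on perceptual losses built from neural networks calibrated on large-scale datasets of human psycho-visual judgments. The paper finds that state-of-the-art VLMs can replicate binary human choice judgments zero-shot, and it explores using this ability to guide compression model training for more direct human alignment.
Result: According to perceptual metrics and large-scale user studies, VLIC calibrated on VLM judgments achieves competitive or state-of-the-art performance on human-aligned visual compression, depending on the dataset.
Insight: The novelty lies in using VLMs as zero-shot perceptual judges directly in preference-based post-training of a diffusion model, which avoids distilling VLM judgments into a separate perceptual loss network. This suggests a way to use the general visual reasoning of large models to guide optimization of specific tasks such as image compression.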
Abstract: Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception. In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights. More visuals are available at https://kylesargent.github.io/vlic
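The claim that VLMs "replicate" human 2AFC judgments implies comparing two sets of binary choices over the same image pairs. A minimal sketch of such a comparison is the agreement rate below. The abstract does not describe the evaluation protocol, so this metric and the function name are assumptions for illustration.

```python
def two_afc_agreement(human_choices, model_choices):
    """Fraction of 2AFC trials where the model picks the same image as the human.

    Each choice is 0 or 1, the index of the image judged closer to the
    reference in that trial.
    """
    if len(human_choices) != len(model_choices):
        raise ValueError("choice lists must have the same length")
    if not human_choices:
        return 0.0
    matches = sum(1 for h, m in zip(human_choices, model_choices) if h == m)
    return matches / len(human_choices)


# Four trials; the model disagrees with the human on the third.
agreement = two_afc_agreement([0, 1, 1, 0], [0, 1, 0, 0])
```

Because each trial has two options, a judge that guesses at random would be expected to agree about half the time, so agreement well above 0.5 is the signal of interest.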
[62] End-to-End Training for Autoregressive Video Diffusion via Self-Resampling cs.CVPDF
Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei
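TL;DR: This paper proposes Resampling Forcing, an end-to-end training framework that addresses exposure bias in autoregressive video diffusion models. A self-resampling scheme simulates inference-time model errors during training, and a sparse causal mask enables parallel training. The paper also introduces history routing, a parameter-free mechanism for efficient long-horizon video generation.
Details
Motivation: Autoregressive video diffusion models suffer from exposure bias caused by the train-test mismatch. Existing remedies typically rely on post-training with a bidirectional teacher model or an online discriminator. The work aims for a teacher-free, end-to-end solution.
Result: Experiments show performance comparable to distillation-based baselines, with superior temporal consistency on longer videos owing to native-length training.
Insight: The core innovation is a teacher-free, end-to-end training framework that mitigates exposure bias by simulating inference errors through self-resampling. A sparse causal mask and dynamic history routing preserve temporal causality while enabling efficient parallel training and long-horizon generation.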
Abstract: Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
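The history routing step, retrieving the top-k most relevant history frames for a query without learned parameters, can be sketched as a similarity ranking. The abstract does not say how relevance is scored, so the dot-product similarity over per-frame feature vectors, the temporal re-ordering of the result, and the name `route_history` are assumptions for illustration.

```python
def route_history(query, history, k=2):
    """Return indices of the k history frames most similar to the query.

    query: feature vector for the current frame (list of floats).
    history: list of feature vectors, one per past frame, oldest first.
    Relevance is a plain dot product, so no learned weights are involved.
    Selected indices are returned in temporal order.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    ranked = sorted(range(len(history)),
                    key=lambda i: dot(query, history[i]),
                    reverse=True)
    return sorted(ranked[:k])


# Four past frames; frames 0 and 3 point in nearly the same direction
# as the query, so they are the two that get routed.
history = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
selected = route_history([1.0, 0.0], history, k=2)
```

Attending to only k selected frames keeps the per-query cost bounded as the history grows, which is the efficiency motive the abstract gives for long-horizon generation.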
[63] GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection cs.CVPDF
Yu Wang, Juhyung Ha, Frangil M. Ramirez, Yuchen Wang, David J. Crandall
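TL;DR: This paper proposes GateFusion, a new architecture for Active Speaker Detection (ASD). It combines strong pretrained unimodal encoders with a novel Hierarchical Gated Fusion Decoder (HiGate), which uses learnable, bimodally-conditioned gates to adaptively inject cross-modal contextual features at multiple layers of the Transformer backbone for progressive, multi-depth fusion. Two auxiliary objectives, Masked Alignment Loss and Over-Positive Penalty, further strengthen multimodal learning. The method sets new state-of-the-art results on several ASD benchmarks.
Details
Motivation: Late fusion, used by most existing ASD methods, often fails to capture fine-grained cross-modal interactions, which are critical for robust performance in unconstrained scenarios.
Result: The method sets new state-of-the-art results on several challenging ASD benchmarks: 77.8% mAP on Ego4D-ASD (+9.4%), 86.1% mAP on UniTalk (+2.9%), and 96.1% mAP on WASD (+0.5%), with competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments and ablations confirm the model's generalization and the contribution of each component.
Insight: The core innovation is the Hierarchical Gated Fusion Decoder (HiGate), which uses learnable gates to fuse cross-modal features progressively and adaptively at multiple Transformer layers. The two auxiliary objectives, Masked Alignment Loss (aligning unimodal outputs with multimodal predictions) and Over-Positive Penalty (suppressing spurious video-only activations), are strategies that other multimodal learning tasks could borrow.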
Abstract: Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1% mAP (+0.5%) on Ego4D-ASD, UniTalk, and WASD benchmarks, respectively, and delivering competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the generalization of our model, while comprehensive ablations show the complementary benefits of each component.
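A gate conditioned on both modalities that controls how much of one modality is injected into the other can be sketched for a single feature vector as follows. The abstract does not give HiGate's equations, so the sigmoid gate over the concatenated features, the residual form `visual + gate * audio`, and all names here are assumptions for illustration rather than the paper's layer.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(visual, audio, gate_weights, gate_bias):
    """Inject audio features into visual features through a per-dimension gate.

    visual, audio: feature vectors of the same length d.
    gate_weights: d rows of 2*d weights applied to the concatenation
                  [visual; audio], so each gate sees both modalities.
    gate_bias: d bias terms.
    """
    joint = list(visual) + list(audio)  # bimodal conditioning input
    fused = []
    for i, (v, a) in enumerate(zip(visual, audio)):
        logit = sum(w * x for w, x in zip(gate_weights[i], joint)) + gate_bias[i]
        gate = sigmoid(logit)           # in (0, 1): how much audio to admit
        fused.append(v + gate * a)      # residual injection (assumed form)
    return fused


# With zero weights and zero bias every gate is 0.5, so half of the
# audio feature is added to the visual feature.
zero_w = [[0.0] * 4 for _ in range(2)]
half_open = gated_fuse([1.0, 2.0], [2.0, 4.0], zero_w, [0.0, 0.0])
```

In a hierarchical design, a step like this would be repeated at several depths of the backbone with separate gate parameters, so that each layer decides for itself how much cross-modal context to admit.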
[64] Multi-View Foundation Models cs.CVPDF
Leo Segre, Or Hirschorn, Shai Avidan
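TL;DR: This paper proposes a way to convert a single-view foundation model into a multi-view foundation model. By adding intermediate 3D-aware attention layers to the Transformer architecture, the model takes multiple images of the same 3D scene and outputs feature maps whose features are consistent across views, improving feature matching in multi-view tasks.
Details
Motivation: Existing foundation models (such as DINO, SAM, and CLIP) typically process a single RGB image. In multi-view settings their features for the same 3D point can be inconsistent, which limits their usefulness in multi-view applications. The work targets multi-view feature consistency without building a complex 3D feature model.
Result: Quantitative experiments show that the method improves feature matching considerably over current foundation models. It is also demonstrated on surface normal estimation and multi-view segmentation.
Insight: The novelty lies in augmenting Transformer-based foundation models with 3D-aware attention layers to align features across views. The approach operates directly in image space and does not require an explicit 3D model, offering an efficient solution for multi-view computer vision tasks.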
Abstract: Foundation models are vital tools in various Computer Vision applications. They take as input a single RGB image and output a deep feature representation that is useful for various applications. However, in case we have multiple views of the same 3D scene, they operate on each image independently and do not always produce consistent features for the same 3D point. We propose a way to convert a Foundation Model into a Multi-View Foundation Model. Such a model takes as input a set of images and outputs a feature map for each image such that the features of corresponding points are as consistent as possible. This approach bypasses the need to build a consistent 3D model of the features and allows direct manipulation in the image space. Specifically, we show how to augment Transformers-based foundation models (i.e., DINO, SAM, CLIP) with intermediate 3D-aware attention layers that help match features across different views. As leading examples, we show surface normal estimation and multi-view segmentation tasks. Quantitative experiments show that our method improves feature matching considerably compared to current foundation models.
[65] Gaussian Pixel Codec Avatars: A Hybrid Representation for Efficient Rendering cs.CV | cs.GRPDF
Divam Gupta, Anuj Pahuja, Nemanja Bartolovic, Tomas Simon, Forrest Iandola
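TL;DR: This paper presents Gaussian Pixel Codec Avatars (GPiCA), photorealistic head avatars that are generated from multi-view images and rendered efficiently on mobile devices. The method uses a hybrid representation that combines a triangle mesh with anisotropic 3D Gaussians to maximize memory and rendering efficiency while keeping a photorealistic appearance. The mesh efficiently represents surface areas such as facial skin, and the 3D Gaussians handle non-surface areas such as hair and beard. A unified differentiable rendering pipeline renders the mesh as a semi-transparent layer within the volumetric rendering paradigm of 3D Gaussian Splatting.
Details
Motivation: Rendering photorealistic head avatars efficiently on mobile devices is hard, especially with complex appearance such as hair. Purely mesh-based methods struggle with non-surface detail, and purely Gaussian-based methods render less efficiently.
Result: GPiCA achieves the realism of purely Gaussian-based avatars while matching the rendering performance of mesh-based avatars.
Insight: The novelty lies in the hybrid triangle mesh and 3D Gaussian representation and in the unified differentiable rendering pipeline that integrates the mesh into the volumetric rendering framework, balancing appearance quality and rendering efficiency. The hybrid strategy suggests a general approach to efficient rendering of complex geometry and appearance.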
Abstract: We present Gaussian Pixel Codec Avatars (GPiCA), photorealistic head avatars that can be generated from multi-view images and efficiently rendered on mobile devices. GPiCA utilizes a unique hybrid representation that combines a triangle mesh and anisotropic 3D Gaussians. This combination maximizes memory and rendering efficiency while maintaining a photorealistic appearance. The triangle mesh is highly efficient in representing surface areas like facial skin, while the 3D Gaussians effectively handle non-surface areas such as hair and beard. To this end, we develop a unified differentiable rendering pipeline that treats the mesh as a semi-transparent layer within the volumetric rendering paradigm of 3D Gaussian Splatting. We train neural networks to decode a facial expression code into three components: a 3D face mesh, an RGBA texture, and a set of 3D Gaussians. These components are rendered simultaneously in a unified rendering engine. The networks are trained using multi-view image supervision. Our results demonstrate that GPiCA achieves the realism of purely Gaussian-based avatars while matching the rendering performance of mesh-based avatars.
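To make the "mesh as a semi-transparent layer" idea concrete, the following is a minimal per-pixel sketch of front-to-back alpha compositing, the blending rule used in volumetric rendering. It assumes each pixel receives a list of fragments `(depth, alpha, rgb)`, where one fragment comes from the textured mesh and the others from Gaussians. The fragment layout, the names, and the example values are illustrative assumptions, not the paper's implementation.

```python
from typing import List, Tuple

# (depth, alpha, rgb): a hypothetical per-pixel fragment, either a Gaussian
# contribution or the RGBA texel of the mesh at this pixel.
Fragment = Tuple[float, float, Tuple[float, float, float]]


def composite_pixel(fragments: List[Fragment],
                    background=(0.0, 0.0, 0.0)):
    """Blend fragments front to back: C = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed by nearer fragments
    for _depth, alpha, rgb in sorted(fragments, key=lambda f: f[0]):
        weight = transmittance * alpha
        for c in range(3):
            color[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
    # Whatever light is left reaches the background.
    for c in range(3):
        color[c] += transmittance * background[c]
    return tuple(color)


# Illustrative values: a hair Gaussian in front of the skin mesh, another behind it.
gaussian_front = (1.0, 0.5, (0.2, 0.1, 0.0))
mesh_texel = (2.0, 0.8, (0.9, 0.7, 0.6))
gaussian_back = (3.0, 0.5, (0.0, 0.0, 1.0))
pixel = composite_pixel([gaussian_back, mesh_texel, gaussian_front])
```

Because the mesh fragment is blended with the same rule as the Gaussians, gradients can flow to both representations through one expression, which is presumably what makes a unified differentiable pipeline convenient.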
[66] DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models cs.CVPDF
Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu
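TL;DR: DiffusionVL is a family of diffusion vision language models (dVLMs) translated from existing autoregressive (AR) models. Simple fine-tuning is enough to shift the paradigm, yielding substantial gains on several benchmarks together with faster inference.
Details
Motivation: Existing dVLMs lag behind mainstream autoregressive models because the underlying diffusion language models are less capable. The paper asks whether high-performing dVLMs can instead be built on existing strong AR models.
Result: With less than 5% of the training data required by prior methods, DiffusionVL gains 34.4% on MMMU-Pro (vision) and 37.5% on MME (Cog.), with a 2x inference speedup. Directly converting an AR language model to a dVLM is competitive with LLaVA-style visual instruction tuning.
Insight: The contributions are: (1) an effective conversion of AR models to the diffusion paradigm; (2) a block-decoding design that supports arbitrary-length generation and KV cache reuse to speed up inference; (3) evidence that cross-paradigm transfer is feasible, which offers a new way to reuse existing AR models.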
Abstract: In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive (AR) paradigm, owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of the diffusion vision language model (dVLM) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: Is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that can be translated from any powerful AR model. Through simple fine-tuning, we successfully adapt AR pre-trained models into the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual-instruction-tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct a large number of experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, with a 34.4% gain on the MMMU-Pro (vision) bench and a 37.5% gain on the MME (Cog.) bench, alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.
[67] In Pursuit of Pixel Supervision for Visual Pre-training cs.CVPDF
Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang
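TL;DR: This paper presents Pixio, an enhanced masked autoencoder (MAE) for visual pre-training from pixel-level supervision. The model is trained on 2 billion web-crawled images using a self-curation strategy with minimal human curation. Pixio performs competitively on a range of downstream tasks (monocular depth estimation, feed-forward 3D reconstruction, semantic segmentation, and robot learning), matching or outperforming DINOv3 trained at similar scales.
Details
Motivation: The paper explores the potential of pixels as the basic source of visual information. It argues that the classical autoencoder paradigm remains competitive for self-supervised learning, and it aims to learn strong visual representations from pixels through harder pre-training tasks and more capable architectures.
Result: Pixio performs well on several in-the-wild downstream tasks, including monocular depth estimation (e.g., Depth Anything) and feed-forward 3D reconstruction (MapAnything), where it matches or outperforms DINOv3 trained at similar scales.
Insight: The claimed contribution is an enhanced MAE (Pixio) that combines more challenging pre-training tasks, more capable architectures, and a large-scale self-curated data strategy. Viewed objectively, the core insight is that pixel-space self-supervised learning can serve as a promising alternative and complement to latent-space approaches (such as the DINO family), which underscores that learning representations directly from raw pixels remains effective.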
Abstract: At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed “Pixio”, is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images with a self-curation strategy with minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.
[68] Spatia: Video Generation with Updatable Spatial Memory cs.CV | cs.AIPDF
Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu
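TL;DR: This paper proposes Spatia, a video generation framework with updatable spatial memory. It explicitly maintains a 3D scene point cloud as persistent spatial memory and iteratively updates it through visual SLAM to improve long-term spatial and temporal consistency in generated video.
Details
Motivation: Existing video generation models struggle to maintain long-term spatial and temporal consistency because video signals are dense and high-dimensional. Spatia addresses this with a spatial memory mechanism.
Result: The abstract reports no specific quantitative results or benchmarks. It claims that the framework improves spatial consistency and supports applications such as explicit camera control and 3D-aware interactive editing.
Insight: The novelty is a dynamic-static disentanglement design that separates the spatial memory (a static 3D point cloud) from the generation of dynamic entities and updates the memory iteratively through visual SLAM, providing a geometrically grounded framework for scalable, memory-driven video generation.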
Abstract: Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model’s ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
cs.RO [Back]
[69] HERO: Hierarchical Traversable 3D Scene Graphs for Embodied Navigation Among Movable Obstacles cs.RO | cs.AI | cs.CL | cs.CVPDF
Yunheng Wang, Yixiao Feng, Yuetong Fang, Shuning Zhang, Tan Jing
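TL;DR: This paper proposes HERO, a framework that builds hierarchical traversable 3D scene graphs and models operable obstacles as pathways, enabling more efficient embodied navigation among movable obstacles.
Details
Motivation: Existing 3D scene graph methods rely on a static-world assumption and treat interactable obstacles as non-traversable, which leads to limited reachability and low efficiency in real-world scenes.
Result: Relative to its baseline, HERO reduces path length (PL) by 35.1% in partially obstructed environments and increases success rate (SR) by 79.4% in fully obstructed ones.
Insight: The definition of traversability is extended to include the physical interactivity and functional semantics of operable obstacles, as well as the scene's relational hierarchy, which moves beyond the static-scene assumption.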
Abstract: 3D Scene Graphs (3DSGs) constitute a powerful representation of the physical world, distinguished by their abilities to explicitly model the complex spatial, semantic, and functional relationships between entities, rendering a foundational understanding that enables agents to interact intelligently with their environment and execute versatile behaviors. Embodied navigation, as a crucial component of such capabilities, leverages the compact and expressive nature of 3DSGs to enable long-horizon reasoning and planning in complex, large-scale environments. However, prior works rely on a static-world assumption, defining traversable space solely based on static spatial layouts and thereby treating interactable obstacles as non-traversable. This fundamental limitation severely undermines their effectiveness in real-world scenarios, leading to limited reachability, low efficiency, and inferior extensibility. To address these issues, we propose HERO, a novel framework for constructing Hierarchical Traversable 3DSGs, that redefines traversability by modeling operable obstacles as pathways, capturing their physical interactivity, functional semantics, and the scene’s relational hierarchy. The results show that, relative to its baseline, HERO reduces PL by 35.1% in partially obstructed environments and increases SR by 79.4% in fully obstructed ones, demonstrating substantially higher efficiency and reachability.
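The effect of treating operable obstacles as pathways can be illustrated with a toy planner. The sketch below runs Dijkstra's algorithm on a small room graph in which some edges are blocked by an operable obstacle and carry an extra interaction cost. The graph, the cost values, and the `allow_operable` switch are hypothetical and only contrast the static-world assumption with a traversability definition that admits operable obstacles; they are not HERO's scene-graph construction.

```python
import heapq


def shortest_path_cost(edges, start, goal, allow_operable):
    """Dijkstra over an undirected graph.

    Each edge is (u, v, length, operable_cost). operable_cost is None for a
    free passage, otherwise the assumed cost of opening or moving the obstacle.
    """
    graph = {}
    for u, v, length, operable_cost in edges:
        if operable_cost is not None and not allow_operable:
            continue  # static-world assumption: an obstructed edge is not traversable
        cost = length + (operable_cost or 0.0)
        graph.setdefault(u, []).append((v, cost))
        graph.setdefault(v, []).append((u, cost))

    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, cost in graph.get(node, []):
            cand = dist + cost
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(queue, (cand, nxt))
    return None  # goal is unreachable


# Illustrative scene: a closed door shortcut and a storage room behind a movable box.
edges = [
    ("hall", "kitchen", 4.0, None),
    ("kitchen", "office", 6.0, None),
    ("hall", "office", 3.0, 2.0),
    ("office", "storage", 2.0, 1.5),
]
```

In this toy scene the static planner takes the long way to the office and cannot reach the storage room at all, while the planner that admits operable edges finds a shorter route and reaches both goals. This mirrors the two effects the abstract reports: shorter paths in partially obstructed scenes and higher success in fully obstructed ones.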
[70] MiVLA: Towards Generalizable Vision-Language-Action Model with Human-Robot Mutual Imitation Pre-training cs.RO | cs.CVPDF
Zhenhan Yin, Xuanhan Wang, Jiahao Jiang, Kaiyuan Deng, Pengqi Chen
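TL;DR: This paper proposes MiVLA, a vision-language-action model pre-trained with human-robot mutual imitation. It exploits the behavioral similarity between human hands and robotic arms to build unified behavioral priors, addressing the limited generalization of existing VLAs caused by mismatches in camera views, visual appearance, and embodiment morphology.
Details
Motivation: Because real-world robot data is scarce, existing VLAs rely on human videos and simulated data, but mismatches in viewpoint, appearance, and morphology limit their generalization. A method is needed that combines the behavioral fidelity of human data with the manipulative diversity of simulated robot data.
Result: In experiments on simulation and real-world platforms with three robots (ARX, PiPer, and LocoMan), MiVLA outperforms state-of-the-art VLAs (e.g., π₀, π₀.₅, and H-RDT) by 25% in simulation and by 14% on real-world robot control tasks.
Insight: The novelty is a human-robot mutual imitation pre-training framework that uses kinematic rules with left/right hand coordinate systems to align human and robot action spaces in both directions. It integrates the behavioral fidelity of human data and the manipulative diversity of simulated robot data into one model, improving generalization across embodiments and scenes.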
Abstract: While leveraging abundant human videos and simulated robot data offers a scalable solution to the scarcity of real-world robot data, the generalization capability of existing vision-language-action models (VLAs) remains limited by mismatches in camera views, visual appearance, and embodiment morphologies. To overcome this limitation, we propose MiVLA, a generalizable VLA empowered by human-robot mutual imitation pre-training, which leverages inherent behavioral similarity between human hands and robotic arms to build a foundation of strong behavioral priors for both human actions and robotic control. Specifically, our method utilizes kinematic rules with left/right hand coordinate systems for bidirectional alignment between human and robot action spaces. Given human or simulated robot demonstrations, MiVLA is trained to forecast behavior trajectories for one embodiment, and imitate behaviors for another one unseen in the demonstration. Based on this mutual imitation, it integrates the behavioral fidelity of real-world human data with the manipulative diversity of simulated robot data into a unified model, thereby enhancing the generalization capability for downstream tasks. Extensive experiments conducted on both simulation and real-world platforms with three robots (ARX, PiPer and LocoMan) demonstrate that MiVLA achieves substantially improved generalization capability, outperforming state-of-the-art VLAs (e.g., $\boldsymbol{\pi}_{0}$, $\boldsymbol{\pi}_{0.5}$ and H-RDT) by 25% in simulation, and 14% in real-world robot control tasks.
[71] mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs cs.RO | cs.AI | cs.CV | cs.LGPDF
Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees
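TL;DR: This paper proposes mimic-video, a Video-Action Model (VAM) that targets a weakness of existing Vision-Language-Action models (VLAs) in robot manipulation: lacking an explicit understanding of physical dynamics, they depend heavily on large-scale expert demonstrations. The model pairs a pretrained Internet-scale video model with a flow matching-based action decoder that generates low-level robot actions from video latent representations, decoupling semantics and visual dynamics from low-level control.
Details
Motivation: VLAs pretrained on large-scale static web data generalize better semantically but lack an explicit understanding of physical causality and temporal dependencies. Policy learning therefore relies heavily on robot trajectories and requires continuous, large-scale expert data collection to compensate. The paper instead uses video to capture semantics and visual dynamics jointly during pretraining, reducing the dependence on robot expert data.
Result: Extensive evaluation on simulated and real-world robotic manipulation tasks shows state-of-the-art performance, with 10x better sample efficiency and 2x faster convergence than traditional VLA architectures.
Insight: The core contribution is the Video-Action Model (VAM) paradigm: a pretrained video model serves as a joint prior over physical dynamics and semantics, and a flow matching-based Inverse Dynamics Model (IDM) decodes actions, decoupling high-level video planning from low-level robot control. This points toward robot policies that are more data-efficient and have better physical understanding.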
Abstract: Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
cs.LG [Back]
[72] Task Matrices: Linear Maps for Cross-Model Finetuning Transfer cs.LG | cs.CL | cs.CVPDF
Darrin O’Brien, Dhikshith Gajulapalli, Eric Xia
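TL;DR: This paper introduces task matrices: linear transformations from a base model's embedding state to the finetuned model's embedding state. Across vision and text models and ten datasets, a base model augmented with a task matrix surpasses linear probes and sometimes approaches finetuned performance. This supports the existence of cross-layer linear encodings between pretrained and finetuned architectures, and a data-based approximation of these encodings is both efficient and generalizable across domains.
Details
Motivation: Interpretability work has found implicit linear encodings when models are biased by in-context prompting. The paper asks whether similar linear representations exist in more general adaptation regimes, specifically between pretrained and finetuned models.
Result: Across vision and text models and ten datasets, a base model augmented with a task matrix outperforms linear probes and sometimes approaches finetuned levels, supporting the existence of cross-layer linear encodings.
Insight: The novelty is the task matrix, a linear map intended to transfer finetuning across models, together with evidence that a data-based approximation of it is efficient and generalizes. This offers a linear perspective on model adaptation and transfer learning.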
Abstract: Results in interpretability suggest that large vision and language models learn implicit linear encodings when models are biased by in-context prompting. However, the existence of similar linear representations in more general adaptation regimes has not yet been demonstrated. In this work, we develop the concept of a task matrix, a linear transformation from a base to finetuned embedding state. We demonstrate that for vision and text models and ten different datasets, a base model augmented with a task matrix achieves results surpassing linear probes, sometimes approaching finetuned levels. Our results validate the existence of cross-layer linear encodings between pretrained and finetuned architectures. Moreover, we show that a data-based approximation for such encodings is both efficient and generalizable to multiple domains. We make our implementation publicly available.
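As a rough illustration of what a "linear transformation from a base to finetuned embedding state" could look like in practice, the sketch below fits a matrix `W` by least squares so that base embeddings times `W` approximate finetuned embeddings. The abstract only mentions a "data-based approximation", so the least-squares fit, the small ridge term, the function names, and the toy data are all assumptions rather than the authors' procedure. The code uses plain Python lists to stay dependency-free.

```python
def fit_task_matrix(base, tuned, ridge=1e-8):
    """Return W (d x k) minimising ||base @ W - tuned||^2 via the normal equations.

    base:  n rows of d-dimensional base-model embeddings
    tuned: n rows of k-dimensional finetuned-model embeddings for the same inputs
    ridge: small diagonal term, an assumption added for numerical stability
    """
    n, d, k = len(base), len(base[0]), len(tuned[0])
    # Build the augmented system [X^T X + ridge * I | X^T Y].
    aug = []
    for i in range(d):
        row = [sum(base[r][i] * base[r][j] for r in range(n)) for j in range(d)]
        row[i] += ridge
        row += [sum(base[r][i] * tuned[r][j] for r in range(n)) for j in range(k)]
        aug.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(d):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]


def apply_task_matrix(embedding, W):
    """Map one base embedding into the (approximate) finetuned space."""
    return [sum(embedding[i] * W[i][j] for i in range(len(W)))
            for j in range(len(W[0]))]


# Toy data generated from a known linear map [[2, 0], [1, 3]].
base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
tuned = [[2.0, 0.0], [1.0, 3.0], [3.0, 3.0], [5.0, 3.0]]
W = fit_task_matrix(base, tuned)
mapped = apply_task_matrix([1.0, 1.0], W)
```

In a real setting the rows would be embeddings of the same inputs extracted from the base and finetuned models, and the mapped embedding would feed the finetuned model's head. Which layer to extract from is not specified in the abstract.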
[73] Prompt Repetition Improves Non-Reasoning LLMs cs.LG | cs.AI | cs.CLPDF
Yaniv Leviathan, Matan Kalman, Yossi Matias
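TL;DR: The paper finds that, when reasoning is not used, repeating the input prompt improves the performance of popular large language models, including Gemini, GPT, Claude, and Deepseek, without increasing the number of generated tokens or latency.
Details
Motivation: To find a simple, effective way to improve large language model performance when reasoning is not used, without increasing the number of generated tokens or latency.
Result: The paper reports improved performance on several mainstream models. The abstract does not mention specific benchmarks or comparisons with SOTA.
Insight: The contribution is showing that prompt repetition, a simple engineering trick that adds no generated tokens or latency, helps in non-reasoning settings, which offers a practical new angle on improving model performance.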
Abstract: When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.
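The abstract does not say how the prompt is repeated, so the following is only a minimal sketch of the idea: the same prompt is concatenated with itself before being sent to a model. The separator, the default repetition count, and the function name are assumptions.

```python
def repeat_prompt(prompt: str, times: int = 2, separator: str = "\n\n") -> str:
    """Return the prompt repeated `times` times, joined by `separator`.

    Only the input grows; the number of tokens the model generates is unchanged.
    """
    if times < 1:
        raise ValueError("times must be at least 1")
    return separator.join([prompt] * times)


# Illustrative usage with a made-up question.
doubled = repeat_prompt("Which planet has the most moons?")
```

The repeated string would then be passed as the user message to whichever model API is in use, with reasoning disabled.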
[74] DreamPRM-Code: Function-as-Step Process Reward Model with Label Correction for LLM Coding cs.LG | cs.AI | cs.CLPDF
Ruiyi Zhang, Peijia Qin, Qi Cao, Pengtao Xie
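TL;DR: DreamPRM-Code is a process reward model (PRM) focused on code generation. It treats functions as reasoning steps, uses a Chain-of-Function prompting strategy to induce modular code generation, and introduces a meta-learning-based label correction mechanism to reduce noise in the training labels. It achieves state-of-the-art performance under test-time scaling.
Details
Motivation: Existing PRMs are of limited use for coding because code lacks meaningful step decompositions and Monte-Carlo-generated partial labels are noisy.
Result: DreamPRM-Code reaches a pass@1 rate of 80.9 on LiveCodeBench, surpassing OpenAI o4-mini and setting a new state of the art.
Insight: The novelty is modeling functions as reasoning steps and correcting labels through meta-learning. This gives process reward modeling for code a new direction, and both the modular decomposition and the noise-handling strategy are reusable ideas.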
Abstract: Process Reward Models (PRMs) have become essential for improving Large Language Models (LLMs) via test-time scaling, yet their effectiveness in coding remains limited due to the lack of meaningful step decompositions in code and the noise of Monte-Carlo-generated partial labels. We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps using a Chain-of-Function prompting strategy to induce modular code generation, enabling PRM training and application analogous to mathematical reasoning tasks. To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels. Applied to test-time scaling, DreamPRM-Code achieves state-of-the-art performance on LiveCodeBench with a pass@1 rate of 80.9, surpassing OpenAI o4-mini.
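The function-as-step idea can be sketched with Python's standard `ast` module: a candidate solution is split into its top-level functions, each function is scored as one step, and the step scores are aggregated to rank candidates for best-of-N selection. Here `step_scorer` is a hypothetical stand-in for a trained PRM, and taking the minimum step score is an assumed aggregation rule that the abstract does not specify.

```python
import ast


def split_into_function_steps(source: str):
    """Return the source text of each top-level function, in order."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node)
            for node in tree.body
            if isinstance(node, ast.FunctionDef)]


def score_solution(source: str, step_scorer) -> float:
    """Aggregate per-function scores; the minimum is an assumed choice."""
    steps = split_into_function_steps(source)
    scores = [step_scorer(step) for step in steps]
    return min(scores) if scores else 0.0


def select_best(candidates, step_scorer) -> str:
    """Best-of-N selection: keep the candidate with the highest aggregate score."""
    return max(candidates, key=lambda src: score_solution(src, step_scorer))


# Illustrative modular solution with two function steps.
example = (
    "def parse(x):\n"
    "    return int(x)\n"
    "\n"
    "def solve(x):\n"
    "    return parse(x) + 1\n"
)
steps = split_into_function_steps(example)
```

A real step scorer would be a model call that returns the probability that a function is correct given the problem and the preceding functions.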
[75] The Semantic Illusion: Certified Limits of Embedding-Based Hallucination Detection in RAG Systems cs.LG | cs.AI | cs.CLPDF
Debu Sinha
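TL;DR: This paper studies the fundamental limits of embedding-based hallucination detection in retrieval-augmented generation (RAG) systems. Applying conformal prediction, the author finds that although high coverage with a zero false positive rate is achievable on synthetic data, embedding-based methods (including state-of-the-art models) show unacceptably high false positive rates on several real hallucination benchmarks, whereas GPT-4 as an LLM judge performs far better, indicating that the task is solvable through reasoning. The author calls this phenomenon the "semantic illusion".
Details
Motivation: RAG systems still hallucinate even though they are grounded in retrieved evidence. Current detection methods rely on semantic similarity and natural language inference, but their fundamental limitations have not been rigorously characterized. The paper uses conformal prediction to quantify how well these detectors can perform.
Result: With calibration sets of approximately 600 examples, the method achieves 94% coverage with a 0% false positive rate (FPR) on synthetic hallucinations (Natural Questions). On three real hallucination benchmarks, embedding-based methods (including OpenAI text-embedding-3-large and cross-encoder models) show FPRs of 100% on HaluEval, 88% on RAGTruth, and 50% on WikiBio. By contrast, GPT-4 as an LLM judge reaches an FPR of only 7% (95% CI: [3.4%, 13.7%]) on the same data.
Insight: The core contributions are: (1) applying conformal prediction to hallucination detection, which gives finite-sample coverage guarantees and therefore a precise measure of detection capability; (2) identifying the "semantic illusion", in which semantically plausible hallucinations stay similar to the source documents while introducing factual errors that embeddings cannot see; (3) showing empirically that the limitation persists across embedding architectures, LLM generators, and task types, whereas a reasoning-based LLM judge such as GPT-4 is a more viable option. This is an important warning for production RAG deployments.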
Abstract: Retrieval-Augmented Generation (RAG) systems remain susceptible to hallucinations despite grounding in retrieved evidence. Current detection methods rely on semantic similarity and natural language inference (NLI), but their fundamental limitations have not been rigorously characterized. We apply conformal prediction to hallucination detection, providing finite-sample coverage guarantees that enable precise quantification of detection capabilities. Using calibration sets of approximately 600 examples, we achieve 94% coverage with 0% false positive rate on synthetic hallucinations (Natural Questions). However, on three real hallucination benchmarks spanning multiple LLMs (GPT-4, ChatGPT, GPT-3, Llama-2, Mistral), embedding-based methods - including state-of-the-art OpenAI text-embedding-3-large and cross-encoder models - exhibit unacceptable false positive rates: 100% on HaluEval, 88% on RAGTruth, and 50% on WikiBio. Crucially, GPT-4 as an LLM judge achieves only 7% FPR (95% CI: [3.4%, 13.7%]) on the same data, proving the task is solvable through reasoning. We term this the “semantic illusion”: semantically plausible hallucinations preserve similarity to source documents while introducing factual errors invisible to embeddings. This limitation persists across embedding architectures, LLM generators, and task types, suggesting embedding-based detection is insufficient for production RAG deployment.
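The link between a coverage guarantee and the false positive rate can be sketched with split conformal calibration. The sketch assumes each response has a similarity score against the retrieved context (lower means more suspicious), calibrates a threshold on known hallucinations so that a new hallucination is flagged with probability at least `1 - alpha` under exchangeability, and then measures how many faithful responses fall under the same threshold. The score definition, the toy numbers, and the function names are assumptions, not the paper's pipeline.

```python
import math


def conformal_threshold(halluc_scores, alpha=0.05):
    """Threshold t such that flagging `score <= t` covers a new hallucination
    with probability >= 1 - alpha (split conformal, exchangeable scores)."""
    n = len(halluc_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(halluc_scores)[k - 1]


def false_positive_rate(faithful_scores, threshold):
    """Fraction of faithful responses that the detector would wrongly flag."""
    flagged = sum(1 for s in faithful_scores if s <= threshold)
    return flagged / len(faithful_scores)


# Toy similarity scores. Faithful answers sit close to the retrieved context.
faithful = [0.80 + 0.01 * i for i in range(10)]
# Synthetic hallucinations are off-topic, so their similarity is low.
synthetic_halluc = [0.10 + 0.01 * i for i in range(19)]
# Plausible hallucinations stay close to the source: the "semantic illusion".
plausible_halluc = [0.78 + 0.01 * i for i in range(19)]

fpr_synthetic = false_positive_rate(faithful, conformal_threshold(synthetic_halluc, 0.1))
fpr_plausible = false_positive_rate(faithful, conformal_threshold(plausible_halluc, 0.1))
```

With well-separated synthetic scores the threshold sits far below every faithful score, so nothing is falsely flagged. When hallucinations are as similar to the context as faithful answers, the threshold required for the same coverage swallows the faithful responses too, which is the failure pattern the abstract describes.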
[76] INFORM-CT: INtegrating LLMs and VLMs FOR Incidental Findings Management in Abdominal CT cs.LG | cs.AI | cs.CV | eess.IVPDF
Idan Tankel, Nir Mazor, Rafi Brada, Christina LeBedis, Guy ben-Yosef
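TL;DR: This paper proposes INFORM-CT, a framework that integrates large language models (LLMs) and foundational vision-language models (VLMs) in a plan-and-execute agentic approach to automate the detection, classification, and reporting of incidental findings in abdominal CT scans, with the goal of improving efficiency and precision.
Details
Motivation: Manual inspection of CT scans for incidental findings by radiologists is time-consuming and variable. The paper automates the management of these findings in line with established clinical guidelines.
Result: Fully automatic end-to-end experiments on an abdominal CT benchmark covering three organs show that the framework outperforms existing pure VLM-based approaches in accuracy and efficiency.
Insight: The novelty is using an LLM as a planner that generates executable scripts, which executors such as VLMs and segmentation models then run. The result is a programmable agentic framework for structured, guideline-based automation.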
Abstract: Incidental findings in CT scans, though often benign, can have significant clinical implications and should be reported following established guidelines. Traditional manual inspection by radiologists is time-consuming and variable. This paper proposes a novel framework that leverages large language models (LLMs) and foundational vision-language models (VLMs) in a plan-and-execute agentic approach to improve the efficiency and precision of incidental findings detection, classification, and reporting for abdominal CT scans. Given medical guidelines for abdominal organs, the process of managing incidental findings is automated through a planner-executor framework. The planner, based on LLM, generates Python scripts using predefined base functions, while the executor runs these scripts to perform the necessary checks and detections, via VLMs, segmentation models, and image processing subroutines. We demonstrate the effectiveness of our approach through experiments on a CT abdominal benchmark for three organs, in a fully automatic end-to-end manner. Our results show that the proposed framework outperforms existing pure VLM-based approaches in terms of accuracy and efficiency.
eess.IV [Back]
[77] Artificial Intelligence for the Assessment of Peritoneal Carcinosis during Diagnostic Laparoscopy for Advanced Ovarian Cancer eess.IV | cs.AI | cs.CVPDF
Riccardo Oliva, Farahdiba Zarin, Alice Zampolini Faustini, Armine Vardazaryan, Andrea Rosati
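TL;DR: This study develops a deep learning model that automatically assesses peritoneal carcinosis in patients with advanced ovarian cancer from diagnostic laparoscopy videos and predicts the Fagotti score and the indication to surgery. The aim is a standardized intraoperative assessment of tumor burden to support clinical decision-making.
Details
Motivation: The Fagotti score is central to treatment planning in advanced ovarian cancer, but its assessment is subjective and operator-dependent, which limits reproducibility and widespread use. The study aims to overcome these limits with an objective, automated AI tool.
Result: In the development dataset, segmentation reached Dice scores of 70±3% for anatomical structures and 56±3% for peritoneal carcinosis. In the development (n=101) and independent test (n=50) datasets respectively, video-level anatomical station classification reached F1-scores of 74±3% and 73±4%, Fagotti score prediction showed normalized RMSE values of 1.39±0.18 and 1.15±0.08, and indication-to-surgery prediction reached F1-scores of 80±8% and 80±2%.
Insight: According to the authors, this is the first AI model to predict the feasibility of cytoreductive surgery from laparoscopy videos while providing an automated Fagotti score. The novelty is automating the video analysis pipeline (relevant-frame identification, structure segmentation, and lesion segmentation) and combining its outputs into video-level clinical scoring and decision prediction, which opens a path to standardized, reproducible intraoperative assessment.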
Abstract: Advanced Ovarian Cancer (AOC) is often diagnosed at an advanced stage with peritoneal carcinosis (PC). Fagotti score (FS) assessment at diagnostic laparoscopy (DL) guides treatment planning by estimating surgical resectability, but its subjective and operator-dependent nature limits reproducibility and widespread use. Videos of patients undergoing DL with concomitant FS assessments at a referral center were retrospectively collected and divided into a development dataset, for data annotation, AI training and evaluation, and an independent test dataset, for internal validation. In the development dataset, FS-relevant frames were manually annotated for anatomical structures and PC. Deep learning models were trained to automatically identify FS-relevant frames, segment structures and PC, and predict video-level FS and indication to surgery (ItS). AI performance was evaluated using Dice score for segmentation, F1-scores for anatomical stations (AS) and ItS prediction, and root mean square error (RMSE) for final FS estimation. In the development dataset, the segmentation model, trained on 7,311 frames, achieved Dice scores of 70$\pm$3% for anatomical structures and 56$\pm$3% for PC. Video-level AS classification achieved F1-scores of 74$\pm$3% and 73$\pm$4%, FS prediction showed normalized RMSE values of 1.39$\pm$0.18 and 1.15$\pm$0.08, and ItS reached F1-scores of 80$\pm$8% and 80$\pm$2% in the development (n=101) and independent test datasets (n=50), respectively. This is the first AI model to predict the feasibility of cytoreductive surgery providing automated FS estimation from DL videos. Its reproducible and reliable performance across datasets suggests that AI can support surgeons through standardized intraoperative tumor burden assessment and clinical decision-making in AOC.
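For readers less familiar with the three metrics named in the abstract, the sketch below gives their standard definitions on flat Python lists. It computes a plain RMSE, because the abstract does not state how its "normalized RMSE" is normalized. The function names and the toy inputs in the usage note are illustrative.

```python
import math


def dice_score(pred, target):
    """Dice = 2 * |P intersect T| / (|P| + |T|) for flat binary masks (0/1 values)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


def f1_score(pred, target):
    """F1 = 2 * TP / (2 * TP + FP + FN) for binary labels."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2.0 * tp / denom


def rmse(pred, target):
    """Root mean square error between predicted and reference scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))
```

For example, `dice_score([1, 1, 0, 0], [1, 0, 0, 0])` evaluates to 2/3, and on binary inputs Dice and F1 coincide because both reduce to 2TP / (2TP + FP + FN).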
cs.MM [Back]
[78] A Preprocessing Framework for Video Machine Vision under Compression cs.MM | cs.CVPDF
Fei Zhao, Mengxi Guo, Shijie Zhao, Junlin Li, Li Zhang
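TL;DR: This paper proposes a video preprocessing framework for machine vision tasks. A neural preprocessor retains the information that matters for downstream tasks, and a differentiable virtual codec constrains rate and distortion during training, improving rate-accuracy performance. The method applies directly to real-world scenarios, and experiments show bitrate savings of over 15% compared with the standard codec anchor.
Details
Motivation: Existing video coding optimization mainly minimizes distortion according to human perceptual metrics and overlooks the heightened demands of machine vision systems, so a preprocessing framework designed for machine vision tasks is needed.
Result: Extensive experiments on two typical downstream tasks with various backbone networks show bitrate savings of over 15% compared with using only the standard codec anchor.
Insight: The novelty is a neural preprocessor designed to retain task-critical information for machine vision, combined with a differentiable virtual codec that enables end-to-end training, so the solution is easy to deploy in real-world scenarios.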
Abstract: There has been a growing trend in compressing and transmitting videos from terminals for machine vision tasks. Nevertheless, most video coding optimization methods focus on minimizing distortion according to human perceptual metrics, overlooking the heightened demands posed by machine vision systems. In this paper, we propose a video preprocessing framework tailored for machine vision tasks to address this challenge. The proposed method incorporates a neural preprocessor that retains crucial information for subsequent tasks, boosting rate-accuracy performance. We further introduce a differentiable virtual codec to provide constraints on rate and distortion during the training stage. At test time, we directly apply widely used standard codecs, so our solution can be easily applied to real-world scenarios. We conducted extensive experiments evaluating our compression method on two typical downstream tasks with various backbone networks. The experimental results indicate that our approach can save over 15% of bitrate compared to using only the standard codec anchor version.
q-fin.CP [Back]
[79] PyFi: Toward Pyramid-like Financial Image Understanding for VLMs via Adversarial Agents q-fin.CP | cs.AI | cs.CVPDF
Yuqun Zhang, Yuxuan Zhao, Sijia Chen
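TL;DR: This paper proposes PyFi, a framework that lets vision language models (VLMs) understand financial images progressively, from simple to complex, through pyramid-like question chains. At its core is PyFi-600K, a dataset of 600K financial question-answer pairs organized in layers by reasoning difficulty. The dataset is synthesized automatically, without human annotation, by PyFi-adv, a multi-agent adversarial mechanism based on Monte Carlo Tree Search. With this dataset the authors run fine-grained, hierarchical, and comprehensive evaluations of advanced VLMs in the financial domain, and fine-tuning Qwen2.5-VL models on the pyramid-structured question chains substantially improves their ability to handle complex financial questions.
Details
Motivation: Existing VLMs lack structured, progressive evaluation and training data for reasoning over financial images. The paper aims to improve model understanding of complex financial visual information and expert-level reasoning.
Result: Advanced VLMs were evaluated on PyFi-600K. Fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B raised their average accuracy on the dataset by 19.52% and 8.06% respectively, showing that training on pyramid-like question chains is effective.
Insight: The contributions are: (1) a pyramid-structured paradigm for building financial visual understanding datasets, with questions layered from basic perception to expert reasoning; (2) PyFi-adv, a multi-agent adversarial mechanism based on Monte Carlo Tree Search that generates scalable synthetic data and avoids costly human annotation; (3) fine-tuning on progressive question chains so that models answer complex questions by decomposing them, which offers a new approach to evaluating and improving domain-specific VLM capability.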
Abstract: This paper proposes PyFi, a novel framework for pyramid-like financial image understanding that enables vision language models (VLMs) to reason through question chains in a progressive, simple-to-complex manner. At the core of PyFi is PyFi-600K, a dataset comprising 600K financial question-answer pairs organized into a reasoning pyramid: questions at the base require only basic perception, while those toward the apex demand increasing levels of capability in financial visual understanding and expertise. This data is scalable because it is synthesized without human annotations, using PyFi-adv, a multi-agent adversarial mechanism under the Monte Carlo Tree Search (MCTS) paradigm, in which, for each image, a challenger agent competes with a solver agent by generating question chains that progressively probe deeper capability levels in financial visual reasoning. Leveraging this dataset, we present fine-grained, hierarchical, and comprehensive evaluations of advanced VLMs in the financial domain. Moreover, fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B on the pyramid-structured question chains enables these models to answer complex financial questions by decomposing them into sub-questions with gradually increasing reasoning demands, yielding average accuracy improvements of 19.52% and 8.06%, respectively, on the dataset. All resources of code, dataset and models are available at: https://github.com/AgenticFinLab/PyFi .
cs.AI [Back]
[80] ChatGPT and Gemini participated in the Korean College Scholastic Ability Test – Earth Science I cs.AI | cs.CL | cs.CYPDF
Seok-Hyun Ga, Chun-Yen Chang
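TL;DR: This study uses the Earth Science I section of the 2025 Korean College Scholastic Ability Test (CSAT) to analyze the multimodal scientific reasoning capabilities and cognitive limitations of state-of-the-art large language models, including GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro. Three experimental conditions were designed (full-page input, individual item input, and optimized multimodal input). Unstructured inputs caused significant performance degradation, and models showed fundamental reasoning flaws even under optimized conditions.
Details
Motivation: As more students use AI for assignments, concerns about academic integrity and the validity of assessments are growing. The study uses a standardized exam to analyze the multimodal scientific reasoning and cognitive limits of advanced LLMs, in order to address unauthorized AI use in coursework.
Result: Quantitative results show that unstructured (full-page) inputs led to significant performance degradation because of segmentation and OCR failures. Even under optimized multimodal input, models exhibited fundamental reasoning flaws.
Insight: The study identifies key cognitive weaknesses of LLMs in scientific reasoning: (1) a "Perception-Cognition Gap", in which models recognize visual data but fail to interpret the symbolic meaning of schematic diagrams; (2) a "Calculation-Conceptualization Discrepancy", in which models perform calculations but fail to apply the underlying scientific concepts; (3) "Process Hallucination", in which models skip visual verification and rely on plausible but unfounded background knowledge. These weaknesses give actionable cues for designing "AI-resistant questions" that separate genuine student competency from AI-generated answers.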
Abstract: The rapid development of Generative AI is bringing innovative changes to education and assessment. As the prevalence of students utilizing AI for assignments increases, concerns regarding academic integrity and the validity of assessments are growing. This study utilizes the Earth Science I section of the 2025 Korean College Scholastic Ability Test (CSAT) to deeply analyze the multimodal scientific reasoning capabilities and cognitive limitations of state-of-the-art Large Language Models (LLMs), including GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro. Three experimental conditions (full-page input, individual item input, and optimized multimodal input) were designed to evaluate model performance across different data structures. Quantitative results indicated that unstructured inputs led to significant performance degradation due to segmentation and Optical Character Recognition (OCR) failures. Even under optimized conditions, models exhibited fundamental reasoning flaws. Qualitative analysis revealed that “Perception Errors” were dominant, highlighting a “Perception-Cognition Gap” where models failed to interpret symbolic meanings in schematic diagrams despite recognizing visual data. Furthermore, models demonstrated a “Calculation-Conceptualization Discrepancy,” successfully performing calculations while failing to apply the underlying scientific concepts, and “Process Hallucination,” where models skipped visual verification in favor of plausible but unfounded background knowledge. Addressing the challenge of unauthorized AI use in coursework, this study provides actionable cues for designing “AI-resistant questions” that target these specific cognitive vulnerabilities. By exploiting AI’s weaknesses, such as the gap between perception and cognition, educators can distinguish genuine student competency from AI-generated responses, thereby ensuring assessment fairness.
[81] Explaining the Reasoning of Large Language Models Using Attribution Graphs cs.AI | cs.CLPDF
Chase Walker, Rickard Ewetz
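TL;DR: This paper proposes CAGE (Context Attribution via Graph Explanations), a framework that explains the reasoning of large language models (LLMs) by building a directed attribution graph that preserves causality and row stochasticity. The graph quantifies how each generated token is influenced by the prompt and by all previously generated tokens, addressing the failure of existing context attribution methods to account for inter-generational influence.
Details
Motivation: LLMs are highly capable, but their reasoning is opaque, which raises safety and trust concerns. Existing context attribution methods relate generated tokens directly to the prompt and discard the influence between generations, so their explanations are incomplete.
Result: Across multiple models, datasets, metrics, and methods, CAGE improves context attribution faithfulness, with average gains of up to 40%.
Insight: The core contribution is a structured attribution graph that models and quantifies the inter-generational dependencies in LLM generation and computes attributions by marginalizing intermediate contributions along paths in the graph, which gives a more complete and faithful framework for explaining LLM reasoning.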
Abstract: Large language models (LLMs) exhibit remarkable capabilities, yet their reasoning remains opaque, raising safety and trust concerns. Attribution methods, which assign credit to input features, have proven effective for explaining the decision making of computer vision models. From these, context attributions have emerged as a promising approach for explaining the behavior of autoregressive LLMs. However, current context attributions produce incomplete explanations by directly relating generated tokens to the prompt, discarding inter-generational influence in the process. To overcome these shortcomings, we introduce the Context Attribution via Graph Explanations (CAGE) framework. CAGE introduces an attribution graph: a directed graph that quantifies how each generation is influenced by both the prompt and all prior generations. The graph is constructed to preserve two properties: causality and row stochasticity. The attribution graph allows context attributions to be computed by marginalizing intermediate contributions along paths in the graph. Across multiple models, datasets, metrics, and methods, CAGE improves context attribution faithfulness, achieving average gains of up to 40%.
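One way to read "marginalizing intermediate contributions along paths" is the recursion sketched below. Each generated token has a row of non-negative weights over the prompt tokens and the earlier generated tokens, summing to 1 (row stochasticity) and never referring to later tokens (causality). Influence that reaches the prompt through earlier generations is folded back onto the prompt. This recursion, the data layout, and the example weights are an interpretation under stated assumptions, not necessarily CAGE's exact formulation.

```python
def prompt_attributions(graph, n_prompt):
    """Fold path contributions back onto the prompt.

    graph[t] holds the weights of generated token t over
    [prompt tokens..., generated tokens 0..t-1], so its length is n_prompt + t.
    Returns, for each generated token, a distribution over prompt tokens only.
    """
    totals = []
    for t, row in enumerate(graph):
        if len(row) != n_prompt + t:
            raise ValueError("row %d must have %d weights" % (t, n_prompt + t))
        attribution = list(row[:n_prompt])  # direct edges from the prompt
        for s, weight in enumerate(row[n_prompt:]):
            # Indirect influence: token t depends on token s, which in turn
            # depends on the prompt according to totals[s].
            for i in range(n_prompt):
                attribution[i] += weight * totals[s][i]
        totals.append(attribution)
    return totals


# Illustrative graph: two prompt tokens, three generated tokens.
graph = [
    [0.7, 0.3],
    [0.2, 0.2, 0.6],
    [0.1, 0.1, 0.3, 0.5],
]
result = prompt_attributions(graph, n_prompt=2)
```

Because every row sums to 1, each resulting attribution also sums to 1, so the influence that flowed through earlier generations is redistributed rather than discarded.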
cs.IR [Back]
[82] Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models cs.IR | cs.AI | cs.CV | cs.LG | cs.MMPDF
Mikel Williams-Lekuona, Georgina Cosma
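TL;DR: This paper proposes ICAR (Image Complexity-Aware Retrieval), which lets the vision transformer in a vision-language model adapt its compute to image complexity: simple images use less compute, while complex images are processed through the full network depth. Dual-path training keeps image embeddings from different processing depths compatible with text embeddings in the same semantic space, enabling direct cross-modal matching without reranking. The authors also develop ConvNeXt-IC, an image complexity classifier that decides how much compute to allocate.
Details
Motivation: Vision transformers in existing vision-language models spend the same compute on every image (e.g., 175.33 GFLOPs for ViT-L/14), which wastes computation on simple images. The paper allocates compute dynamically according to image complexity to improve efficiency.
Result: On standard benchmarks augmented with real-world web data, ICAR achieves a 20% practical speedup while maintaining category-level performance and 95% of instance-level performance. ConvNeXt-IC reaches state-of-the-art performance on image complexity assessment, with a Pearson correlation of 0.959 with human judgement and a 4.4x speedup.
Insight: The contributions are: (1) dual-path training that solves cross-modal alignment across compute paths and keeps embeddings compatible; (2) framing complexity assessment as a classification task, so that modern classifier backbones rather than specialised architectures give efficient and accurate complexity prediction; (3) single-stage adaptive retrieval that avoids the reranking overhead of existing two-stage approaches.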
Abstract: Vision transformers in vision-language models apply uniform computational effort across all images, expending 175.33 GFLOPs (ViT-L/14) whether analysing a straightforward product photograph or a complex street scene. We propose ICAR (Image Complexity-Aware Retrieval), which enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both reduced-compute and full-compute processing. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance with 0.959 correlation with human judgement (Pearson) and 4.4x speedup. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% practical speedup while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.