Table of Contents
- cs.CL [Total: 23]
- cs.CV [Total: 84]
- physics.ins-det [Total: 1]
- eess.IV [Total: 1]
- cs.CY [Total: 1]
- cs.RO [Total: 4]
- cs.LG [Total: 2]
- cs.SE [Total: 1]
- cs.IR [Total: 1]
cs.CL
[1] Watermarks for Embeddings-as-a-Service Large Language Models
Anudeex Shetty
Main category: cs.CL
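TL;DR: This work studies watermarks for LLM-based Embeddings-as-a-Service (EaaS), shows that existing watermarks can be defeated by paraphrasing the input text, and proposes a new linear-transformation watermark (WET) that is more robust.
Details
Motivation: EaaS offers text embeddings from large language models but faces the risk of model cloning. Existing watermarks cannot withstand input-paraphrasing attacks, so a more robust solution is needed.
Contribution: 1. Exposes a vulnerability in existing EaaS watermarks: paraphrasing the input text bypasses them. 2. Proposes WET, a new linear-transformation watermark, and demonstrates its robustness to paraphrasing attacks.
Method: WET embeds the watermark by applying a linear transformation to the embedding vectors and verifies it by applying the reverse transformation. Experiments show strong robustness to paraphrasing attacks.
Result: WET achieves near-perfect verifiability under paraphrasing attacks, clearly outperforming existing methods. Ablation studies assess the importance of each component and hyperparameter.
Insight: 1. Paraphrasing the input is an effective attack on existing EaaS watermarks. 2. A linear transformation is a simple yet effective way to embed a robust watermark.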
Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities in natural language understanding and generation. Based on these LLMs, businesses have started to provide Embeddings-as-a-Service (EaaS), offering feature extraction capabilities (in the form of text embeddings) that benefit downstream natural language processing tasks. However, prior research has demonstrated that EaaS is vulnerable to imitation attacks, where an attacker clones the service’s model in a black-box manner without access to the model’s internal workings. In response, watermarks have been added to the text embeddings to protect the intellectual property of EaaS providers by allowing them to check for model ownership. This thesis focuses on defending against imitation attacks by investigating EaaS watermarks. To achieve this goal, we unveil novel attacks and propose and validate new watermarking techniques. Firstly, we show that existing EaaS watermarks can be removed through paraphrasing the input text when attackers clone the model during imitation attacks. Our study illustrates that paraphrasing can effectively bypass current state-of-the-art EaaS watermarks across various attack setups (including different paraphrasing techniques and models) and datasets in most instances. This demonstrates a new vulnerability in recent EaaS watermarking techniques. Subsequently, as a countermeasure, we propose a novel watermarking technique, WET (Watermarking EaaS with Linear Transformation), which employs linear transformation of the embeddings. Watermark verification is conducted by applying a reverse transformation and comparing the similarity between recovered and original embeddings. We demonstrate its robustness against paraphrasing attacks with near-perfect verifiability. We conduct detailed ablation studies to assess the significance of each component and hyperparameter in WET.
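The verification step described in this abstract (apply a reverse transformation, then compare recovered and original embeddings) can be illustrated with a toy sketch. Everything specific here is an assumption for illustration: the 2-D embedding, the use of a rotation matrix as the secret key, and the function names `watermark` and `verify` are hypothetical and are not taken from the thesis.

```python
import math

def rotation(theta):
    """2x2 rotation matrix, used here as a stand-in secret key (assumption)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def watermark(embedding, key):
    """Embed the watermark by linearly transforming the embedding."""
    return matvec(key, embedding)

def verify(original, watermarked, key):
    """Undo the transformation and compare with the original embedding.

    The inverse of a rotation is its transpose; a general key matrix
    would need a proper inverse or pseudo-inverse.
    """
    recovered = matvec(transpose(key), watermarked)
    return cosine(recovered, original)

secret = rotation(0.7)
embedding = [1.0, 0.0]
served = watermark(embedding, secret)
```

With the correct key, `verify(embedding, served, secret)` is close to 1, while a different key such as `rotation(2.0)` gives a much lower similarity, which is the signal a provider would threshold on.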
[2] Alleviating Choice Supportive Bias in LLM with Reasoning Dependency Generation
Nan Zhuang,Wenshuo Wang,Lekai Qian,Yuxiao Wang,Boyu Cao,Qi Liu
Main category: cs.CL
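TL;DR: This paper proposes RDG, a novel framework that mitigates choice-supportive bias (CSB) in LLMs by generating balanced reasoning QA pairs, substantially improving model objectivity.
Details
Motivation: Prior studies show that LLMs exhibit choice-supportive bias (CSB) during evaluation, favoring the options they have chosen and undermining the objectivity of AI-assisted decision making. Existing debiasing methods mostly target social biases, while the mitigation of cognitive biases remains underexplored.
Contribution: 1. Presents RDG, the first solution targeting CSB. 2. Automatically generates balanced reasoning QA pairs that explicitly handle the dependencies between choices, evidence, and justifications. 3. Shows experimentally that RDG substantially reduces this bias in LLMs.
Method: The Reasoning Dependency Generation (RDG) framework generates large-scale QA pairs, comprising Contextual Dependency Data and Dependency Decouple Data, for fine-tuning LLMs.
Result: LLMs fine-tuned with RDG improve by 81.5% in memory-based experiments and by 94.3% in the evaluation-based experiment, while maintaining performance on the standard BBQ benchmark.
Insight: The method offers a new approach to mitigating cognitive biases in LLMs and helps make AI decision support more reliable.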
Abstract: Recent studies have demonstrated that some Large Language Models exhibit choice-supportive bias (CSB) when performing evaluations, systematically favoring their chosen options and potentially compromising the objectivity of AI-assisted decision making. While existing debiasing approaches primarily target demographic and social biases, methods for addressing cognitive biases in LLMs remain largely unexplored. In this work, we present the first solution to address CSB through Reasoning Dependency Generation (RDG), a novel framework for generating unbiased reasoning data to mitigate choice-supportive bias through fine-tuning. RDG automatically constructs balanced reasoning QA pairs, explicitly (un)modeling the dependencies between choices, evidence, and justifications. Our approach is able to generate a large-scale dataset of QA pairs across domains, incorporating Contextual Dependency Data and Dependency Decouple Data. Experiments show that LLMs fine-tuned on RDG-generated data demonstrate an 81.5% improvement in memory-based experiments and a 94.3% improvement in the evaluation-based experiment, while maintaining similar performance on standard BBQ benchmarks. This work pioneers an approach for addressing cognitive biases in LLMs and contributes to the development of more reliable AI-assisted decision support systems.
[3] Modeling Topics and Sociolinguistic Variation in Code-Switched Discourse: Insights from Spanish-English and Spanish-Guaraní
Nemika Tyagi,Nelvin Licona Guevara,Olga Kellert
Main category: cs.CL
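TL;DR: This paper presents a method that uses large language models (LLMs) to automatically annotate sociolinguistic and topical features in bilingual corpora, validated in two settings: Spanish-English and Spanish-Guaraní.
Details
Motivation: The study aims to replace traditional manual annotation with automatic annotation in order to analyze the sociolinguistic features of bilingual corpora efficiently, particularly in low-resource settings such as Spanish-Guaraní.
Contribution: 1) Develops an LLM-based automated annotation pipeline; 2) shows that sociolinguistic and topical analysis is feasible on large bilingual corpora; 3) reveals systematic relationships between gender, language dominance, and discourse-pragmatic function.
Method: LLMs automatically annotate 3,691 code-switched sentences; demographic metadata from the Miami Bilingual Corpus is integrated, and new topic annotations are added to the Spanish-Guaraní dataset.
Result: The results show links between gender, language dominance, and discourse function in the Miami data, and a clear diglossic division between Guaraní and Spanish in Paraguayan texts. These findings extend traditional sociolinguistic observations.
Insight: LLMs can reliably capture sociolinguistic patterns that traditionally required manual annotation, providing new computational methods for cross-linguistic and low-resource bilingual research.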
Abstract: This study presents an LLM-assisted annotation pipeline for the sociolinguistic and topical analysis of bilingual discourse in two typologically distinct contexts: Spanish-English and Spanish-Guaraní. Using large language models, we automatically labeled topic, genre, and discourse-pragmatic functions across a total of 3,691 code-switched sentences, integrated demographic metadata from the Miami Bilingual Corpus, and enriched the Spanish-Guaraní dataset with new topic annotations. The resulting distributions reveal systematic links between gender, language dominance, and discourse function in the Miami data, and a clear diglossic division between formal Guaraní and informal Spanish in Paraguayan texts. These findings replicate and extend earlier interactional and sociolinguistic observations with corpus-scale quantitative evidence. The study demonstrates that large language models can reliably recover interpretable sociolinguistic patterns traditionally accessible only through manual annotation, advancing computational methods for cross-linguistic and low-resource bilingual research.
[4] From Hypothesis to Premises: LLM-based Backward Logical Reasoning with Selective Symbolic Translation
Qingchuan Li,Mingyue Cheng,Zirui Liu,Daoyu Wang,Yuting Zeng,Tongxuan Liu
Main category: cs.CL
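TL;DR: The paper proposes HBLR, an LLM-based backward logical reasoning framework that uses selective symbolic translation and hypothesis-driven backward reasoning to address the redundancy and semantic drift of conventional forward reasoning.
Details
Motivation: Conventional forward logical reasoning suffers from redundant inference paths, hallucinated steps, and semantic drift, which makes reasoning inefficient and unreliable.
Contribution: 1. Proposes the HBLR framework, which combines selective symbolic translation with hypothesis-driven backward reasoning. 2. Introduces reflection modules for translation and reasoning to improve semantic fidelity and logical coherence.
Method: 1. Selective symbolic translation: only high-confidence text is converted into logical form (e.g., FOL); the rest stays in natural language. 2. Hypothesis-driven backward reasoning: assume the conclusion is true and recursively verify its premises. 3. Reflection modules: evaluate and correct errors in translation and reasoning.
Result: On five reasoning benchmarks, HBLR outperforms the baselines in both accuracy and efficiency.
Insight: Backward reasoning combined with selective symbolic translation simulates human deductive thinking more efficiently, reducing redundancy and errors.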
Abstract: Logical reasoning is a core challenge in natural language understanding and a fundamental capability of artificial intelligence, underpinning scientific discovery, mathematical theorem proving, and complex decision-making. Despite the remarkable progress of large language models (LLMs), most current approaches still rely on forward reasoning paradigms, generating step-by-step rationales from premises to conclusions. However, such methods often suffer from redundant inference paths, hallucinated steps, and semantic drift, resulting in inefficient and unreliable reasoning. In this paper, we propose a novel framework, Hypothesis-driven Backward Logical Reasoning (HBLR). The core idea is to integrate confidence-aware symbolic translation with hypothesis-driven backward reasoning. In the translation phase, only high-confidence spans are converted into logical form, such as First-Order Logic (FOL), while uncertain content remains in natural language. A translation reflection module further ensures semantic fidelity by evaluating symbolic outputs and reverting lossy ones back to text when necessary. In the reasoning phase, HBLR simulates human deductive thinking by assuming the conclusion is true and recursively verifying its premises. A reasoning reflection module further identifies and corrects flawed inference steps, enhancing logical coherence. Extensive experiments on five reasoning benchmarks demonstrate that HBLR consistently outperforms strong baselines in both accuracy and efficiency.
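The idea of starting from the hypothesis and recursively verifying its premises is the classic backward-chaining pattern. The sketch below shows that pattern on propositional Horn rules only; it is not the HBLR pipeline, which additionally involves LLM-based translation and reflection modules. The rule format and the name `prove` are assumptions made for this illustration.

```python
def prove(goal, facts, rules, seen=frozenset()):
    """Backward chaining: a goal holds if it is a known fact, or if every
    premise of at least one rule that concludes the goal can be proved.

    facts: set of atoms known to be true.
    rules: dict mapping a conclusion to a list of premise lists.
    seen:  goals already on the current proof path (guards against cycles).
    """
    if goal in facts:
        return True
    if goal in seen:
        return False  # circular dependency, give up on this path
    path = seen | {goal}
    return any(
        all(prove(premise, facts, rules, path) for premise in premises)
        for premises in rules.get(goal, [])
    )

facts = {"rains"}
rules = {
    "wet_ground": [["rains"]],
    "slippery": [["wet_ground"]],
}
```

Calling `prove("slippery", facts, rules)` only explores the premises that matter for that hypothesis, which is the efficiency argument for working backward instead of deriving every forward consequence.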
[5] Nexus: Higher-Order Attention Mechanisms in Transformers
Hanting Chen,Chu Zhong,Kai Han,Yuchuan Tian,Yuchen Liang,Tianyu Guo,Xinghao Chen,Dacheng Tao,Yunhe Wang
Main category: cs.CL
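TL;DR: This paper proposes the Higher-Order Attention Network (Hon), a new Transformer architecture that strengthens the modeling of higher-order dependencies through a recursive framework.
Details
Motivation: Standard Transformers rely on first-order attention, which struggles to capture complex higher-order dependencies and limits representational power.
Contribution: Proposes the Hon architecture, which generates Query and Key vectors dynamically and recursively to improve the modeling of higher-order dependencies.
Method: Hon uses nested self-attention to generate Query and Key vectors dynamically, with a weight-sharing strategy for efficiency.
Result: Theoretical analysis and experiments indicate that Hon breaks the linear bottleneck of standard attention and outperforms standard Transformers on multiple benchmarks.
Insight: Dynamically generating the inputs to the attention computation through recursion improves the model's ability to capture higher-order dependencies while remaining parameter-efficient.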
Abstract: Transformers have achieved significant success across various domains, relying on self-attention to capture dependencies. However, the standard first-order attention mechanism is often limited by a low-rank bottleneck, struggling to capture intricate, multi-hop relationships within a single layer. In this paper, we propose the Higher-Order Attention Network (Hon), a novel architecture designed to enhance representational power through a recursive framework. Unlike standard approaches that use static linear projections for Queries and Keys, Hon dynamically refines these representations via nested self-attention mechanisms. Specifically, the Query and Key vectors are themselves outputs of inner attention loops, allowing tokens to aggregate global context and model high-order correlations prior to the final attention computation. We enforce a parameter-efficient weight-sharing strategy across recursive steps, ensuring that this enhanced expressivity incurs $\mathcal{O}(1)$ additional parameters. We provide theoretical analysis demonstrating that our method breaks the linear bottleneck of standard attention. Empirically, Hon outperforms standard Transformers on multiple benchmarks.
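To make "Queries and Keys are themselves outputs of inner attention loops" concrete, here is a rough pure-Python sketch. The abstract does not specify the exact recursion, so the refinement rule below (re-attending with the same projections and letting Q and K aggregate themselves) is an assumption, and `higher_order_attention` and its `depth` parameter are hypothetical names rather than the authors' definitions.

```python
import math

def matmul(a, b):
    """Matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Standard scaled dot-product attention (first order)."""
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q @ k^T
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def higher_order_attention(x, wq, wk, wv, depth=1):
    """Assumed recursion: refine Q and K with inner attention passes that
    reuse the same projections (weight sharing), then attend over V."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    for _ in range(depth):
        # Both right-hand sides use the q and k from the previous step.
        q, k = attention(q, k, q), attention(q, k, k)
    return attention(q, k, v)

identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output = higher_order_attention(tokens, identity, identity, identity, depth=1)
```

Setting `depth=0` reduces the function to ordinary attention, and because the inner passes reuse `wq` and `wk`, adding depth adds no new projection matrices in this sketch.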
[6] Characterizing Language Use in a Collaborative Situated Game
Nicholas Tomlin,Naitian Zhou,Eve Fleisig,Liangyuan Chen,Téa Wright,Lauren Vinh,Laura X. Ma,Seun Eisape,Ellie French,Tingting Du,Tianjiao Zhang,Alexander Koller,Alane Suhr
Main category: cs.CL
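TL;DR: The paper collects and analyzes 11.5 hours of player dialogue from the co-op mode of Portal 2, identifies linguistic phenomena that arise in complex situated settings, and releases a corpus with multimodal data.
Details
Motivation: The study explores what is distinctive about language use in complex collaborative environments, filling a gap left by existing chitchat and task-oriented dialogue data.
Contribution: Builds and publicly releases the Portal Dialogue Corpus, which includes rich multimodal data and supporting annotations.
Method: Collects players' dialogue and behavior data in Portal 2 and analyzes it using both manual and automatic annotations.
Result: Identifies distinctive linguistic phenomena, including spatial reference, clarification and repair, and ad-hoc convention formation.
Insight: Language use in complex collaborative settings differs markedly from ordinary dialogue; future work can use the released corpus for deeper analysis.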
Abstract: Cooperative video games, where multiple participants must coordinate by communicating and reasoning under uncertainty in complex environments, yield a rich source of language data. We collect the Portal Dialogue Corpus: a corpus of 11.5 hours of spoken human dialogue in the co-op mode of the popular Portal 2 virtual puzzle game, comprising 24.5K total utterances. We analyze player language and behavior, identifying a number of linguistic phenomena that rarely appear in most existing chitchat or task-oriented dialogue corpora, including complex spatial reference, clarification and repair, and ad-hoc convention formation. To support future analyses of language use in complex, situated, collaborative problem-solving scenarios, we publicly release the corpus, which comprises player videos, audio, transcripts, game state data, and both manual and automatic annotations of language data.
[7] Dual LoRA: Enhancing LoRA with Magnitude and Direction Updates
Yixing Xu,Chao Li,Xuanwu Yin,Spandan Tiwari,Dong Li,Ashish Sirasao,Emad Barsoum
Main category: cs.CL
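TL;DR: The paper proposes Dual LoRA, which splits the low-rank matrices into a magnitude group and a direction group to improve on LoRA; experiments show it outperforms LoRA and its variants on a range of NLP tasks.
Details
Motivation: LoRA's performance is limited by its low-rank assumption and needs improvement to adapt better to downstream tasks.
Contribution: Proposes Dual LoRA, which improves model performance through an update mechanism that separates magnitude from direction.
Method: The low-rank matrices are split into a magnitude group (controlled by ReLU) and a direction group (controlled by a sign function) to simulate full-parameter fine-tuning.
Result: Experiments on models including GPT-2 and RoBERTa show that Dual LoRA outperforms LoRA and its variants with the same number of trainable parameters.
Insight: Separating update direction from magnitude simulates gradient-based optimization more effectively, improving parameter efficiency and model performance.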
Abstract: Low-rank adaptation (LoRA) is one of the most popular methods among parameter-efficient fine-tuning (PEFT) methods to adapt pre-trained large language models (LLMs) to specific downstream tasks. However, the model trained based on LoRA often has an unsatisfactory performance due to its low-rank assumption. In this paper, we propose a novel method called Dual LoRA to improve the performance by incorporating an inductive bias into the original LoRA. Specifically, we separate low-rank matrices into two groups: the magnitude group to control whether or not and how far we should update a parameter and the direction group to decide whether this parameter should move forward or backward, to better simulate the parameter updating process of the full fine-tuning based on gradient-based optimization algorithms. We show that this can be simply achieved by adding a ReLU function to the magnitude group and a sign function to the direction group. We conduct several experiments over a wide range of NLP tasks, including natural language generation (NLG), understanding (NLU), and commonsense reasoning datasets on GPT-2, RoBERTa, DeBERTa, and LLaMA-1/2/3 as baseline models. The results show that we consistently outperform LoRA and its state-of-the-art variants with the same number of trainable parameters.
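A small sketch may help picture the magnitude and direction groups. The abstract states only that a ReLU is added to the magnitude group and a sign function to the direction group; combining the two by an element-wise product, as done below, is an assumption, and `dual_lora_delta` and its argument names are hypothetical.

```python
def matmul(a, b):
    """Matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)

def dual_lora_delta(a_mag, b_mag, a_dir, b_dir):
    """Weight update built from two low-rank products (assumed combination).

    ReLU(b_mag @ a_mag) decides whether and how far each weight moves;
    sign(b_dir @ a_dir) decides whether it moves up or down.
    """
    magnitude = matmul(b_mag, a_mag)
    direction = matmul(b_dir, a_dir)
    return [
        [max(m, 0.0) * sign(d) for m, d in zip(m_row, d_row)]
        for m_row, d_row in zip(magnitude, direction)
    ]

# Rank-1 toy factors for a 2x2 weight matrix.
delta = dual_lora_delta(
    a_mag=[[2.0, 3.0]], b_mag=[[1.0], [-1.0]],
    a_dir=[[-1.0, 1.0]], b_dir=[[1.0], [1.0]],
)
```

In this toy case the second row of the magnitude product is negative, so the ReLU zeroes it and those weights are left unchanged, while the first row moves by 2 and 3 in the directions chosen by the sign term. Note that a sign function has zero gradient almost everywhere, so a trainable version would need some gradient workaround that this sketch does not cover.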
[8] PretrainZero: Reinforcement Active Pretraining
Xingrun Xing,Zhiyuan Fan,Jie Lou,Guoqi Li,Jiajun Zhang,Debing Zhang
Main category: cs.CL
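TL;DR: PretrainZero is a reinforcement-learning-based active pretraining framework that learns a unified reasoning policy to identify informative content in the pretraining corpus, extending RL from domain-specific tasks to general reasoning.
Details
Motivation: Current RL-based large models show expert-level abilities in specific domains but still rely heavily on verifiable reward signals, which limits the extension of general reasoning capabilities. Inspired by human active learning, PretrainZero aims to improve general reasoning without supervision.
Contribution: 1) Proposes PretrainZero, an active pretraining framework that learns a unified reasoning policy through RL; 2) learns without supervision, requiring no labels or pretrained reward models; 3) strengthens reasoning through progressively more challenging tasks.
Method: PretrainZero combines reinforcement learning with active learning: it identifies informative content directly in the pretraining corpus and predicts it, improving reasoning by gradually increasing task difficulty.
Result: On MMLU-Pro, SuperGPQA, and the math average benchmarks, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 points, respectively.
Insight: Unsupervised RL shows promise for general reasoning, and a policy that actively identifies informative content can substantially improve generalization.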
Abstract: Mimicking human behavior to actively learn from general experience and achieve artificial general intelligence has always been a human dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, e.g., in software and math, but still rely heavily on verifiable rewards in specific domains, which creates a significant bottleneck for extending the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus to extend RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative contents from the pretraining corpus, and reasons to predict these contents by RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3 to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data-wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 on MMLU-Pro, SuperGPQA, and math average benchmarks. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.
[9] A Preliminary Study on the Promises and Challenges of Native Top-$k$ Sparse Attention
Di Xiu,Hongyin Tang,Bolin Rong,Lizhi Yan,Jingang Wang,Yifan Lu,Xunliang Cai
Main category: cs.CL
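TL;DR: This paper studies the effectiveness and theoretical mechanisms of Top-$k$ sparse attention during decoding and training, validates the performance of exact Top-$k$ decoding, and examines how the precision of approximate Top-$k$ algorithms affects downstream tasks.
Details
Motivation: LLMs are increasingly used for long-context modeling, but inference compute cost has become a bottleneck. The study explores the potential of Top-$k$ attention in decoding and training to improve efficiency and to explain its theoretical mechanisms.
Contribution: 1. Validates exact Top-$k$ decoding; 2. explores a native Top-$k$ attention training strategy; 3. studies the relationship between approximate Top-$k$ precision and downstream performance; 4. offers a theoretical interpretation from the perspective of entropy.
Method: Experiments evaluate exact and approximate Top-$k$ attention, and the theoretical mechanism is analyzed from the perspective of entropy.
Result: Top-$k$ decoding matches or surpasses full attention; consistency between training and inference improves results further; approximation precision correlates positively with performance; and the observed entropy reduction supports the hypothesis that low-entropy states are better adapted to Top-$k$ decoding.
Insight: Top-$k$ attention balances computational efficiency and performance, and low-entropy states may be better suited to sparse attention.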
Abstract: Large Language Models (LLMs) are increasingly prevalent in the field of long-context modeling, however, their inference computational costs have become a critical bottleneck hindering the advancement of tasks such as agents and multimodal applications. This report conducts a preliminary investigation into the effectiveness and theoretical mechanisms of the Top-$k$ Attention mechanism during both the decoding and training phases. First, we validate the effectiveness of exact Top-$k$ Decoding through extensive experimentation. Experiments demonstrate that retaining only the pivotal Keys with the highest similarity to the Query as the context window during the decoding stage achieves performance comparable to, or even surpassing, full attention on downstream tasks such as HELMET and LongBench v2. Second, we further explore the native Top-$k$ Attention training strategy. Experiments confirm that ensuring the consistency between training and inference regarding Top-$k$ Attention operations facilitates the further unlocking of Top-$k$ Decoding’s potential, thereby significantly enhancing model performance. Furthermore, considering the high computational complexity of exact Top-$k$ Attention, we investigate the impact of approximate Top-$k$ algorithm precision on downstream tasks. Our research confirms a positive correlation between downstream task performance and approximation fidelity, and we provide statistical evaluations of the Lightning Indexer’s precision within the DeepSeek-V3.2-Exp model. Finally, this report provides a theoretical interpretation from the perspective of Entropy. Experimental observations indicate that models subjected to Top-$k$ Attention SFT exhibit a distinct phenomenon of entropy reduction in downstream tasks, which validates the hypothesis that low-entropy states are better adapted to Top-$k$ Decoding.
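Exact Top-$k$ decoding, as described here, keeps only the Keys most similar to the Query and runs attention over that reduced context. The following is a minimal single-query sketch under simple assumptions (dot-product similarity with the usual square-root scaling, one head, no batching); the function name `topk_attention` is hypothetical and is not taken from the report.

```python
import math

def topk_attention(query, keys, values, k):
    """Attend over only the k keys with the highest scaled dot-product score."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]
    # Indices of the k best-scoring keys (exact selection via a full sort).
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    top = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - top) for i in keep}
    total = sum(exps.values())
    dim = len(values[0])
    return [sum(exps[i] / total * values[i][d] for i in keep) for d in range(dim)]

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
sparse = topk_attention(query, keys, values, k=1)
full = topk_attention(query, keys, values, k=len(keys))
```

With `k` equal to the number of keys the function is ordinary softmax attention, so the same code gives both the sparse result and the full-attention reference it is compared against. The full sort costs as much as full attention, which is why the report also considers approximate selection.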
[10] Understanding LLM Reasoning for Abstractive Summarization
Haohan Yuan,Siu Cheung Hui,Haopeng Zhang
Main category: cs.CL
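TL;DR: This paper studies the reasoning capabilities of large language models (LLMs) for abstractive summarization and finds that the effect of reasoning strategies depends on context, with a trade-off between summary quality and factual faithfulness.
Details
Motivation: Although LLMs excel at analytical tasks such as mathematics and code generation, their reasoning ability in abstractive summarization has not been adequately verified. The paper fills this gap by examining how different reasoning strategies perform on summarization.
Contribution: 1) Tailors general reasoning strategies to summarization; 2) systematically compares 8 reasoning strategies and 3 Large Reasoning Models (LRMs) across 8 datasets; 3) reveals a trade-off between summary quality and factual faithfulness.
Method: The paper first adapts general reasoning strategies to summarization, then runs large-scale experiments comparing 8 reasoning strategies and 3 LRMs on multiple datasets, assessing summary quality and faithfulness.
Result: Reasoning is not a universal solution; its effect depends heavily on the strategy and context. Explicit reasoning strategies improve fluency at the expense of factual faithfulness, while implicit reasoning shows the opposite pattern. Increasing an LRM's internal reasoning budget can even hurt factual consistency.
Insight: For abstractive summarization, "faithful compression" matters more than "creative over-thinking", which offers guidance for designing and optimizing future summarization models.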
Abstract: While the reasoning capabilities of Large Language Models (LLMs) excel in analytical tasks such as mathematics and code generation, their utility for abstractive summarization remains widely assumed but largely unverified. To bridge this gap, we first tailor general reasoning strategies to the summarization domain. We then conduct a systematic, large scale comparative study of 8 reasoning strategies and 3 Large Reasoning Models (LRMs) across 8 diverse datasets, assessing both summary quality and faithfulness. Our findings show that reasoning is not a universal solution and its effectiveness is highly dependent on the specific strategy and context. Specifically, we observe a trade-off between summary quality and factual faithfulness: explicit reasoning strategies tend to improve fluency at the expense of factual grounding, while implicit reasoning in LRMs exhibits the inverse pattern. Furthermore, increasing an LRM’s internal reasoning budget does not improve, and can even hurt, factual consistency, suggesting that effective summarization demands faithful compression rather than creative over-thinking.
[11] Fine-grained Narrative Classification in Biased News Articles
Zeba Afroz,Harsh Vardhan,Pawan Bhakuni,Aanchal Punia,Rajdeep Kumar,Md. Shad Akhtar
Main category: cs.CL
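TL;DR: This paper proposes fine-grained narrative classification for analyzing biased articles in Indian news media. It builds INDI-PROP, a dataset with multi-level annotation, and designs two GPT-4o-mini-based classification frameworks, FANTA and TPTC, which yield substantial gains on bias, narrative, and persuasive-technique classification.
Details
Motivation: Narratives and bias in news media are core components of propaganda, but existing research lacks fine-grained analysis of ideologically driven propaganda in Indian news media. The paper aims to fill this gap.
Contribution: 1) Creates INDI-PROP, the first ideologically grounded dataset with multi-level annotation; 2) proposes FANTA and TPTC, two GPT-4o-mini-based classification frameworks; 3) achieves substantial gains on bias, narrative, and persuasive-technique classification.
Method: Two methods are designed: 1) FANTA integrates information extraction and contextual framing through multi-hop prompting; 2) TPTC decomposes persuasive cues in two stages. Both are built on GPT-4o-mini.
Result: FANTA and TPTC substantially outperform the baselines on bias, narrative, and persuasive-technique classification.
Insight: The paper exposes the structure of ideological narratives in Indian news media and shows that multi-level annotation and LLM prompting are effective for propaganda analysis.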
Abstract: Narratives are the cognitive and emotional scaffolds of propaganda. They organize isolated persuasive techniques into coherent stories that justify actions, attribute blame, and evoke identification with ideological camps. In this paper, we propose a novel fine-grained narrative classification in biased news articles. We also explore article-bias classification as the precursor task to narrative classification and fine-grained persuasive technique identification. We develop INDI-PROP, the first ideologically grounded fine-grained narrative dataset with multi-level annotation for analyzing propaganda in Indian news media. Our dataset INDI-PROP comprises 1,266 articles focusing on two polarizing socio-political events in recent times: CAA and the Farmers’ protest. Each article is annotated at three hierarchical levels: (i) ideological article-bias (pro-government, pro-opposition, neutral), (ii) event-specific fine-grained narrative frames anchored in ideological polarity and communicative intent, and (iii) persuasive techniques. We propose FANTA and TPTC, two GPT-4o-mini guided multi-hop prompt-based reasoning frameworks for the bias, narrative, and persuasive technique classification. FANTA leverages multi-layered communicative phenomena by integrating information extraction and contextual framing for hierarchical reasoning. On the other hand, TPTC adopts systematic decomposition of persuasive cues via a two-stage approach. Our evaluation suggests substantial improvement over underlying baselines in each case.
[12] AlignCheck: a Semantic Open-Domain Metric for Factual Consistency Assessment
Ahmad Aghaebrahimian
Main category: cs.CL
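TL;DR: AlignCheck is an interpretable framework for assessing factual consistency. It decomposes text into atomic facts and introduces a weighted metric, improving on existing methods for open-domain text and high-stakes domains such as clinical applications.
Details
Motivation: LLMs are prone to hallucination when generating content, which is especially concerning in high-stakes domains. Existing evaluation metrics neither assess factual consistency adequately nor offer interpretability, making errors hard to diagnose and fix.
Contribution: 1. Proposes an interpretable framework for factual consistency assessment; 2. introduces a flexible, schema-free method for decomposing text into atomic facts; 3. proposes a weighted metric to enhance factual evaluation; 4. designs a mechanism to control assessment complexity in intricate domains.
Method: Text is decomposed into atomic facts and factual consistency is scored with a weighted metric, together with a mechanism for controlling assessment complexity in intricate domains.
Result: The approach is benchmarked on general and clinical datasets, and the code is released to support future research.
Insight: Through decomposition and weighting, AlignCheck provides more effective and interpretable factual consistency assessment in open-domain and high-stakes settings.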
Abstract: Large Language Models have significantly advanced natural language processing tasks, but remain prone to generating incorrect or misleading but plausible arguments. This issue, known as hallucination, is particularly concerning in high-stakes domains like clinical applications, where factual inaccuracies can have severe consequences. Existing evaluation metrics fail to adequately assess factual consistency and lack interpretability, making diagnosing and mitigating errors difficult. We propose an interpretable framework for factual consistency assessment for in-domain and open-domain texts to address these limitations. Our approach decomposes text into atomic facts and introduces a flexible, schema-free methodology. Unlike previous methods with an absolute metric, we incorporate a weighted metric to enhance factual evaluation. Additionally, we propose a mechanism to control assessment complexity in intricate domains. We benchmark our approach on popular general and clinical datasets and release our code to support fact-aware model training in future research.
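The contrast between an absolute metric and a weighted one can be shown with a few lines of arithmetic. The abstract does not say how AlignCheck assigns weights or judges support, so the sketch below simply assumes each atomic fact arrives with a hypothetical importance weight and a boolean support judgment; `weighted_consistency` is an illustrative name, not the paper's metric.

```python
def weighted_consistency(facts):
    """Share of total fact weight that is supported by the reference.

    facts: list of (weight, supported) pairs, one per atomic fact.
    Returns 0.0 when there is nothing to score.
    """
    total = sum(weight for weight, _ in facts)
    if total == 0:
        return 0.0
    return sum(weight for weight, supported in facts if supported) / total

# One important supported fact and one minor unsupported fact (toy values).
facts = [(3.0, True), (1.0, False)]
score = weighted_consistency(facts)
unweighted = weighted_consistency([(1.0, ok) for _, ok in facts])
```

Here the unweighted score treats both facts alike and gives 0.5, while the weighted score gives 0.75 because the unsupported fact carries less weight. Reporting the per-fact judgments alongside the score is what makes this kind of metric inspectable.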
[13] Evaluating Hydro-Science and Engineering Knowledge of Large Language Models
Shiruo Hu,Wenbo Shan,Yingjia Li,Zhiqi Wan,Xinpeng Yu,Yunjia Qi,Haotian Xia,Yang Xiao,Dingxiao Liu,Jiaru Wang,Chenxu Gong,Ruixi Zhang,Shuyue Wu,Shibo Cui,Chee Hui Lai,Wei Luo,Yubin He,Bin Xu,Jianshi Zhao
Main category: cs.CL
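TL;DR: This paper introduces Hydro-SE Bench, a benchmark for evaluating the knowledge of large language models (LLMs) in Hydro-Science and Engineering (Hydro-SE), and analyzes the capabilities and limitations of LLMs in this domain.
Details
Motivation: Hydro-SE is a critical interdisciplinary domain, yet the knowledge and application abilities of LLMs in it have not been systematically evaluated.
Contribution: Proposes Hydro-SE Bench, a benchmark of 4,000 multiple-choice questions covering nine subfields, which evaluates LLMs on basic conceptual knowledge, engineering application ability, and reasoning and calculation ability.
Method: Designs the Hydro-SE Bench benchmark to cover a broad range of knowledge points and abilities, and systematically evaluates commercial and small-parameter LLMs.
Result: Commercial LLMs reach accuracies of 0.74 to 0.80 and small-parameter LLMs 0.41 to 0.68. LLMs perform well in subfields related to the natural sciences but are weaker in domain-specific areas such as industry standards and hydraulic structures.
Insight: Model scaling mainly improves reasoning and calculation abilities, while practical engineering application still has room for improvement; the study gives clear direction for developing and applying LLMs.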
Abstract: Hydro-Science and Engineering (Hydro-SE) is a critical and irreplaceable domain that secures human water supply, generates clean hydropower energy, and mitigates flood and drought disasters. Featuring multiple engineering objectives, Hydro-SE is an inherently interdisciplinary domain that integrates scientific knowledge with engineering expertise. This integration necessitates extensive expert collaboration in decision-making, which poses difficulties for intelligence. With the rapid advancement of large language models (LLMs), their potential application in the Hydro-SE domain is being increasingly explored. However, the knowledge and application abilities of LLMs in Hydro-SE have not been sufficiently evaluated. To address this issue, we propose the Hydro-SE LLM evaluation benchmark (Hydro-SE Bench), which contains 4,000 multiple-choice questions. Hydro-SE Bench covers nine subfields and enables evaluation of LLMs in aspects of basic conceptual knowledge, engineering application ability, and reasoning and calculation ability. The evaluation results on Hydro-SE Bench show that accuracy ranges from 0.74 to 0.80 for commercial LLMs and from 0.41 to 0.68 for small-parameter LLMs. While LLMs perform well in subfields closely related to natural and physical sciences, they struggle with domain-specific knowledge such as industry standards and hydraulic structures. Model scaling mainly improves reasoning and calculation abilities, but there is still great potential for LLMs to better handle problems in practical engineering application. This study highlights the strengths and weaknesses of LLMs for Hydro-SE tasks, providing model developers with clear training targets and Hydro-SE researchers with practical guidance for applying LLMs.
[14] AR-Med: Automated Relevance Enhancement in Medical Search via LLM-Driven Information Augmentation
Chuyue Wang,Jie Feng,Yuxi Wu,Hang Zhang,Zhiguo Fan,Bing Cheng,Wei Lin
Main category: cs.CL
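TL;DR: AR-Med is an automated relevance enhancement framework for medical search. It uses LLM-driven information augmentation with a retrieval-augmented approach for accuracy and reliability, and distills large models into compact student models for efficient online serving.
Details
Motivation: Search accuracy on online healthcare platforms is critical to user safety and service efficacy. Traditional methods struggle with complex and nuanced queries, and LLMs, though promising, suffer from factual hallucinations, specialized knowledge gaps, and high cost.
Contribution: 1. Proposes the AR-Med framework, which uses retrieval augmentation to keep LLM reasoning accurate; 2. designs a knowledge distillation scheme that compresses large models into compact student models; 3. develops the LocalQSMed benchmark for model iteration and for aligning offline and online performance.
Method: 1. Retrieval augmentation grounds LLM reasoning in verified medical knowledge; 2. knowledge distillation compresses large teacher models into compact student models; 3. LocalQSMed, a multi-expert annotated benchmark, is introduced.
Result: AR-Med achieves an offline accuracy of over 93%, a 24% absolute improvement over the original online system, and delivers significant gains in online relevance and user satisfaction.
Insight: Combining retrieval augmentation with knowledge distillation can address factual hallucination and high cost when deploying LLMs in high-stakes domains such as healthcare, and the LocalQSMed benchmark provides a reliable tool for model iteration and performance alignment.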
Abstract: Accurate and reliable search on online healthcare platforms is critical for user safety and service efficacy. Traditional methods, however, often fail to comprehend complex and nuanced user queries, limiting their effectiveness. Large language models (LLMs) present a promising solution, offering powerful semantic understanding to bridge this gap. Despite their potential, deploying LLMs in this high-stakes domain is fraught with challenges, including factual hallucinations, specialized knowledge gaps, and high operational costs. To overcome these barriers, we introduce AR-Med, a novel framework for Automated Relevance assessment for Medical search that has been successfully deployed at scale on the Online Medical Delivery Platforms. AR-Med grounds LLM reasoning in verified medical knowledge through a retrieval-augmented approach, ensuring high accuracy and reliability. To enable efficient online service, we design a practical knowledge distillation scheme that compresses large teacher models into compact yet powerful student models. We also introduce LocalQSMed, a multi-expert annotated benchmark developed to guide model iteration and ensure strong alignment between offline and online performance. Extensive experiments show AR-Med achieves an offline accuracy of over 93%, a 24% absolute improvement over the original online system, and delivers significant gains in online relevance and user satisfaction. Our work presents a practical and scalable blueprint for developing trustworthy, LLM-powered systems in real-world healthcare applications.
[15] Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective
Jingyang Ou,Jiaqi Han,Minkai Xu,Shaoxuan Xu,Jianwen Xie,Stefano Ermon,Yi Wu,Chongxuan Li
Main category: cs.CL
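TL;DR: The paper proposes ESPO, a reinforcement learning framework for diffusion large language models (dLLMs) that treats generation as a sequence-level action and significantly outperforms conventional token-level methods.
Details
Motivation: The iterative denoising steps of diffusion language models lack a token-level probability factorization, so conventional RL methods are hard to apply. The authors propose a sequence-level optimization framework to address this.
Contribution: Proposes ESPO (ELBO-based Sequence-level Policy Optimization), a sequence-level RL method based on the evidence lower bound (ELBO).
Method: ESPO treats the generation of an entire sequence as a single action, uses the ELBO as a proxy for the sequence-level likelihood, and stabilizes training with per-token normalization of importance ratios and robust KL-divergence estimation.
Result: On mathematical reasoning, coding, and planning tasks, ESPO significantly outperforms token-level baselines, with gains of 20-40 points on the Countdown task.
Insight: Sequence-level optimization is a principled and empirically effective RL paradigm for diffusion language models.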
Abstract: Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs. Our code is available at https://github.com/ML-GSAI/ESPO.
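One way to read "ELBO as a sequence-level likelihood proxy with per-token normalization of importance ratios" is sketched below. This is an interpretation rather than the authors' objective: dividing the ELBO difference by sequence length, and plugging the ratio into a PPO-style clipped surrogate, are both assumptions, and the function names are hypothetical. The released code at the URL in the abstract is the authoritative reference.

```python
import math

def sequence_ratio(elbo_new, elbo_old, length):
    """Importance ratio for a whole sequence treated as one action.

    The ELBO values stand in for sequence log-likelihoods; dividing their
    difference by the token count keeps the ratio on a per-token scale so
    long sequences do not produce extreme values (assumed normalization).
    """
    return math.exp((elbo_new - elbo_old) / length)

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate applied once per sequence (assumption)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# Toy numbers: the new policy raises the ELBO of a 4-token sequence by 2 nats.
ratio = sequence_ratio(elbo_new=-8.0, elbo_old=-10.0, length=4)
objective = clipped_objective(ratio, advantage=1.0)
```

Without the division by `length`, the same 2-nat improvement would give a ratio of about 7.4 instead of about 1.65, which illustrates why some normalization is needed for stable training on long sequences.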
[16] In-Context Representation Hijacking
Itay Yona,Amir Sarid,Michael Karasik,Yossi Gandelsman
Main category: cs.CL
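TL;DR: This paper introduces an attack called 'Doublespeak' that substitutes keywords in the context so that a large language model's internal representation of a benign word converges toward that of a harmful word, bypassing safety alignment.
Details
Motivation: Current LLM safety alignment focuses mainly on filtering surface text and overlooks potential vulnerabilities in internal representations. By exposing this representation-level safety problem, the authors hope to encourage research into deeper alignment strategies.
Contribution: Proposes the 'Doublespeak' attack, shows that simple in-context substitution can bypass LLM safety alignment, and reveals an attack surface at the representation level.
Method: Harmful keywords are systematically replaced with benign words (e.g., 'bomb' with 'carrot'), and the model's internal representations are examined, showing that the benign word's representation gradually converges toward that of the harmful word.
Result: Doublespeak is effective on both open-source and closed-source models, reaching an attack success rate of 74% on Llama-3.3-70B-Instruct.
Insight: The study exposes a representation-level weakness in current LLM safety alignment strategies and indicates that future alignment work should operate at the representation level rather than relying on surface-text filtering alone.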
Abstract: We introduce Doublespeak, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided as a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model’s safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
[17] Enhancing Instruction-Following Capabilities in Seq2Seq Models: DoLA Adaptations for T5
Huey Sun,Anabel Yong,Lorenzo Gilly,Felipe Jin
Main category: cs.CL
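TL;DR: This paper adapts the contrastive decoding method DoLa to the T5 and FLAN-T5 models for the first time, studies its effect on instruction following, and uses a layer-by-layer analysis of logit evolution to quantify its impact on token output probabilities.
Details
Motivation: Existing contrastive decoding methods such as DoLa have only been implemented in decoder-only architectures and have mainly been studied for improving factuality. This work explores their effect on instruction following in encoder-decoder architectures such as T5.
Contribution: First adaptation of DoLa to the T5 and FLAN-T5 model families, with a study of its effect on instruction following; a layer-by-layer analysis that quantifies DoLa's effect on token output probabilities.
Method: Applies DoLa (Decoding by Contrastive Layers) to T5 and FLAN-T5 models and interprets the results through a layer-by-layer analysis of logit evolution.
Result: DoLa improves the faithfulness of generated text for some task categories and harms others; the layer-by-layer analysis shows that its effect differs across layers.
Insight: The effect of contrastive decoding depends strongly on the task type, and layer-by-layer analysis helps explain the model's internal generation process.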
Abstract: Contrastive decoding is a lightweight and effective inference-time method that improves the quality of text generation in Large Language Models. However, algorithms such as DoLa (Decoding by Contrastive Layers) have only been implemented in decoder-only architectures and studied for their impact on improving factuality. This work adapts DoLa for the T5 and FLAN-T5 model families and evaluates its impact on the models’ instruction following capabilities, which to our knowledge is the first implementation of a contrastive decoding strategy in an encoder-decoder architecture. Our results show that DoLa improves the faithfulness of text generation for certain categories of tasks and harms others. To understand these results, we present a layer-by-layer analysis of logit evolution in a FLAN-T5 model to quantify DoLa’s impact on token output probabilities.
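A minimal sketch of the contrast step in layer-contrastive decoding, assuming vocabulary logits from the final decoder layer and from one earlier layer are already available. The fixed choice of early layer, the `alpha` threshold, and the function names are illustrative assumptions; the abstract does not say how layers are selected in the T5 adaptation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrast_scores(final_logits, early_logits, alpha=0.1):
    """Score each token by log p_final - log p_early.

    Tokens whose final-layer probability is below alpha times the top
    probability are masked out with -inf (a plausibility constraint).
    """
    log_final = log_softmax(final_logits)
    log_early = log_softmax(early_logits)
    cutoff = max(log_final) + math.log(alpha)
    return [f - e if f >= cutoff else -math.inf
            for f, e in zip(log_final, log_early)]

# Token 1 gains probability between the early and final layers, so the
# contrast prefers it even though token 0 has the highest final logit.
scores = contrast_scores([2.0, 1.9, -5.0], [2.0, 0.0, -5.0])
best_token = max(range(len(scores)), key=scores.__getitem__)
```

The masking step matters: without it, a token that is implausible under both layers could receive a large contrast score from a tiny denominator.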
[18] Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology
Kylie L. Anglin,Stephanie Milan,Brittney Hernandez,Claudia Ventura
Main category: cs.CL
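TL;DR: This paper examines how prompt engineering can improve the classification performance of large language models (LLMs) when identifying theoretical constructs in text in psychology. Experiments evaluate five prompting strategies and find that the construct definition and task framing matter most, and that a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering performs best.
Details
Motivation: Theoretical constructs in psychology have precise definitions that may not be well represented in LLM pre-training data. The study aims to use prompt engineering to improve LLM classification in this domain so that outputs align more closely with expert judgment.
Contribution: Provides an empirical framework that optimizes LLM classification through five prompting strategies, identifies the construct definition and task framing as the key factors, and recommends combining human-crafted and automatically generated prompts.
Method: Evaluates five prompting strategies: 1) codebook-guided empirical prompt selection, 2) automatic prompt engineering, 3) persona prompting, 4) chain-of-thought reasoning, and 5) explanatory prompting, each with zero-shot and few-shot classification.
Result: Persona, chain-of-thought, and explanatory prompting do not fully compensate for the performance loss caused by a poorly worded prompt. Across three constructs and two models, a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering aligned best with expert judgments.
Insight: Prompt quality is critical for LLM classification, especially where alignment with expert judgment is required; the authors recommend generating and evaluating as many prompt variants as feasible.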
Abstract: Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output (here, the category assigned to a text) depends heavily on the wording of the prompt. While literature on prompt engineering is expanding, few studies focus on classification tasks, and even fewer address domains like psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data. We present an empirical framework for optimizing LLM performance for identifying constructs in texts via prompt engineering. We experimentally evaluate five prompting strategies (codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting) with zero-shot and few-shot classification. We find that persona, chain-of-thought, and explanations do not fully address performance loss accompanying a badly worded prompt. Instead, the most influential features of a prompt are the construct definition, task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the classifications most aligned with expert judgments resulted from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering. Based on our findings, we recommend that researchers generate and evaluate as many prompt variants as feasible, whether human-crafted, automatically generated, or ideally both, and select prompts and examples based on empirical performance in a training dataset, validating the final approach in a holdout set. This procedure offers a practical, systematic, and theory-driven method for optimizing LLM prompts in settings where alignment with expert judgment is critical.
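The recommended procedure (pick the prompt by training-set performance, then validate once on a holdout set) can be sketched as below. The `toy_classify` function is a hypothetical stand-in for an LLM call, and the texts, labels, and cue-word "prompts" are invented purely for illustration.

```python
def accuracy(classify, prompt, data):
    """Fraction of (text, label) pairs that the classifier labels correctly."""
    return sum(classify(prompt, text) == label for text, label in data) / len(data)

def select_prompt(classify, prompts, train, holdout):
    """Choose the prompt with the best training accuracy, then score it on holdout."""
    best = max(prompts, key=lambda p: accuracy(classify, p, train))
    return best, accuracy(classify, best, train), accuracy(classify, best, holdout)

def toy_classify(prompt, text):
    # Toy stand-in for an LLM: the "prompt" is a comma-separated list of cue words.
    return int(any(cue in text.lower() for cue in prompt.split(",")))

train = [("I worry all the time", 1), ("I feel afraid of exams", 1),
         ("I had lunch", 0), ("The exam was long", 0)]
holdout = [("She is afraid to speak", 1), ("We walked home", 0)]
prompts = ["exam", "worry", "worry,afraid"]

best, train_acc, holdout_acc = select_prompt(toy_classify, prompts, train, holdout)
```

Keeping the holdout set out of the selection step is the point of the design: the holdout score is only an honest estimate if it played no part in choosing the prompt.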
[19] Training and Evaluation of Guideline-Based Medical Reasoning in LLMs
Michael Staniek,Artem Sokolov,Stefan Riezler
Main category: cs.CL
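TL;DR: This paper studies how to train large language models (LLMs) on consensus guidelines so that they follow the guidelines step by step in medical reasoning and give faithful explanations, making their predictions more trustworthy.
Details
Motivation: Medical prediction models have become more accurate, but they explain themselves poorly and struggle to earn practitioners' trust. Consensus guidelines are ubiquitous in medicine, so training LLMs to follow them is a route to greater transparency and trust.
Contribution: Proposes a consensus-guideline-based approach to training LLMs that allows automatic evaluation of the model's reasoning (derivation correctness and value correctness), and combines it with a time series forecasting model to improve prediction from sparse future data.
Method: (1) Fine-tune an LLM on instantiations of guideline rules on electronic health records; (2) automatically evaluate the model's reasoning against the consensus rules; (3) integrate the representations of a time series forecasting model with the LLM to improve forecasting. Experiments use the Sepsis-3 consensus definition.
Result: Small fine-tuned models outperform considerably larger LLMs prompted one-shot and models trained only on medical texts. Fine-tuned models reach nearly perfect derivation correctness on unseen patient data, but forecasting sparse future data remains challenging.
Insight: The bottleneck for early prediction in medicine is not out-of-distribution generalization but forecasting sparsely and irregularly sampled clinical variables into the future. A multimodal setup that integrates a time series forecasting model can improve this.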
Abstract: Machine learning for early prediction in medicine has recently shown breakthrough performance; however, the focus on improving prediction accuracy has led to a neglect of faithful explanations that are required to gain the trust of medical practitioners. The goal of this paper is to teach LLMs to follow medical consensus guidelines step-by-step in their reasoning and prediction process. Since consensus guidelines are ubiquitous in medicine, instantiations of verbalized medical inference rules to electronic health records provide data for fine-tuning LLMs to learn consensus rules and possible exceptions thereof for many medical areas. Consensus rules also enable an automatic evaluation of the model’s inference process regarding its derivation correctness (evaluating correct and faithful deduction of a conclusion from given premises) and value correctness (comparing predicted values against real-world measurements). We exemplify our work using the complex Sepsis-3 consensus definition. Our experiments show that small fine-tuned models outperform one-shot learning of considerably larger LLMs that are prompted with the explicit definition and models that are trained on medical texts including consensus definitions. Since fine-tuning on verbalized rule instantiations of a specific medical area yields nearly perfect derivation correctness for rules (and exceptions) on unseen patient data in that area, the bottleneck for early prediction is not out-of-distribution generalization, but the orthogonal problem of generalization into the future by forecasting sparsely and irregularly sampled clinical variables. We show that the latter results can be improved by integrating the output representations of a time series forecasting model with the LLM in a multimodal setup.
[20] Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
Hongzhan Lin,Zhiqi Bai,Xinmiao Zhang,Sen Yang,Xiang Li,Siran Yang,Yunlong Xu,Jiaheng Liu,Yongchi Zhao,Jiamang Wang,Yuchi Xu,Wenbo Su,Bo Zheng
Main category: cs.CL
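TL;DR: This paper proposes FusedKV and FusedKV-Lite, which use cross-layer fusion to ease the KV cache bottleneck in Transformer decoders, substantially reducing memory use while maintaining performance.
Details
Motivation: The KV cache in Transformer decoders requires too much memory at long sequence lengths, and existing cross-layer sharing methods (e.g., YOCO, CLA) underperform within-layer methods (e.g., GQA), so a better way to manage the KV cache is needed.
Contribution: 1. Proposes FusedKV, which builds top-layer KV caches through a learnable cross-layer fusion; 2. proposes the more efficient FusedKV-Lite; 3. validates the memory savings and performance gains across several LLM scales.
Method: 1. Analyzes how key and value information flows across layers to motivate a cross-layer fusion strategy; 2. FusedKV directly fuses information from the bottom and middle layers; 3. FusedKV-Lite simplifies the fusion to reduce I/O overhead.
Result: On LLMs from 332M to 4B parameters, the method reduces cache memory by 50% while achieving lower validation perplexity than the standard Transformer decoder.
Insight: KV cache information is unevenly distributed across layers, and cross-layer fusion can exploit complementary information from the bottom and middle layers to improve both efficiency and performance.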
Abstract: Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although cross-layer KV cache sharing (e.g., YOCO, CLA) offers a path to mitigate the KV cache bottleneck, it typically underperforms within-layer methods like GQA. To understand the root cause, we investigate the information flow of keys and values of the top layers. Our preliminary analysis reveals a clear distribution: values are predominantly derived from the bottom layer, while keys draw more information from both bottom and middle layers. Building upon this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach, where top-layer KV caches are directly derived from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed method reduces cache memory by 50% while achieving lower validation perplexity than the standard Transformer decoder, establishing it as a memory-efficient, high-performance architectural alternative.
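The two variants can be illustrated with a small sketch. It assumes the fusion is a weighted sum with one learnable scalar per source cache, which is a simplification chosen for illustration; the abstract only says the fusion is learnable. Function names, the `weights` dictionary, and the list-of-lists cache layout are hypothetical.

```python
def fuse_caches(bottom, middle, w_bottom, w_middle):
    """Weighted sum of two cached tensors stored as [position][channel] lists."""
    return [[w_bottom * b + w_middle * m for b, m in zip(row_b, row_m)]
            for row_b, row_m in zip(bottom, middle)]

def fusedkv_top_layer(k_bottom, v_bottom, k_middle, v_middle, weights):
    """Build a top layer's K and V from bottom- and middle-layer caches.

    Assumed form: scalar weights, which would be trained in a real model.
    The keys passed in are taken to be post-RoPE already.
    """
    k_top = fuse_caches(k_bottom, k_middle, weights["k_bottom"], weights["k_middle"])
    v_top = fuse_caches(v_bottom, v_middle, weights["v_bottom"], weights["v_middle"])
    return k_top, v_top

def fusedkv_lite_top_layer(k_middle, v_bottom):
    """Lite variant: reuse middle-layer keys and bottom-layer values directly."""
    return k_middle, v_bottom

weights = {"k_bottom": 0.25, "k_middle": 0.75, "v_bottom": 1.0, "v_middle": 0.0}
k_top, v_top = fusedkv_top_layer([[1.0, 2.0]], [[10.0, 20.0]],
                                 [[3.0, 4.0]], [[0.0, 0.0]], weights)
```

Scalar weights are used in this sketch because a scalar-weighted sum of two keys rotated by the same position-dependent rotation is itself a key rotated by that rotation, so no rotary embedding needs to be re-applied. Per-channel weights would not commute with the rotation in general.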
[21] BERnaT: Basque Encoders for Representing Natural Textual Diversity
Ekhi Azurmendi,Joseba Fernandez de Landa,Jaione Bengoetxea,Maite Heredia,Julen Etxaniz,Mikel Zubillaga,Ander Soraluze,Aitor Soroa
Main category: cs.CL
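TL;DR: The BERnaT paper presents an approach to language model pre-training that aims to capture linguistic diversity (dialectal, historical, and informal language) rather than relying only on standardized text. Experiments on Basque show that models trained on combined, diverse data perform better on natural language understanding tasks.
Details
Motivation: Language models depend on large text corpora filtered for quality, which can unintentionally exclude non-standard varieties, reduce robustness, and reinforce representational biases. The authors argue that language models should capture the full spectrum of language variation.
Contribution: 1) Builds diverse Basque corpora (standard, social media, and historical text); 2) introduces the BERnaT family of encoder models; 3) designs an evaluation framework that separates standard and diverse NLU tasks.
Method: 1) Pre-trains BERnaT models in three configurations: standard, diverse, and combined; 2) uses the evaluation framework to analyze performance on standard and diverse tasks.
Result: Models trained on both standard and diverse data perform better across all task types without compromising accuracy on standard benchmarks.
Insight: Linguistic diversity is important for building more inclusive and generalizable language models.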
Abstract: Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on Basque, a morphologically rich and low-resource language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.
[22] Jina-VLM: Small Multilingual Vision Language Model
Andreas Koukounas,Georgios Mastrapas,Florian Hönicke,Sedigheh Eslami,Guillaume Roncari,Scott Martens,Han Xiao
Main category: cs.CL
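TL;DR: Jina-VLM is a 2.4B-parameter multilingual vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale models.
Details
Motivation: Current vision-language models perform poorly in multilingual settings, especially at small scale. Jina-VLM aims to fill this gap with efficient multilingual visual understanding.
Contribution: Combines a SigLIP2 vision encoder with a Qwen3 language model through an attention-pooling connector to deliver efficient multilingual visual question answering.
Method: Uses SigLIP2 as the vision encoder and Qwen3 as the language backbone, joined by an attention-pooling connector that handles arbitrary-resolution images in a token-efficient way.
Result: On standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while keeping competitive text-only performance.
Insight: An efficient multimodal connector design (such as attention pooling) can give small models strong cross-modal understanding.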
Abstract: We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance.
[23] SkillFactory: Self-Distillation For Learning Cognitive Behaviors
Zayne Sprague,Jack Lu,Manya Wadhwa,Sedrick Keh,Mengye Ren,Greg Durrett
Main category: cs.CL
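TL;DR: SkillFactory learns cognitive behaviors in advance during supervised fine-tuning (SFT) through self-distillation. Instead of distilling from a stronger model, it rearranges the model's own samples into skill-formatted training data. This primes the model with cognitive skills before reinforcement learning (RL) and yields better generalization and robustness after RL.
Details
Motivation: A key question is how to get a model to learn, before RL, cognitive skills that the base model does not exhibit. SkillFactory addresses this with self-generated training data rather than distillation from a stronger model.
Contribution: 1) Proposes SkillFactory, which uses self-distillation to learn cognitive skills during SFT; 2) shows the approach pays off in the RL stage, with better performance on harder task variants; 3) shows that the model actually uses these skills and is more robust on out-of-domain tasks.
Method: During SFT, samples generated by the model itself are rearranged into training data that follows the format of the target skills ("silver" SFT traces), priming the model to acquire cognitive behaviors during RL.
Result: 1) SkillFactory initialization helps the model generalize to harder tasks after RL; 2) the model does use the learned cognitive skills; 3) SkillFactory models are more robust on out-of-domain tasks.
Insight: Inductive biases learned through SFT before RL help models use cognitive skills more robustly.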
Abstract: Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren’t exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These “silver” SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.
cs.CV [Back]
[24] Hierarchical Process Reward Models are Symbolic Vision Learners
Shan Zhang,Aotian Chen,Kai Zou,Jindong Gu,Yuan Xue,Anton van den Hengel
Main category: cs.CV
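TL;DR: This paper proposes a symbolic visual learning approach built on hierarchical process reward modeling. A self-supervised symbolic auto-encoder encodes diagrams into structured primitives and their interrelationships in latent space, and hierarchical step-level parsing rewards enforce consistency.
Details
Motivation: Symbolic computer vision aims for interpretable visual understanding through logical rules and structured representations, which differs fundamentally from pixel-based visual models.
Contribution: 1. Proposes a self-supervised symbolic auto-encoder; 2. introduces hierarchical process reward modeling; 3. designs stabilization mechanisms to balance exploration and exploitation; 4. develops a system that combines neural and symbolic capabilities.
Method: A symbolic auto-encoder encodes diagrams into primitives and their relations. A hierarchical reward model (enforcing point-on-line, line-on-shape, and shape-on-relation consistency) and stabilization mechanisms improve reconstruction.
Result: A 98.2% reduction in MSE for geometric diagram reconstruction, a 0.6% margin over GPT-4o on chart reconstruction, a 13% gain on the MathGlance perception benchmark, and a 3% gain on the MathVerse and GeoQA reasoning benchmarks.
Insight: Combining the reasoning ability of neural networks with the interpretability of symbolic models, through hierarchical rewards and stabilization mechanisms, substantially improves performance on symbolic vision tasks.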
Abstract: Symbolic computer vision represents diagrams through explicit logical rules and structured representations, enabling interpretable understanding in machine vision. This requires fundamentally different learning paradigms from pixel-based visual models. Symbolic visual learners parse diagrams into geometric primitives (points, lines, and shapes), whereas pixel-based learners operate on textures and colors. We propose a novel self-supervised symbolic auto-encoder that encodes diagrams into structured primitives and their interrelationships within the latent space, and decodes them through our executable engine to reconstruct the input diagrams. Central to this architecture is Symbolic Hierarchical Process Reward Modeling, which applies hierarchical step-level parsing rewards to enforce point-on-line, line-on-shape, and shape-on-relation consistency. Since vanilla reinforcement learning exhibits poor exploration in the policy space during diagram reconstruction, we introduce stabilization mechanisms to balance exploration and exploitation. We fine-tune our symbolic encoder on downstream tasks, developing a neuro-symbolic system that integrates the reasoning capabilities of neural networks with the interpretability of symbolic models through reasoning-grounded visual rewards. Evaluations across reconstruction, perception, and reasoning tasks demonstrate the effectiveness of our approach: achieving a 98.2% reduction in MSE for geometric diagram reconstruction, surpassing GPT-4o by 0.6% with a 7B model on chart reconstruction, and improving by +13% on the MathGlance perception benchmark, and by +3% on MathVerse and GeoQA reasoning benchmarks.
[25] Drainage: A Unifying Framework for Addressing Class Uncertainty
Yasser Taha,Grégoire Montavon,Nils Körber
Main category: cs.CV
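TL;DR: This paper proposes a unified framework for handling label noise and class uncertainty in deep learning. It introduces a "drainage node" that reallocates probability mass, improving classification in high-noise settings.
Details
Motivation: Modern deep learning struggles with label noise, class ambiguity, and robust rejection of anomalous samples, and needs a single framework that addresses all three.
Contribution: Proposes a unified framework based on a drainage node that reallocates output probability to absorb uncertain or noisy samples while preserving end-to-end training and differentiability.
Method: Adds a drainage node at the network output that reallocates probability mass toward uncertainty; the approach targets instance-dependent and asymmetric label noise.
Result: With noisy labels on CIFAR-10/100, accuracy improves by up to 9% in the high-noise regime; on real-world datasets such as mini-WebVision, the method matches or surpasses state-of-the-art approaches.
Insight: The drainage node naturally absorbs noisy or anomalous samples, stabilizes decision boundaries, and extends to semi-supervised dataset cleaning and open-set applications.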
Abstract: Modern deep learning faces significant challenges with noisy labels, class ambiguity, as well as the need to robustly reject out-of-distribution or corrupted samples. In this work, we propose a unified framework based on the concept of a “drainage node” which we add at the output of the network. The node serves to reallocate probability mass toward uncertainty, while preserving desirable properties such as end-to-end training and differentiability. This mechanism provides a natural escape route for highly ambiguous, anomalous, or noisy samples, particularly relevant for instance-dependent and asymmetric label noise. In systematic experiments involving the addition of varying proportions of instance-dependent noise or asymmetric noise to CIFAR-10/100 labels, our drainage formulation achieves an accuracy increase of up to 9% over existing approaches in the high-noise regime. Our results on real-world datasets, such as mini-WebVision, mini-ImageNet and Clothing-1M, match or surpass existing state-of-the-art methods. Qualitative analysis reveals a denoising effect, where the drainage neuron consistently absorbs corrupt, mislabeled, or outlier data, leading to more stable decision boundaries. Furthermore, our drainage formulation enables applications well beyond classification, with immediate benefits for web-scale, semi-supervised dataset cleaning, and open-set applications.
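A minimal sketch of the forward pass with a drainage output, assuming the node is simply one extra logit appended after the class logits and normalized in the same softmax. The training loss is not described in the abstract and is not reproduced here. The rejection rule (abstain when the drainage probability exceeds every class probability) and the function names are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_drainage(logits):
    """Return (predicted class or None, drainage probability).

    The last logit is treated as the drainage node; the rest are class logits.
    """
    probs = softmax(logits)
    class_probs, drain = probs[:-1], probs[-1]
    best = max(range(len(class_probs)), key=class_probs.__getitem__)
    if drain > class_probs[best]:
        return None, drain  # assumed rule: abstain when drainage dominates
    return best, drain

confident = predict_with_drainage([4.0, 0.0, 0.0, 1.0])
ambiguous = predict_with_drainage([0.1, 0.0, 0.2, 3.0])
```

Because the drainage probability comes out of the same softmax as the class probabilities, the whole output stays differentiable, which is consistent with the end-to-end training property the abstract mentions.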
[26] Does Head Pose Correction Improve Biometric Facial Recognition?
Justin Norman,Hany Farid
Main category: cs.CV
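TL;DR: The study finds that applying head-pose correction and image restoration indiscriminately reduces facial recognition accuracy, while selectively applying CFR-GAN together with CodeFormer yields meaningful improvements.
Details
Motivation: Real-world face images often suffer from non-frontal poses, occlusions, or low quality, which lowers recognition accuracy. The study asks whether head-pose correction and image restoration can help.
Contribution: 1. Evaluates the effect of three restoration techniques on facial recognition; 2. finds that selectively combining CFR-GAN with CodeFormer improves recognition accuracy.
Method: Uses a model-agnostic, large-scale evaluation pipeline to test three restoration approaches: 3D reconstruction (NextFace), 2D frontalization (CFR-GAN), and feature enhancement (CodeFormer).
Result: Naive application of these techniques degrades recognition accuracy, but selective application of CFR-GAN combined with CodeFormer yields meaningful improvements.
Insight: Restoration techniques should be applied with care; selectively combined methods may suit real-world use better.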
Abstract: Biometric facial recognition models often demonstrate significant decreases in accuracy when processing real-world images, often characterized by poor quality, non-frontal subject poses, and subject occlusions. We investigate whether targeted, AI-driven, head-pose correction and image restoration can improve recognition accuracy. Using a model-agnostic, large-scale, forensic-evaluation pipeline, we assess the impact of three restoration approaches: 3D reconstruction (NextFace), 2D frontalization (CFR-GAN), and feature enhancement (CodeFormer). We find that naive application of these techniques substantially degrades facial recognition accuracy. However, we also find that selective application of CFR-GAN combined with CodeFormer yields meaningful improvements.
[27] Flux4D: Flow-based Unsupervised 4D Reconstruction
Jingkang Wang,Henry Che,Yun Chen,Ze Yang,Lily Goli,Sivabalan Manivasagam,Raquel Urtasun
Main category: cs.CV
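TL;DR: Flux4D is an unsupervised framework for 4D reconstruction of large-scale dynamic scenes. It directly predicts 3D Gaussians and their motion dynamics, giving efficient and scalable reconstruction.
Details
Motivation: Existing differentiable rendering methods (e.g., NeRF and 3DGS) face scalability limits and depend on annotations when reconstructing dynamic scenes, while existing self-supervised methods are constrained by per-scene optimization and sensitivity to hyperparameters.
Contribution: Proposes Flux4D, which predicts 3D Gaussians and their motion dynamics in a fully unsupervised way and supports efficient reconstruction of large-scale dynamic scenes without pre-trained supervised models or annotations.
Method: Uses only photometric losses and an "as static as possible" regularization, and learns to decompose dynamic elements directly from raw data by training across many scenes.
Result: Experiments on outdoor driving datasets show that Flux4D significantly outperforms existing methods in scalability, generalization, and reconstruction quality.
Insight: Flux4D shows the potential of unsupervised multi-scene training for dynamic scene reconstruction and suggests a direction for future work.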
Abstract: Reconstructing large-scale dynamic scenes from visual observations is a fundamental challenge in computer vision, with critical implications for robotics and autonomous systems. While recent differentiable rendering methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have achieved impressive photorealistic reconstruction, they suffer from scalability limitations and require annotations to decouple actor motion. Existing self-supervised methods attempt to eliminate explicit annotations by leveraging motion cues and geometric priors, yet they remain constrained by per-scene optimization and sensitivity to hyperparameter tuning. In this paper, we introduce Flux4D, a simple and scalable framework for 4D reconstruction of large-scale dynamic scenes. Flux4D directly predicts 3D Gaussians and their motion dynamics to reconstruct sensor observations in a fully unsupervised manner. By adopting only photometric losses and enforcing an “as static as possible” regularization, Flux4D learns to decompose dynamic elements directly from raw data without requiring pre-trained supervised models or foundational priors simply by training across many scenes. Our approach enables efficient reconstruction of dynamic scenes within seconds, scales effectively to large datasets, and generalizes well to unseen environments, including rare and unknown objects. Experiments on outdoor driving datasets show Flux4D significantly outperforms existing methods in scalability, generalization, and reconstruction quality.
[28] Object Counting with GPT-4o and GPT-5: A Comparative Study
Richard Füzesséry,Kaziwa Saleh,Sándor Szénási,Zoltán Vámossy
Main category: cs.CV
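TL;DR: This paper compares the multimodal LLMs GPT-4o and GPT-5 on zero-shot object counting and shows that, with text prompts alone, they can match or even exceed existing zero-shot methods.
Details
Motivation: Existing object counting methods rely on large amounts of annotated data or visual exemplars, whereas multimodal LLMs have strong reasoning and data understanding abilities and may be able to count without supervision.
Contribution:
- Uses the visual capabilities of GPT-4o and GPT-5 for zero-shot object counting;
- Evaluates both models on the FSC-147 and CARPK datasets, surpassing existing zero-shot methods in some cases.
Method: Guides GPT-4o and GPT-5 with textual prompts to count objects zero-shot, without visual exemplars or supervised training.
Result: On FSC-147, performance is comparable to state-of-the-art zero-shot methods and in some cases surpasses them.
Insight: Multimodal LLMs such as GPT-4o and GPT-5 show promise for completing complex visual tasks without annotated data.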
Abstract: Zero-shot object counting attempts to estimate the number of object instances belonging to novel categories that the vision model performing the counting has never encountered during training. Existing methods typically require large amounts of annotated data and often require visual exemplars to guide the counting process. However, large language models (LLMs) are powerful tools with remarkable reasoning and data understanding abilities, which suggests the possibility of utilizing them for counting tasks without any supervision. In this work we aim to leverage the visual capabilities of two multi-modal LLMs, GPT-4o and GPT-5, to perform object counting in a zero-shot manner using only textual prompts. We evaluate both models on the FSC-147 and CARPK datasets and provide a comparative analysis. Our findings show that the models achieve performance comparable to the state-of-the-art zero-shot approaches on FSC-147 and, in some cases, even surpass them.
[29] LLM-Guided Material Inference for 3D Point Clouds
Nafiseh Izadyar,Teseo Schneider
Main category: cs.CV
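TL;DR: Proposes a two-stage LLM-based method that infers material composition directly from 3D point clouds with coarse segmentations, without task-specific training, showing that LLMs can serve as general-purpose priors for geometric reasoning and material understanding.
Details
Motivation: Existing 3D shape datasets and models focus mainly on geometry and overlook the material properties that determine how objects appear, so a material inference method that needs no annotated data is required.
Contribution: 1. Proposes a two-stage LLM method that separates reasoning about object semantics from material inference; 2. shows the potential of LLMs as general-purpose priors in zero-shot tasks.
Method: 1. In the first stage, an LLM predicts the object's semantics; 2. in the second stage, it assigns materials to each geometric segment based on those semantics. Both stages are zero-shot.
Result: Across 1,000 shapes from Fusion/ABS and ShapeNet, the method achieves high semantic and material plausibility.
Insight: Language models can bridge geometric reasoning and material understanding without task-specific training.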
Abstract: Most existing 3D shape datasets and models focus solely on geometry, overlooking the material properties that determine how objects appear. We introduce a two-stage large language model (LLM) based method for inferring material composition directly from 3D point clouds with coarse segmentations. Our key insight is to decouple reasoning about what an object is from what it is made of. In the first stage, an LLM predicts the object’s semantics; in the second stage, it assigns plausible materials to each geometric segment, conditioned on the inferred semantics. Both stages operate in a zero-shot manner, without task-specific training. Because existing datasets lack reliable material annotations, we evaluate our method using an LLM-as-a-Judge implemented in DeepEval. Across 1,000 shapes from Fusion/ABS and ShapeNet, our method achieves high semantic and material plausibility. These results demonstrate that language models can serve as general-purpose priors for bridging geometric reasoning and material understanding in 3D data.
[30] 2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition
Liying Lu,Raphaël Achddou,Sabine Süsstrunk
Main category: cs.CV
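TL;DR: This paper proposes a general noise synthesis method for low-light imaging that needs only a single noisy image and a single dark frame. It requires no large paired dataset and achieves leading performance on multiple benchmarks.
Details
Motivation: Raw images taken in low light are very noisy because of low photon counts and sensor noise. Learning-based denoisers need many clean-noisy image pairs for training, which are hard to collect. Noise synthesis is an alternative, but existing methods either rely on simplified parametric models or need large paired datasets.
Contribution: Proposes a general noise synthesis method that needs only one noisy image and one dark frame, achieving high-fidelity synthesis through Poisson-based modeling of signal-dependent noise and a Fourier-domain spectral sampling algorithm.
Method: 1. Models signal-dependent noise with a Poisson distribution; 2. generates diverse signal-independent noise with a Fourier-domain spectral sampling algorithm; 3. combines the two to produce high-fidelity synthetic noisy images.
Result: On multiple low-light denoising benchmarks, the synthesized noisy images substantially improve denoiser performance, outperforming existing methods.
Insight: Minimal data (one noisy image plus one dark frame) is enough to synthesize noise with high fidelity, which addresses the data acquisition problem in low-light denoising and offers a practical data augmentation route.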
Abstract: Raw images taken in low-light conditions are very noisy due to low photon count and sensor noise. Learning-based denoisers have the potential to reconstruct high-quality images. For training, however, these denoisers require large paired datasets of clean and noisy images, which are difficult to collect. Noise synthesis is an alternative to large-scale data acquisition: given a clean image, we can synthesize a realistic noisy counterpart. In this work, we propose a general and practical noise synthesis method that requires only one single noisy image and one single dark frame per ISO setting. We represent signal-dependent noise with a Poisson distribution and introduce a Fourier-domain spectral sampling algorithm to accurately model signal-independent noise. The latter generates diverse noise realizations that maintain the spatial and statistical properties of real sensor noise. As opposed to competing approaches, our method neither relies on simplified parametric models nor on large sets of clean-noisy image pairs. Our synthesis method is not only accurate and practical, it also leads to state-of-the-art performances on multiple low-light denoising benchmarks.
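The two noise components can be illustrated with a one-dimensional toy sketch in plain Python. Several assumptions are made here that the abstract does not state: the signal-independent part is sampled by keeping the dark frame's Fourier magnitudes and randomizing the phases, the sensor `gain` is treated as known, and a single row stands in for a full raw image. All function names and sample values are hypothetical.

```python
import cmath
import math
import random

def dft(x):
    """Naive discrete Fourier transform of a short sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Naive inverse discrete Fourier transform."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def sample_dark_noise(dark_row, rng):
    """New noise realization with the dark row's magnitude spectrum (assumed scheme)."""
    n = len(dark_row)
    ref = dft(dark_row)
    spec = [0j] * n
    spec[0] = complex(ref[0].real, 0.0)  # keep the mean level unchanged
    for k in range(1, n // 2 + 1):
        if 2 * k == n:  # Nyquist bin must stay real for a real-valued output
            spec[k] = complex(abs(ref[k]) * rng.choice([-1.0, 1.0]), 0.0)
        else:
            phase = rng.uniform(0.0, 2.0 * math.pi)
            spec[k] = abs(ref[k]) * cmath.exp(1j * phase)
            spec[n - k] = spec[k].conjugate()  # Hermitian symmetry
    return [v.real for v in idft(spec)]

def sample_poisson(lam, rng):
    """Knuth's Poisson sampler; adequate for the small counts of low-light data."""
    if lam <= 0:
        return 0
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def synthesize_noisy_row(clean_row, dark_row, gain, rng):
    """Poisson shot noise on the clean signal plus a sampled dark-noise realization."""
    shot = [gain * sample_poisson(c / gain, rng) for c in clean_row]
    dark = sample_dark_noise(dark_row, rng)
    return [s + d for s, d in zip(shot, dark)]

rng = random.Random(0)
dark_row = [0.3, -0.1, 0.2, 0.0, -0.4, 0.1, 0.05, -0.15]
clean_row = [5.0, 5.0, 8.0, 8.0, 2.0, 2.0, 0.0, 0.0]
noisy_row = synthesize_noisy_row(clean_row, dark_row, gain=0.5, rng=rng)
```

Randomizing the phases while keeping the magnitudes fixed produces a different realization on every call with the same power spectrum as the reference dark row, which is one simple way to preserve spatial correlation in the synthetic noise.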
[31] SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding
Hongpei Zheng,Shijie Li,Yanran Li,Hujun Yin
Main category: cs.CV
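TL;DR: Proposes the H$^2$U3D dataset and the SpatialReasoner framework for large-scale 3D scene understanding, using active perception and reinforcement learning to explore efficiently.
Details
Motivation: The spatial reasoning of current vision-language models is limited to small, room-scale scenes and needs to extend to house-scale 3D environments.
Contribution: 1) The H$^2$U3D dataset, which supports house-scale 3D scene understanding; 2) the SpatialReasoner framework, which combines active perception with reinforcement learning.
Method: 1) An automated annotation pipeline builds hierarchical visual representations; 2) a two-stage training strategy (a supervised cold start followed by reinforcement learning with an adaptive exploration reward).
Result: State-of-the-art performance on H$^2$U3D using only 3-4 images on average, compared with 16+ images for the baselines.
Insight: A coarse-to-fine active exploration paradigm substantially improves efficiency and suits large-scale 3D scene understanding.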
Abstract: Spatial reasoning in large-scale 3D environments remains challenging for current vision-language models, which are typically constrained to room-scale scenarios. We introduce H$^2$U3D (Holistic House Understanding in 3D), a 3D visual question answering dataset designed for house-scale scene understanding. H$^2$U3D features multi-floor environments spanning up to three floors and 10-20 rooms, covering more than 300 m$^2$. Through an automated annotation pipeline, it constructs hierarchical coarse-to-fine visual representations and generates diverse question-answer pairs with chain-of-thought annotations. We further propose SpatialReasoner, an active perception framework that autonomously invokes spatial tools to explore 3D scenes based on textual queries. SpatialReasoner is trained through a two-stage strategy: a supervised cold start followed by reinforcement learning with an adaptive exploration reward that promotes efficient exploration while discouraging redundant operations. Extensive experiments demonstrate that SpatialReasoner achieves state-of-the-art performance on H$^2$U3D, outperforming strong baselines including GPT-4o and Gemini-2.5-Pro. Notably, our method attains superior results while using only 3-4 images in total on average, compared to baselines requiring 16+ images, highlighting the effectiveness of our coarse-to-fine active exploration paradigm.
[32] NavMapFusion: Diffusion-based Fusion of Navigation Maps for Online Vectorized HD Map Construction
Thomas Monninger,Zihan Zhang,Steffen Staab,Sihao Ding
Main category: cs.CV
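TL;DR: NavMapFusion is a diffusion-based framework for online construction of high-definition maps. It fuses low-fidelity navigation maps with high-fidelity sensor data to improve environment representations for autonomous driving.
Details
Motivation: Autonomous driving needs accurate environment representations, but traditional high-definition (HD) maps cannot be updated in real time. The authors use widely available standard-definition (SD) navigation maps as a prior and combine them with sensor data to build maps online, addressing map updates in a changing world.
Contribution: Proposes the NavMapFusion framework, which applies diffusion models to map fusion, shows how coarse navigation maps can guide online map construction, and examines the advantages of diffusion models in handling map discrepancies.
Method: Uses a diffusion model for iterative denoising conditioned on navigation maps and high-fidelity sensor data. Discrepancies between the prior map and online perception are treated as noise: the diffusion process suppresses outdated segments and reinforces consistent regions.
Result: On the nuScenes benchmark, NavMapFusion reaches a 21.4% relative improvement at a 100 m range, with even stronger improvements at larger perception ranges, while remaining real-time capable.
Insight: Diffusion models naturally treat map inconsistencies as noise, which lets them fuse multi-source data efficiently into accurate, up-to-date environment representations.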
Abstract: Accurate environmental representations are essential for autonomous driving, providing the foundation for safe and efficient navigation. Traditionally, high-definition (HD) maps provide this representation of the static road infrastructure to the autonomous system a priori. However, because the real world is constantly changing, such maps must be constructed online from on-board sensor data. Navigation-grade standard-definition (SD) maps are widely available, but their resolution is insufficient for direct deployment. Instead, they can be used as a coarse prior to guide the online map construction process. We propose NavMapFusion, a diffusion-based framework that performs iterative denoising conditioned on high-fidelity sensor data and on low-fidelity navigation maps. This paper strives to answer: (1) How can coarse, potentially outdated navigation maps guide online map construction? (2) What advantages do diffusion models offer for map fusion? We demonstrate that diffusion-based map construction provides a robust framework for map fusion. Our key insight is that discrepancies between the prior map and online perception naturally correspond to noise within the diffusion process; consistent regions reinforce the map construction, whereas outdated segments are suppressed. On the nuScenes benchmark, NavMapFusion conditioned on coarse road lines from OpenStreetMap data reaches a 21.4% relative improvement at a 100 m range, and even stronger improvements at larger perception ranges, while maintaining real-time capabilities. By fusing low-fidelity priors with high-fidelity sensor data, the proposed method generates accurate and up-to-date environment representations, guiding towards safer and more reliable autonomous driving. The code is available at https://github.com/tmonnin/navmapfusion
[33] Step-by-step Layered Design Generation
Faizan Farooq Khan,K J Joseph,Koustava Goswami,Mohamed Elhoseiny,Balaji Vasan Srinivasan
Main category: cs.CV
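TL;DR: This paper proposes a new problem setting, Step-by-Step Layered Design Generation, and a model, SLEDGE, that uses a multimodal large language model to generate designs step by step. A new evaluation suite is used to validate it.
Details
Motivation: Existing design generation methods usually treat the problem as single-step generation and overlook the step-by-step complexity of the design process, so an approach closer to real practice is needed.
Contribution: Proposes the Step-by-Step Layered Design Generation problem setting and the SLEDGE model, which generates designs incrementally from a designer's instructions, and introduces a new evaluation dataset and benchmark.
Method: SLEDGE uses a multimodal LLM to model each design update as an atomic, layered change over the previous state, grounded in the instruction.
Result: Experiments show that SLEDGE outperforms existing approaches adapted to the step-by-step setting.
Insight: Design generation is a process of progressive refinement; a step-by-step, layered approach is closer to real design practice and has practical value.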
Abstract: Design generation, in its essence, is a step-by-step process where designers progressively refine and enhance their work through careful modifications. Despite this fundamental characteristic, existing approaches mainly treat design synthesis as a single-step generation problem, significantly underestimating the inherent complexity of the creative process. To bridge this gap, we propose a novel problem setting called Step-by-Step Layered Design Generation, which tasks a machine learning model with generating a design that adheres to a sequence of instructions from a designer. Leveraging recent advancements in multi-modal LLMs, we propose SLEDGE: Step-by-step LayEred Design GEnerator to model each update to a design as an atomic, layered change over its previous state, while being grounded in the instruction. To complement our new problem setting, we introduce a new evaluation suite, including a dataset and a benchmark. Our exhaustive experimental analysis and comparison with state-of-the-art approaches tailored to our new setup demonstrate the efficacy of our approach. We hope our work will attract attention to this pragmatic and under-explored research area.
[34] ProtoEFNet: Dynamic Prototype Learning for Inherently Interpretable Ejection Fraction Estimation in Echocardiography
Yeganeh Ghamary,Victoria Wu,Hooman Vaseli,Christina Luong,Teresa Tsang,Siavash Bigdeli,Purang Abolmaesumi
Main category: cs.CV
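TL;DR: ProtoEFNet is a video-based prototype learning model for continuous ejection fraction (EF) regression in echocardiography; it provides clinical interpretability by learning dynamic spatiotemporal prototypes.
Details
Motivation: EF is a key metric for assessing cardiac function, but traditional estimation relies on manual tracing and expert knowledge, which is time-consuming and subject to interobserver variability. Most deep learning methods are black-box models whose lack of transparency limits clinical trust, and existing post-hoc explanation methods do not guide the model's internal reasoning. Contribution: Proposes ProtoEFNet, a dynamic prototype learning model that is inherently interpretable because it captures clinically meaningful cardiac motion patterns, and designs the Prototype Angular Separation (PAS) loss, which strengthens discriminative representations across the continuous EF spectrum.
Method: The model learns dynamic spatiotemporal prototypes that capture clinically meaningful cardiac motion patterns; the PAS loss uses angular separation to make prototypes more discriminative across EF values.
Result: On the EchonetDynamic dataset, ProtoEFNet reaches accuracy on par with its non-interpretable counterpart; the PAS loss raises the F1 score from 77.67±2.68 to 79.64±2.10.
Insight: Dynamic prototype learning can balance model performance and clinical interpretability, and the PAS loss improves the model's ability to distinguish different EF values.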
Abstract: Ejection fraction (EF) is a crucial metric for assessing cardiac function and diagnosing conditions such as heart failure. Traditionally, EF estimation requires manual tracing and domain expertise, making the process time-consuming and subject to interobserver variability. Most current deep learning methods for EF prediction are black-box models with limited transparency, which reduces clinical trust. Some post-hoc explainability methods have been proposed to interpret the decision-making process after the prediction is made. However, these explanations do not guide the model’s internal reasoning and therefore offer limited reliability in clinical applications. To address this, we introduce ProtoEFNet, a novel video-based prototype learning model for continuous EF regression. The model learns dynamic spatiotemporal prototypes that capture clinically meaningful cardiac motion patterns. Additionally, the proposed Prototype Angular Separation (PAS) loss enforces discriminative representations across the continuous EF spectrum. Our experiments on the EchonetDynamic dataset show that ProtoEFNet can achieve accuracy on par with its non-interpretable counterpart while providing clinically relevant insight. The ablation study shows that the proposed loss boosts performance with a 2% increase in F1 score from 77.67$\pm$2.68 to 79.64$\pm$2.10. Our source code is available at: https://github.com/DeepRCL/ProtoEF
[35] SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation
Yu Yuan,Tharindu Wickremasinghe,Zeeshan Nadir,Xijun Wang,Yiheng Chi,Stanley H. Chan
Main category: cs.CV
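TL;DR: SeeU proposes a 2D→4D→2D learning framework that models 4D dynamics to generate new visual content that is continuous and physically consistent, showing potential for temporal generation, spatial generation, and video editing.
Details
Motivation: Current visual understanding, prediction, and generation methods operate directly on 2D observations, which leads to suboptimal performance because the 4D world (3D space + time) is not modeled. Contribution: Proposes SeeU, which learns continuous 4D dynamics through a 2D→4D→2D framework and generates unseen visual content.
Method: SeeU first reconstructs the 4D world from sparse monocular 2D frames (2D→4D), then learns continuous 4D dynamics on a low-rank representation under physical constraints (discrete 4D→continuous 4D), and finally rolls the world forward in time, re-projects it to 2D, and generates unseen regions (4D→2D).
Result: SeeU achieves continuous, physically consistent novel visual generation and is applicable to unseen temporal generation, unseen spatial generation, and video editing.
Insight: Modeling dynamics in 4D makes it possible to generate physically consistent visual content more effectively and offers a new approach to complex visual tasks.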
Abstract: Images and videos are discrete 2D projections of the 4D world (3D space + time). Most visual understanding, prediction, and generation methods operate directly on 2D observations, leading to suboptimal performance. We propose SeeU, a novel approach that learns continuous 4D dynamics and generates unseen visual content. The principle behind SeeU is a new 2D$\to$4D$\to$2D learning framework. SeeU first reconstructs the 4D world from sparse and monocular 2D frames (2D$\to$4D). It then learns the continuous 4D dynamics on a low-rank representation and physical constraints (discrete 4D$\to$continuous 4D). Finally, SeeU rolls the world forward in time, re-projects it back to 2D at sampled times and viewpoints, and generates unseen regions based on spatial-temporal context awareness (4D$\to$2D). By modeling dynamics in 4D, SeeU achieves continuous and physically-consistent novel visual generation, demonstrating strong potential in multiple tasks including unseen temporal generation, unseen spatial generation, and video editing.
[36] A Hybrid Deep Learning Framework with Explainable AI for Lung Cancer Classification with DenseNet169 and SVM
Md Rashidul Islam,Bakary Gibba,Altagi Abdallah Bakheit Abdelgadir
Main category: cs.CV
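TL;DR: The paper proposes a hybrid deep learning framework combining DenseNet169 and SVM for lung cancer classification, and uses explainable AI techniques (Grad-CAM and SHAP) to improve model transparency.
Details
Motivation: Early diagnosis of lung cancer is crucial for patient survival, but manual analysis of CT scans is time-consuming and error-prone, so an efficient and interpretable automated classification method is needed. Contribution: Proposes a hybrid deep learning framework that combines DenseNet169 and SVM and introduces explainable AI techniques to improve model transparency and classification performance.
Method: DenseNet169 with Squeeze-and-Excitation (SE) blocks performs attention-based feature extraction, Focal Loss handles class imbalance, and an FPN fuses multi-scale features; a separate SVM is trained on MobileNetV2 features. Grad-CAM and SHAP are integrated for interpretability analysis.
Result: Both the DenseNet169 and SVM models reach 98% classification accuracy, which the authors take as evidence of potential for real-world medical use.
Insight: Combining deep learning with explainable AI can raise classification accuracy while improving the trustworthiness and transparency of the model, which matters for medical diagnosis.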
Abstract: Lung cancer is a very deadly disease worldwide, and its early diagnosis is crucial for increasing patient survival rates. Computed tomography (CT) scans are widely used for lung cancer diagnosis as they can give detailed lung structures. However, manual interpretation is time-consuming and prone to human error. To surmount this challenge, the study proposes a deep learning-based automatic lung cancer classification system to enhance detection accuracy and interpretability. The study uses the IQ-OTH/NCCD lung cancer dataset, a public CT scan dataset whose cases are categorized as Normal, Benign, or Malignant. The classifier is a DenseNet169 that includes Squeeze-and-Excitation blocks for attention-based feature extraction, Focal Loss for handling class imbalance, and a Feature Pyramid Network (FPN) for multi-scale feature fusion. In addition, an SVM model was developed using MobileNetV2 for feature extraction, improving its classification performance. For model interpretability enhancement, the study integrated Grad-CAM for the visualization of decision-making regions in CT scans and SHAP (Shapley Additive Explanations) for explanation of feature contributions within the SVM model. Intensive evaluation was performed, and it was found that both DenseNet169 and SVM models achieved 98% accuracy, suggesting their robustness for real-world medical practice. These results open up the potential for deep learning to improve the diagnosis of lung cancer through higher accuracy, transparency, and robustness.
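The abstract names Focal Loss as the remedy for class imbalance. A minimal sketch of the standard per-sample focal loss, -alpha * (1 - p_t)^gamma * log(p_t), is given below; the values of gamma and alpha and the example probabilities are assumptions for illustration, since the abstract does not state the settings used.

```python
import math

def cross_entropy(probs, target):
    """Plain cross-entropy for one sample: -log(p_t)."""
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -math.log(p_t)

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: -alpha * (1 - p_t)^gamma * log(p_t).

    gamma and alpha are illustrative defaults, not the paper's settings.
    """
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# Hypothetical softmax outputs over (Normal, Benign, Malignant), true class = 2.
easy = [0.05, 0.05, 0.90]   # confident and correct
hard = [0.60, 0.30, 0.10]   # poorly classified

# The modulating factor (1 - p_t)^gamma shrinks the loss of easy examples far
# more than that of hard ones, so hard (often minority-class) samples dominate.
ratio_ce = cross_entropy(hard, 2) / cross_entropy(easy, 2)
ratio_focal = focal_loss(hard, 2) / focal_loss(easy, 2)
```

With gamma set to 0 the expression reduces to ordinary cross-entropy, which is a convenient sanity check when wiring it into a training loop.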
[37] FireSentry: A Multi-Modal Spatio-temporal Benchmark Dataset for Fine-Grained Wildfire Spread Forecasting
Nan Zhou,Huandong Wang,Jiahao Li,Han Li,Yali Song,Qiuhua Wang,Yong Li,Xinlei Chen
Main category: cs.CV
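TL;DR: FireSentry is a multi-modal spatio-temporal benchmark dataset for fine-grained wildfire spread forecasting. Unlike existing low-resolution satellite data, it provides multi-modal data at sub-meter spatial and sub-second temporal resolution. FireSentry supports benchmarking of physics-based, data-driven, and generative models, and the paper proposes the FiReDiff paradigm, which substantially improves forecasting performance.
Details
Motivation: Existing wildfire forecasting research mostly relies on low-resolution satellite data, which makes it hard to model localized fire dynamics at high precision. Fine-grained, high-resolution datasets and methods are needed to improve forecasting. Contribution: 1) Releases FireSentry, described as the first multi-modal fine-grained wildfire dataset; 2) establishes a comprehensive benchmark; 3) proposes FiReDiff, a dual-modality forecasting paradigm with leading performance.
Method: FiReDiff first predicts future video sequences in the infrared modality with a generative model, then segments fire masks in the mask modality based on the generated dynamics.
Result: FiReDiff substantially improves performance, with gains in video quality (PSNR +39.2%) and mask accuracy (F1 +59.1%).
Insight: Combining multi-modal data (infrared and visible) with dynamics prediction is key to better fine-grained wildfire modeling.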
Abstract: Fine-grained wildfire spread prediction is crucial for enhancing emergency response efficacy and decision-making precision. However, existing research predominantly focuses on coarse spatiotemporal scales and relies on low-resolution satellite data, capturing only macroscopic fire states while fundamentally constraining high-precision localized fire dynamics modeling capabilities. To bridge this gap, we present FireSentry, a provincial-scale multi-modal wildfire dataset characterized by sub-meter spatial and sub-second temporal resolution. Collected using synchronized UAV platforms, FireSentry provides visible and infrared video streams, in-situ environmental measurements, and manually validated fire masks. Building on FireSentry, we establish a comprehensive benchmark encompassing physics-based, data-driven, and generative models, revealing the limitations of existing mask-only approaches. Our analysis proposes FiReDiff, a novel dual-modality paradigm that first predicts future video sequences in the infrared modality, and then precisely segments fire masks in the mask modality based on the generated dynamics. FiReDiff achieves state-of-the-art performance, with video quality gains of 39.2% in PSNR, 36.1% in SSIM, 50.0% in LPIPS, 29.4% in FVD, and mask accuracy gains of 3.3% in AUPRC, 59.1% in F1 score, 42.9% in IoU, and 62.5% in MSE when applied to generative models. The FireSentry benchmark dataset and FiReDiff paradigm collectively advance fine-grained wildfire forecasting and dynamic disaster simulation. The processed benchmark dataset is publicly available at: https://github.com/Munan222/FireSentry-Benchmark-Dataset.
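The mask-accuracy gains above are reported in F1 score and IoU. As a reminder of how these two metrics relate for binary fire masks, here is a small sketch; the function name and the toy masks are hypothetical and not taken from the FireSentry code.

```python
def mask_scores(pred, truth):
    """Return (IoU, F1) for two binary masks given as nested lists of 0/1."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1          # fire predicted and present
            elif p and not t:
                fp += 1          # fire predicted but absent
            elif t and not p:
                fn += 1          # fire missed
    union = tp + fp + fn
    if union == 0:               # both masks empty: treat as a perfect match
        return 1.0, 1.0
    iou = tp / union
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1

# Toy 2x3 masks: 2 true positives, 1 false positive, 1 false negative.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
iou, f1 = mask_scores(pred, truth)
```

Because F1 = 2·IoU / (1 + IoU), the two metrics always rank a pair of predictions the same way on a single mask, but their relative gains can differ considerably, as in the numbers quoted in the abstract.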
[38] ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding
Lingjun Zhao,Yandong Luo,James Hay,Lu Gan
Main category: cs.CV
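TL;DR: ShelfGaussian is a Gaussian-based 3D scene understanding framework supervised by off-the-shelf vision foundation models (VFMs); it supports multi-modal and open-vocabulary tasks and performs strongly.
Details
Motivation: Existing Gaussian-based methods for 3D scene understanding either rely on closed-set semantic labels and neglect rendering ability, or rely on purely 2D self-supervision, which degrades geometry. ShelfGaussian aims to combine the advantages of multi-modal input and open-vocabulary supervision. Contribution: 1. Proposes a Multi-Modal Gaussian Transformer that lets Gaussians query features from multiple sensor modalities; 2. introduces a Shelf-Supervised Learning Paradigm that jointly optimizes features at the 2D and 3D levels; 3. achieves state-of-the-art performance in open-vocabulary 3D semantic understanding.
Method: 1. A Multi-Modal Gaussian Transformer extracts features from multi-modal sensors; 2. the Shelf-Supervised Learning Paradigm jointly optimizes Gaussian features at the 2D image and 3D scene levels; 3. VFMs provide the supervision signal.
Result: Achieves state-of-the-art zero-shot semantic occupancy prediction on Occ3D-nuScenes and is further evaluated on a real unmanned ground vehicle (UGV).
Insight: By combining multi-modal input with open-vocabulary supervision, Gaussian-based methods show stronger generalization and practicality in 3D scene understanding.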
Abstract: We introduce ShelfGaussian, an open-vocabulary multi-modal Gaussian-based 3D scene understanding framework supervised by off-the-shelf vision foundation models (VFMs). Gaussian-based methods have demonstrated superior performance and computational efficiency across a wide range of scene understanding tasks. However, existing methods either model objects as closed-set semantic Gaussians supervised by annotated 3D labels, neglecting their rendering ability, or learn open-set Gaussian representations via purely 2D self-supervision, which leads to degraded geometry and limits them to camera-only settings. To fully exploit the potential of Gaussians, we propose a Multi-Modal Gaussian Transformer that enables Gaussians to query features from diverse sensor modalities, and a Shelf-Supervised Learning Paradigm that efficiently optimizes Gaussians with VFM features jointly at 2D image and 3D scene levels. We evaluate ShelfGaussian on various perception and planning tasks. Experiments on Occ3D-nuScenes demonstrate its state-of-the-art zero-shot semantic occupancy prediction performance. ShelfGaussian is further evaluated on an unmanned ground vehicle (UGV) to assess its in-the-wild performance across diverse urban scenarios. Project website: https://lunarlab-gatech.github.io/ShelfGaussian/.
[39] MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification
Yujian Zhao,Hankun Liu,Guanglin Niu
Main category: cs.CV
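TL;DR: The paper proposes MOS, a framework that mitigates the modality gap between optical and SAR images and learns modality-consistent features for cross-modal ship re-identification.
Details
Motivation: The large modality gap between optical and SAR images is the main challenge in cross-modal ship re-identification, and existing methods struggle to align features across modalities. Contribution: MOS substantially improves cross-modal ship re-identification through modality-consistent representation learning combined with cross-modal data generation and feature fusion.
Method: 1. The MCRL module aligns cross-modal feature distributions using SAR image denoising and a class-wise modality alignment loss; 2. the CDGF module uses a diffusion model to generate cross-modal samples and fuses them with the original features to improve alignment and discriminability.
Result: On the HOSS ReID dataset, MOS significantly surpasses existing methods under all evaluation protocols, with R1 accuracy gains of up to 16.4%.
Insight: Combining modality-consistent learning with cross-modal data synthesis is an effective way to tackle cross-modal ship re-identification.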
Abstract: Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery has recently emerged as a critical yet underexplored task in maritime intelligence and surveillance. However, the substantial modality gap between optical and SAR images poses a major challenge for robust identification. To address this issue, we propose MOS, a novel framework designed to mitigate the optical-SAR modality gap and achieve modality-consistent feature learning for optical-SAR cross-modal ship ReID. MOS consists of two core components: (1) Modality-Consistent Representation Learning (MCRL) applies SAR image denoising and a class-wise modality alignment loss to align intra-identity feature distributions across modalities. (2) Cross-modal Data Generation and Feature fusion (CDGF) leverages a Brownian bridge diffusion model to synthesize cross-modal samples, which are subsequently fused with original features during inference to enhance alignment and discriminability. Extensive experiments on the HOSS ReID dataset demonstrate that MOS significantly surpasses state-of-the-art methods across all evaluation protocols, achieving notable improvements of +3.0%, +6.2%, and +16.4% in R1 accuracy under the ALL to ALL, Optical to SAR, and SAR to Optical settings, respectively. The code and trained models will be released upon publication.
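CDGF relies on a Brownian bridge diffusion model. The defining property of a Brownian bridge is that the process is pinned at both ends, so it maps one sample to its paired counterpart instead of mapping to pure noise. The sketch below samples an intermediate state using the textbook bridge variance t(1 - t); the variance schedule, the `scale` parameter, and the toy feature vectors are assumptions, and the paper's actual formulation may differ.

```python
import math
import random

def brownian_bridge_sample(x0, y, t, scale=1.0, rng=random):
    """Sample a state at time t in [0, 1] on a bridge pinned at x0 (t=0) and y (t=1).

    Mean: (1 - t) * x0 + t * y. Standard deviation: scale * sqrt(t * (1 - t)),
    which vanishes at both endpoints. `scale` is a hypothetical knob.
    """
    std = scale * math.sqrt(t * (1.0 - t))
    return [(1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x0, y)]

# Toy stand-ins for a source-modality sample and its target-modality pair.
source = [0.2, -1.0, 0.5]
target = [1.0, 0.0, -0.5]
rng = random.Random(0)

start = brownian_bridge_sample(source, target, 0.0, rng=rng)   # equals source
middle = brownian_bridge_sample(source, target, 0.5, rng=rng)  # noisiest point
end = brownian_bridge_sample(source, target, 1.0, rng=rng)     # equals target
```

The noise peaks at t = 0.5 and disappears at the endpoints, which is what lets a model trained on such a process translate between paired domains.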
[40] ViDiC: Video Difference Captioning
Jiangtao Wu,Shihao Li,Zhaozhou Bian,Yuanxing Zhang,Jialu Chen,Runzhe Wen,An Ping,Yiwen He,Jiakai Wang,Jiaheng Liu
Main category: cs.CV
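TL;DR: The paper introduces the Video Difference Captioning (ViDiC) task and its dataset ViDiC-1K, which evaluate how well multimodal large language models describe differences and similarities between video pairs, and reveals a performance gap in existing models.
Details
Motivation: Existing vision-language systems are weak at understanding differences in dynamic scenes; in particular, they fail to capture motion continuity, event evolution, or editing consistency over time. Contribution: 1. Proposes the ViDiC task and the ViDiC-1K dataset; 2. designs a dual-checklist evaluation framework; 3. reveals a performance gap across nineteen multimodal models.
Method: Using the ViDiC-1K dataset and the dual-checklist framework, the accuracy of similarity and difference descriptions is measured separately under the LLM-as-a-Judge protocol.
Result: Experiments show a significant performance gap in existing models' difference perception and comparative description abilities.
Insight: ViDiC-1K lays a foundation for video understanding, edit awareness, and comparative reasoning, and can serve as a challenging benchmark for future research.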
Abstract: Understanding visual differences between dynamic scenes requires the comparative perception of compositional, spatial, and temporal changes, a capability that remains underexplored in existing vision-language systems. While prior work on Image Difference Captioning (IDC) has enabled models to describe semantic changes between static images, these approaches fail to capture motion continuity, event evolution, or editing consistency over time. We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset, designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to provide fine-grained descriptions of similarities and differences between video pairs. ViDiC-1K comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items, covering seven categories: subject, style, background, cinematography, motion, location, and playback techniques. To ensure reliable evaluation, we propose a dual-checklist framework that measures the accuracy of similarity and difference separately, based on the LLM-as-a-Judge protocol. Experiments on nineteen representative multimodal models reveal a significant performance gap in their comparative description and difference perception abilities. We hope ViDiC-1K can be a challenging benchmark that lays a solid foundation for advancing video understanding, edit awareness, and comparative reasoning in multimodal intelligence.
[41] Label-Efficient Hyperspectral Image Classification via Spectral FiLM Modulation of Low-Level Pretrained Diffusion Features
Yuzhen Hu,Biplab Banerjee,Saurabh Prasad
Main category: cs.CV
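TL;DR: The paper proposes a label-efficient hyperspectral image classification method that extracts spatial features from a pretrained diffusion model and integrates spectral and spatial information through a lightweight FiLM fusion module, improving classification performance.
Details
Motivation: Label scarcity and low spatial resolution limit classification performance in hyperspectral imaging (HSI), so a method is needed that makes effective use of pretrained models and sparse labels. Contribution: 1) Uses a pretrained diffusion model to extract low-level spatial features; 2) proposes a lightweight FiLM fusion module that adaptively fuses spectral and spatial information; 3) achieves strong classification performance under sparse labels.
Method: 1) Extract high-resolution features from a pretrained diffusion model at early denoising timesteps; 2) design a FiLM module that modulates the frozen spatial features with spectral information; 3) perform robust multimodal learning under sparse supervision.
Result: Experiments show that the method outperforms existing approaches on two recent hyperspectral datasets using only the provided sparse training labels. Ablation studies confirm the benefit of diffusion-derived features and spectral-aware fusion.
Insight: Pretrained diffusion models can support label-efficient representation learning across domains, offering a new direction for remote sensing and other scientific imaging tasks.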
Abstract: Hyperspectral imaging (HSI) enables detailed land cover classification, yet low spatial resolution and sparse annotations pose significant challenges. We present a label-efficient framework that leverages spatial features from a frozen diffusion model pretrained on natural images. Our approach extracts low-level representations from high-resolution decoder layers at early denoising timesteps, which transfer effectively to the low-texture structure of HSI. To integrate spectral and spatial information, we introduce a lightweight FiLM-based fusion module that adaptively modulates frozen spatial features using spectral cues, enabling robust multimodal learning under sparse supervision. Experiments on two recent hyperspectral datasets demonstrate that our method outperforms state-of-the-art approaches using only the provided sparse training labels. Ablation studies further highlight the benefits of diffusion-derived features and spectral-aware fusion. Overall, our results indicate that pretrained diffusion models can support domain-agnostic, label-efficient representation learning for remote sensing and broader scientific imaging tasks.
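FiLM (feature-wise linear modulation) scales and shifts each feature channel with parameters predicted from a conditioning input: out_c = gamma_c(z) * x_c + beta_c(z). The sketch below applies it to a single pixel, with the spectrum as the conditioning vector z and a frozen spatial feature as x. The linear predictors and all weights are hypothetical; the abstract does not describe the fusion module's exact architecture.

```python
def linear(weights, bias, vec):
    """Dense layer: one output per weight row, weights @ vec + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def film(spatial_feat, spectrum, w_gamma, b_gamma, w_beta, b_beta):
    """Modulate a frozen spatial feature with spectrum-derived gamma and beta."""
    gamma = linear(w_gamma, b_gamma, spectrum)   # per-channel scale
    beta = linear(w_beta, b_beta, spectrum)      # per-channel shift
    return [g * x + b for g, x, b in zip(gamma, spatial_feat, beta)]

# One pixel: 2 frozen spatial channels, 3 spectral bands (toy numbers).
spatial_feat = [1.0, 2.0]
spectrum = [0.5, 1.0, 2.0]
w_gamma = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_gamma = [0.0, 0.0]
w_beta = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
b_beta = [0.0, 1.0]

fused = film(spatial_feat, spectrum, w_gamma, b_gamma, w_beta, b_beta)

# With gamma = 1 and beta = 0 the module is the identity, a common
# initialization that leaves the frozen features untouched at the start.
zeros = [[0.0] * 3, [0.0] * 3]
identity_out = film(spatial_feat, spectrum, zeros, [1.0, 1.0], zeros, [0.0, 0.0])
```

Only the small gamma and beta predictors are trained in such a design, which is consistent with the abstract's description of a lightweight module on top of frozen diffusion features.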
[42] Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation
Xieji Li,Siyuan Yan,Yingsheng Liu,H. Peter Soyer,Monika Janda,Victoria Mar,Zongyuan Ge
Main category: cs.CV
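TL;DR: The paper proposes a medical vision-language pretraining framework that combines multi-agent data generation (MAGEN) with ontology-based multi-aspect knowledge enhancement (O-MAKE) to handle noise and long-text complexity in medical image-text pairs.
Details
Motivation: Existing medical vision-language pretraining methods handle noisy web-collected data and long texts poorly, so more robust methods are needed. Contribution: 1) Designs the MAGEN system to improve data quality; 2) proposes O-MAKE, which decomposes long texts into knowledge aspects and supports fine-grained alignment and medical concept modeling.
Method: 1) MAGEN produces high-quality descriptions through foundation-model-assisted captioning and retrieval-based verification; 2) O-MAKE decomposes texts and models concept relationships through ontology-guided mechanisms.
Result: In dermatology, the method achieves state-of-the-art zero-shot performance on disease classification and cross-modal retrieval, and the authors plan to release Derm1M-AgentAug, a dataset of over 400k image-text pairs.
Insight: Agent-generated high-quality data and ontology-guided knowledge decomposition can effectively address noise and long-text complexity in medical vision-language pretraining.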
Abstract: Vision-language pretraining (VLP) has emerged as a powerful paradigm in medical image analysis, enabling representation learning from large-scale image-text pairs without relying on expensive manual annotations. However, existing methods often struggle with the noise inherent in web-collected data and the complexity of unstructured long medical texts. To address these challenges, we propose a novel VLP framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining. First, MAGEN enhances data quality by synthesizing knowledge-enriched descriptions via a foundation model-assisted captioning and retrieval-based verification pipeline. Second, O-MAKE addresses the difficulty of learning from long, unstructured texts by decomposing them into distinct knowledge aspects. This facilitates fine-grained alignment at both global and patch levels, while explicitly modeling medical concept relationships through ontology-guided mechanisms. We validate our framework in the field of dermatology, where comprehensive experiments demonstrate the effectiveness of each component. Our approach achieves state-of-the-art zero-shot performance on disease classification and cross-modal retrieval tasks across eight datasets. Our code and the augmented dataset Derm1M-AgentAug, comprising over 400k skin-image-text pairs, will be released at https://github.com/SiyuanYan1/Derm1M.
[43] KeyPointDiffuser: Unsupervised 3D Keypoint Learning via Latent Diffusion Models
Rhys Newbury,Juyan Zhang,Tin Tran,Hanna Kurniawati,Dana Kulić
Main category: cs.CV
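TL;DR: The paper proposes KeyPointDiffuser, an unsupervised 3D keypoint learning method that uses latent diffusion models to learn spatially structured keypoints from point cloud data and uses them to reconstruct full shapes.
Details
Motivation: Existing unsupervised keypoint learning methods are generally not designed for unconditional generative settings, which limits their use in modern 3D generative pipelines. This paper aims to close that gap. Contribution: 1. Proposes an unsupervised framework that extracts spatially structured 3D keypoints from point cloud data; 2. these keypoints serve as a compact, interpretable representation that conditions a diffusion model (EDM) for shape reconstruction.
Method: The method combines unsupervised keypoint learning with latent diffusion models: 1. learn keypoints from point clouds; 2. use the keypoints as conditioning information for an Elucidated Diffusion Model (EDM) that reconstructs the 3D shape.
Result: Performs strongly across diverse object categories, improving keypoint consistency by 6 percentage points over prior approaches.
Insight: The learned keypoints show repeatable spatial structure across instances and support smooth interpolation in keypoint space, indicating that they capture geometric variation.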
Abstract: Understanding and representing the structure of 3D objects in an unsupervised manner remains a core challenge in computer vision and graphics. Most existing unsupervised keypoint methods are not designed for unconditional generative settings, restricting their use in modern 3D generative pipelines; our formulation explicitly bridges this gap. We present an unsupervised framework for learning spatially structured 3D keypoints from point cloud data. These keypoints serve as a compact and interpretable representation that conditions an Elucidated Diffusion Model (EDM) to reconstruct the full shape. The learned keypoints exhibit repeatable spatial structure across object instances and support smooth interpolation in keypoint space, indicating that they capture geometric variation. Our method achieves strong performance across diverse object categories, yielding a 6 percentage-point improvement in keypoint consistency compared to prior approaches.
[44] GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers
Zhiye Song,Steve Dai,Ben Keller,Brucek Khailany
Main category: cs.CV
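TL;DR: GalaxyDiT is a training-free method for accelerating video generation; through guidance alignment and adaptive proxy selection it substantially improves the computational efficiency of Diffusion Transformers while keeping generated videos close to the base model's output.
Details
Motivation: Although Diffusion Transformers and classifier-free guidance (CFG) perform well in video generation, their high computational cost limits wider use in downstream applications. GalaxyDiT aims to address this. Contribution: 1. Proposes a training-free acceleration method; 2. optimizes computational reuse through guidance alignment and adaptive proxy selection; 3. substantially improves speed while maintaining quality.
Method: Uses rank-order correlation analysis to select the best proxy for each model and applies guidance alignment to accelerate video generation, with no additional training.
Result: Achieves 1.87× and 2.37× speedups on Wan2.1-1.3B and Wan2.1-14B with drops of only 0.97% and 0.72% on the VBench-2.0 benchmark. At high speedup rates, PSNR exceeds prior state-of-the-art approaches by 5 to 10 dB.
Insight: Better proxy selection and guidance alignment can substantially improve the efficiency of diffusion models without sacrificing generation quality, offering a practical route to large-scale video generation.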
Abstract: Diffusion models have revolutionized video generation, becoming essential tools in creative content generation and physical simulation. Transformer-based architectures (DiTs) and classifier-free guidance (CFG) are two cornerstones of this success, enabling strong prompt adherence and realistic video quality. Despite their versatility and superior performance, these models require intensive computation. Each video generation requires dozens of iterative steps, and CFG doubles the required compute. This inefficiency hinders broader adoption in downstream applications. We introduce GalaxyDiT, a training-free method to accelerate video generation with guidance alignment and systematic proxy selection for reuse metrics. Through rank-order correlation analysis, our technique identifies the optimal proxy for each video model, across model families and parameter scales, thereby ensuring optimal computational reuse. We achieve $1.87\times$ and $2.37\times$ speedup on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, our approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
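The proxy selection step is described as a rank-order correlation analysis. One way to picture it: given a reference signal and several cheap candidate proxies measured over the same denoising steps, keep the proxy whose ranking agrees best with the reference. The sketch below uses Spearman's rank correlation for this; the signal names and numbers are hypothetical, and the abstract does not specify which quantities GalaxyDiT actually compares.

```python
def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def best_proxy(reference, proxies):
    """Name of the proxy whose ranking best matches the reference signal."""
    return max(proxies, key=lambda name: spearman(reference, proxies[name]))

# Hypothetical per-step measurements.
reference = [0.10, 0.40, 0.20, 0.80, 0.60]
proxies = {
    "proxy_a": [1.0, 4.1, 2.2, 9.0, 5.5],   # same ordering as the reference
    "proxy_b": [0.9, 0.1, 0.5, 0.3, 0.7],   # largely different ordering
}
chosen = best_proxy(reference, proxies)
```

A rank-based measure is a natural fit here because a reuse decision typically depends on which steps rank as most or least changed, not on the absolute scale of the proxy.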
[45] GeoVideo: Introducing Geometric Regularization into Video Generation Model
Yunpeng Bai,Shaoheng Fang,Chaohui Yu,Fan Wang,Qixing Huang
Main category: cs.CV
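TL;DR: GeoVideo introduces geometric regularization losses to strengthen a video generation model's ability to model 3D structure, addressing the geometric inconsistency and implausible motion that arise when explicit 3D modeling is absent.
Details
Motivation: Current video generation models operate mainly in 2D pixel space and lack explicit modeling of 3D structure, so generated videos show temporally inconsistent geometry, implausible motion, and structural artifacts. Contribution: The core contribution is introducing geometric regularization losses into video generation, using depth prediction and a multi-view geometric loss to strengthen 3D consistency.
Method: The method augments latent diffusion models with per-frame depth prediction and proposes a multi-view geometric loss that aligns predicted depth maps across frames in a shared 3D coordinate system.
Result: Experiments across multiple datasets show that the generated videos are significantly better than existing baselines in spatio-temporal coherence, shape consistency, and physical plausibility.
Insight: Modeling 3D geometric structure can substantially improve the quality and consistency of video generation models, especially for complex scenes and motion.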
Abstract: Recent advances in video generation have enabled the synthesis of high-quality and visually realistic clips using diffusion transformer models. However, most existing approaches operate purely in the 2D pixel space and lack explicit mechanisms for modeling 3D structures, often resulting in temporally inconsistent geometries, implausible motions, and structural artifacts. In this work, we introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction. We adopt depth as the geometric representation because of the great progress in depth prediction and its compatibility with image-based latent encoders. Specifically, to enforce structural consistency over time, we propose a multi-view geometric loss that aligns the predicted depth maps across frames within a shared 3D coordinate system. Our method bridges the gap between appearance generation and 3D structure modeling, leading to improved spatio-temporal coherence, shape consistency, and physical plausibility. Experiments across multiple datasets show that our approach produces significantly more stable and geometrically consistent results than existing baselines.
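Aligning depth maps "within a shared 3D coordinate system" can be pictured as a reprojection check: lift each pixel of frame A to 3D using its depth, move it into frame B's camera, and compare the resulting depth with what frame B predicts at the pixel it lands on. The sketch below implements that check with a pinhole camera. It is only an illustration under stated assumptions: the relative pose and intrinsics are taken as given, nearest-pixel lookup replaces interpolation, and the paper's actual loss may be formulated differently.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with the given depth to a 3D point in camera coordinates."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)

def transform(point, rotation, translation):
    """Apply a rigid transform (3x3 rotation, 3-vector translation) to a 3D point."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def depth_consistency_loss(depth_a, depth_b, rotation, translation, intrinsics):
    """Mean absolute gap between frame A's reprojected depth and frame B's depth.

    rotation/translation map frame A camera coordinates to frame B's.
    Pixels that land outside frame B or behind its camera are skipped.
    """
    fx, fy, cx, cy = intrinsics
    height, width = len(depth_b), len(depth_b[0])
    total, count = 0.0, 0
    for v, row in enumerate(depth_a):
        for u, d in enumerate(row):
            x, y, z = transform(unproject(u, v, d, fx, fy, cx, cy),
                                rotation, translation)
            if z <= 0:
                continue
            u_b = round(fx * x / z + cx)   # nearest pixel in frame B
            v_b = round(fy * y / z + cy)
            if 0 <= u_b < width and 0 <= v_b < height:
                total += abs(z - depth_b[v_b][u_b])
                count += 1
    return total / count if count else 0.0

# Toy setup: static camera, flat wall 2 units away, hypothetical intrinsics.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
no_motion = [0.0, 0.0, 0.0]
intrinsics = (2.0, 2.0, 1.0, 1.0)            # fx, fy, cx, cy
depth_a = [[2.0] * 3 for _ in range(3)]
depth_shifted = [[d + 0.5 for d in row] for row in depth_a]

consistent = depth_consistency_loss(depth_a, depth_a, identity, no_motion, intrinsics)
inconsistent = depth_consistency_loss(depth_a, depth_shifted, identity, no_motion, intrinsics)
```

A loss of this kind is zero only when the two depth maps describe the same surface, which is the property that penalizes geometry drifting from frame to frame.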
[46] Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles
Haicheng Liao,Huanming Shen,Bonan Wang,Yongkang Li,Yihong Tang,Chengyue Wang,Dingyi Zhuang,Kehua Chen,Hai Yang,Chengzhong Xu,Zhenning Li
Main category: cs.CV
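TL;DR: ThinkDeeper is a world-model-based framework that addresses visual grounding in autonomous driving by predicting future spatial states; combined with a hypergraph-guided decoder, it achieves robust localization and performs strongly on multiple benchmarks.
Details
Motivation: Existing visual grounding methods for autonomous driving struggle with ambiguous, context-dependent instructions because they lack reasoning over 3D spatial relations and scene evolution. Contribution: Proposes the ThinkDeeper framework, comprising a Spatial-Aware World Model (SA-WM) and a hypergraph-guided decoder; builds a multi-source dataset, DrivePilot; and achieves leading performance on multiple benchmarks.
Method: SA-WM distills the current scene into a command-aware latent state and rolls out a sequence of future latent states; the hypergraph-guided decoder hierarchically fuses these states with the multimodal input. The DrivePilot dataset's semantic annotations are generated by an LLM pipeline using RAG and CoT prompting.
Result: Across six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g. It is robust and efficient in challenging scenes and retains superior performance even when trained on 50% of the data.
Insight: Forward-looking reasoning and hypergraph modeling of spatial dependencies can substantially improve the accuracy and robustness of visual grounding in autonomous driving.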
Abstract: Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on the DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.
[47] Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
Shojiro Yamabe,Futa Waseda,Daiki Shiono,Tsubasa Takahashi
Main category: cs.CV
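TL;DR: The paper proposes Text-Printed Image (TPI), which generates synthetic images by rendering text directly on a white canvas, bridging the image-text modality gap at low cost and substantially improving large vision-language models (LVLMs) trained on text-only data.
Details
Motivation: Training large vision-language models conventionally requires many image-text pairs, which are costly and restricted to collect. Text is widely available and easy to edit, but training on it directly yields limited gains because of the modality gap. TPI aims to solve this at low cost. Contribution: 1. Proposes TPI, which generates synthetic images by simply rendering text; 2. shows that TPI outperforms diffusion-generated images for text-centric training; 3. explores TPI as a low-cost data augmentation strategy.
Method: TPI renders the text directly onto a plain white canvas to produce a synthetic image; this preserves the text's semantics and integrates easily into existing training pipelines.
Result: Across four models and seven benchmarks, TPI enables more effective text-centric training of LVLMs than images generated by a diffusion model.
Insight: TPI offers a low-cost, effective training approach, demonstrates the potential of text-centric training, and points toward fully automated data generation.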
Abstract: Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we propose a Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into arbitrary existing LVLM training pipelines at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do so. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.
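The rendering step described above is simple enough to sketch directly. The example below draws a wrapped text description in black on a white canvas using the Pillow library. The canvas size, margin, wrap width, and font are assumptions chosen for illustration; the abstract does not give the settings the authors used, and this is not their implementation.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont

def text_printed_image(text, size=(448, 448), margin=16, chars_per_line=60):
    """Render `text` in black on a plain white canvas and return the image.

    size, margin and chars_per_line are hypothetical defaults.
    """
    image = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()          # Pillow's built-in font
    wrapped = "\n".join(textwrap.wrap(text, width=chars_per_line))
    draw.multiline_text((margin, margin), wrapped, fill="black", font=font)
    return image

description = ("A bar chart with three bars labeled A, B and C. "
               "Bar B is the tallest and bar C is the shortest.")
tpi = text_printed_image(description)
darkest, brightest = tpi.convert("L").getextrema()
```

The resulting image can be fed to the vision encoder in place of a real photograph, so an existing image-text training pipeline needs no structural change.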
[48] Difference Decomposition Networks for Infrared Small Target Detection
Chen Hu,Mingyu Zhou,Shuai Yuan,Hongbo Hu,Xiangyu Qiu,Junhai Luo,Tian Pu,Xiyin Li
Main category: cs.CV
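TL;DR: The paper proposes BDM, a lightweight module based on basis decomposition, extends it into a family of modules (SD²M, SD³M, and TD²M), and builds the SD²Net and STD²Net networks for single-frame and multi-frame infrared small target detection respectively, reaching state-of-the-art performance.
Details
Motivation: Infrared small target detection suffers from indistinct target texture and severe background clutter, so targets are obscured by the background; the remedy is to enhance targets and suppress backgrounds. Contribution: Proposes the BDM module and its extensions (SD²M, SD³M, and TD²M) and builds the SD²Net and STD²Net networks, substantially improving single-frame and multi-frame infrared small target detection.
Method: BDM decomposes a complex feature into several basis features, enhancing useful information while eliminating redundancy; the extensions SD²M, SD³M, and TD²M handle spatial and temporal features; SD²Net and STD²Net combine these modules for single-frame and multi-frame detection respectively.
Result: On the SISTD task, SD²Net performs well against most established networks; on the MISTD datasets, STD²Net reaches an mIoU of 87.68%, well above SD²Net's 64.97%.
Insight: Basis decomposition effectively enhances target features and suppresses redundant information, and adding temporal information further improves multi-frame detection.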
Abstract: Infrared small target detection (ISTD) faces two major challenges: a lack of discernible target texture and severe background clutter, which results in the background obscuring the target. To enhance targets and suppress backgrounds, we propose the Basis Decomposition Module (BDM) as an extensible and lightweight module based on basis decomposition, which decomposes a complex feature into several basis features and enhances certain information while eliminating redundancy. Extending BDM leads to a series of modules, including the Spatial Difference Decomposition Module (SD$^\mathrm{2}$M), Spatial Difference Decomposition Downsampling Module (SD$^\mathrm{3}$M), and Temporal Difference Decomposition Module (TD$^\mathrm{2}$M). Based on these modules, we develop the Spatial Difference Decomposition Network (SD$^\mathrm{2}$Net) for single-frame ISTD (SISTD) and the Spatiotemporal Difference Decomposition Network (STD$^\mathrm{2}$Net) for multi-frame ISTD (MISTD). SD$^\mathrm{2}$Net integrates SD$^\mathrm{2}$M and SD$^\mathrm{3}$M within an adapted U-shaped architecture. We employ TD$^\mathrm{2}$M to introduce motion information, which transforms SD$^\mathrm{2}$Net into STD$^\mathrm{2}$Net. Extensive experiments on SISTD and MISTD datasets demonstrate state-of-the-art (SOTA) performance. On the SISTD task, SD$^\mathrm{2}$Net performs well compared to most established networks. On the MISTD datasets, STD$^\mathrm{2}$Net achieves an mIoU of 87.68%, outperforming SD$^\mathrm{2}$Net, which achieves an mIoU of 64.97%. Our code is available at https://github.com/greekinRoma/IRSTD_HC_Platform.
[49] Fairness-Aware Fine-Tuning of Vision-Language Models for Medical Glaucoma Diagnosis
Zijian Gu,Yuxi Liu,Zhenhao Zhang,Song Wang
Main category: cs.CV
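TL;DR: The paper proposes fairness-aware fine-tuning methods (FR-LoRA, GR-LoRA, and Hybrid-LoRA) for medical vision-language models (VLMs) that target diagnostic fairness alongside accuracy. A MaxAccGap loss and inverse frequency weighting substantially reduce accuracy disparities across demographic groups.
Details
Motivation: Medical vision-language models reach expert-level performance on diagnostic tasks but show significant accuracy disparities across demographic groups. To make medical AI fair, the authors propose parameter-efficient fairness-aware fine-tuning. Contribution: 1) Proposes the differentiable MaxAccGap loss, enabling end-to-end optimization of accuracy parity across groups; 2) designs three fine-tuning methods, FR-LoRA, GR-LoRA, and Hybrid-LoRA; 3) demonstrates a parameter-efficient fairness optimization scheme with only 0.24% trainable parameters.
Method: The methods build on Low-Rank Adaptation (LoRA). FR-LoRA adds MaxAccGap regularization to the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both.
Result: Evaluated on 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablations show that strong regularization achieves the best fairness with minimal accuracy trade-off, and race-specific optimization yields a 60% disparity reduction.
Insight: Strong regularization gives the best fairness at little cost in accuracy, and race-specific optimization reduces disparities by 60%. The approach suits deployment in resource-constrained healthcare settings.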
Abstract: Vision-language models achieve expert-level performance on medical imaging tasks but exhibit significant diagnostic accuracy disparities across demographic groups. We introduce fairness-aware Low-Rank Adaptation for medical VLMs, combining parameter efficiency with explicit fairness optimization. Our key algorithmic contribution is a differentiable MaxAccGap loss that enables end-to-end optimization of accuracy parity across demographic groups. We propose three methods: FR-LoRA integrates MaxAccGap regularization into the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both mechanisms. Evaluated on 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies reveal that strong regularization strength achieves optimal fairness with minimal accuracy trade-off, and race-specific optimization yields 60% disparity reduction. Our approach requires only 0.24% trainable parameters, enabling practical deployment of fair medical AI in resource-constrained healthcare settings.
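The two mechanisms named in the abstract can be illustrated with a small sketch. Accuracy itself is not differentiable, so one plausible relaxation (an assumption here, not the paper's stated definition) replaces each group's accuracy with the mean probability assigned to the true label and penalizes the gap between the best and worst group. Inverse frequency weighting gives each sample a weight inversely proportional to its group's size. All names and numbers below are hypothetical.

```python
def soft_group_accuracy(probs_true, groups):
    """Mean probability assigned to the true label, per demographic group."""
    totals, counts = {}, {}
    for p, g in zip(probs_true, groups):
        totals[g] = totals.get(g, 0.0) + p
        counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}

def max_acc_gap(probs_true, groups):
    """Gap between the best and worst group under the soft-accuracy proxy."""
    acc = soft_group_accuracy(probs_true, groups)
    return max(acc.values()) - min(acc.values())

def inverse_frequency_weights(groups):
    """Per-sample weights n / (k * n_g), so every group carries equal total weight."""
    counts = {}
    for g in groups:
        counts[g] = counts.get(g, 0) + 1
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]

# Toy batch: probability of the true label and the group of each sample.
probs_true = [0.9, 0.8, 0.7, 0.4]
groups = ["a", "a", "a", "b"]

gap = max_acc_gap(probs_true, groups)          # soft accuracy: a = 0.8, b = 0.4
weights = inverse_frequency_weights(groups)    # minority group "b" is upweighted

# Regularized objective in the spirit of FR-LoRA; task_loss and lam are made up.
task_loss, lam = 0.65, 1.0
total_loss = task_loss + lam * gap
```

In an autograd framework the same expressions would be written on tensors so that gradients flow through the gap term into the LoRA parameters.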
[50] Towards Object-centric Understanding for Instructional Videos
Wenliang Guo,Yu Kong
Main category: cs.CV
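TL;DR: The paper proposes an object-centric understanding paradigm that treats actions as mechanisms driving state transitions, addressing the difficulty existing methods have with flexible procedures, and introduces the Object-IVQA benchmark and an agent framework.
Details
Motivation: Existing action-centric methods struggle with real procedures, where step order varies with object states, so the authors shift to an object-centric paradigm to better model complex tasks.
Contribution: (1) Object-IVQA, a benchmark with 107 long videos and 514 open-ended question-answer pairs that evaluates four dimensions of object-centric reasoning; (2) an agent framework that orchestrates object-centric planning, perception, analysis, and generation tools, supporting explicit evidence retrieval and multi-hop reasoning.
Method: Actions are modeled through object state transitions. An agent framework combines planning, perception, analysis, and generation tools to perform evidence retrieval and multi-hop reasoning.
Result: Experiments show that existing large vision-language models struggle with object-level recognition and reasoning, while the proposed framework achieves substantial improvement.
Insight: An object-centric view models the dynamics of complex tasks more effectively, and explicit reasoning tools help improve interpretability and performance.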
Abstract: Understanding procedural activities is crucial for developing future assistive AI that can reason about complex real-world tasks. Existing action-centric methods struggle with the flexibility of real procedures, where step order varies depending on object states. In this work, we propose to shift the focus to an object-centric paradigm by regarding actions as mechanisms that drive state transitions. To advance this direction, we introduce Object-IVQA, a long-form instructional video benchmark with 107 videos and 514 open-ended question-answer pairs annotated with temporally grounded evidence. The benchmark evaluates four dimensions of object-centric reasoning, including state evolution, precondition verification, counterfactual reasoning and mistake recognition. We further propose an agent framework that orchestrates object-centric planning, perception, analysis and generation tools, enabling explicit evidence retrieval and multi-hop reasoning across disjoint segments. Experiments show that existing large vision-language models struggle in object-level recognition and reasoning, whereas our framework achieves substantial improvement.
[51] EEA: Exploration-Exploitation Agent for Long Video Understanding
Te Yang,Xiangyu Zhu,Bo Wang,Quan Chen,Peng Jiang,Zhen Lei
Main category: cs.CV
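TL;DR: The paper proposes EEA, a video agent framework that balances exploration and exploitation through a semantically guided hierarchical tree search, addressing computational cost and incomplete information coverage in long-video understanding.
Details
Motivation: Long-video understanding requires efficient navigation of large amounts of visual data, but existing methods either rely on dense preprocessing or fail to balance exploration and exploitation, which leads to high computational cost and incomplete information coverage.
Contribution: (1) The EEA framework, which balances exploration and exploitation through a semantically guided tree search; (2) dynamic updating of task-relevant semantic queries, with matching frames collected as semantic anchors; (3) a combination of intrinsic rewards from vision-language models with semantic priors, using uncertainty modeling for stable and precise evaluation.
Method: (1) A semantically guided hierarchical tree search that prioritizes semantically relevant frames; (2) dynamic updates of semantic queries and semantic anchors; (3) intrinsic rewards from vision-language models combined with semantic priors under explicit uncertainty modeling.
Result: On multiple long-video benchmarks, EEA shows superior performance and computational efficiency.
Insight: Balancing exploration and exploitation is essential for long-video understanding; dynamic semantic queries and hierarchical search can markedly improve efficiency and information coverage.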
Abstract: Long-form video understanding requires efficient navigation of extensive visual data to pinpoint sparse yet critical information. Current approaches to long-form video understanding either suffer from severe computational overhead due to dense preprocessing, or fail to effectively balance exploration and exploitation, resulting in incomplete information coverage and inefficiency. In this work, we introduce EEA, a novel video agent framework that achieves exploration-exploitation balance through semantic guidance with a hierarchical tree search process. EEA autonomously discovers and dynamically updates task-relevant semantic queries, and collects video frames closely matched to these queries as semantic anchors. During the tree search process, instead of uniform expansion, EEA preferentially explores semantically relevant frames while ensuring sufficient coverage within unknown segments. Moreover, EEA adaptively combines intrinsic rewards from vision-language models (VLMs) with semantic priors by explicitly modeling uncertainty to achieve stable and precise evaluation of video segments. Experiments across various long-video benchmarks validate the superior performance and computational efficiency of our proposed method.
[52] Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation
Seogkyu Jeon,Kibeom Hong,Hyeran Byun
Main category: cs.CV
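TL;DR: The paper proposes DPMFormer, a domain generalization framework for semantic segmentation that uses domain-aware prompt learning and contrastive learning to address visual-textual semantic misalignment, achieving state-of-the-art results on multiple benchmarks.
Details
Motivation: Existing domain generalized semantic segmentation (DGSS) methods rely on a fixed context prompt learned on a single source domain and overlook the resulting semantic misalignment between visual and textual contexts.
Contribution: (1) Domain-aware prompt learning to strengthen visual-textual semantic alignment; (2) domain-aware contrastive learning with texture perturbation to diversify the observable domains; (3) domain-robust consistency learning to make the model more resilient to environmental changes.
Method: The DPMFormer framework combines domain-aware prompt learning, contrastive learning, and consistency learning, with texture perturbation to increase data diversity.
Result: The framework sets a new state of the art on multiple DGSS benchmarks.
Insight: Adapting the context prompt to the domain and diversifying data augmentation can effectively improve generalization to unseen domains.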
Abstract: Recent domain generalized semantic segmentation (DGSS) studies have achieved notable improvements by distilling semantic knowledge from Vision-Language Models (VLMs). However, they overlook the semantic misalignment between visual and textual contexts, which arises due to the rigidity of a fixed context prompt learned on a single source domain. To this end, we present a novel domain generalization framework for semantic segmentation, namely Domain-aware Prompt-driven Masked Transformer (DPMFormer). Firstly, we introduce domain-aware prompt learning to facilitate semantic alignment between visual and textual cues. To capture various domain-specific properties with a single source dataset, we propose domain-aware contrastive learning along with texture perturbation that diversifies the observable domains. Lastly, to establish a framework resilient against diverse environmental changes, we propose domain-robust consistency learning, which guides the model to minimize discrepancies between predictions on the original and augmented images. Through experiments and analyses, we demonstrate the superiority of the proposed framework, which establishes a new state-of-the-art on various DGSS benchmarks. The code is available at https://github.com/jone1222/DPMFormer.
[53] AfroBeats Dance Movement Analysis Using Computer Vision: A Proof-of-Concept Framework Combining YOLO and Segment Anything Model
Kwaku Opoku-Ware,Gideon Opoku
Main category: cs.CV
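TL;DR: The paper proposes a computer vision framework that combines YOLO and the Segment Anything Model (SAM) for automated analysis of AfroBeats dance movements. By detecting and segmenting dancers, the system measures step counts, motion intensity, and spatial coverage. It gives a preliminary demonstration of technical feasibility but is limited by single-sample validation and the lack of baseline comparisons.
Details
Motivation: The study explores video-based dance movement analysis that needs no specialized equipment or markers, offering a new technical route to quantitative dance metrics.
Contribution: (1) A dance movement analysis framework combining YOLO and SAM; (2) quantification of step counts, motion intensity, and spatial coverage; (3) a preliminary validation of technical feasibility.
Method: The method has two parts: (1) YOLOv8 and v11 detect dancers; (2) SAM provides pixel-level segmentation for motion tracking and quantification.
Result: On a single 49-second AfroBeats dance video, the system reaches approximately 94% detection precision and 89% recall on manually inspected samples, and SAM segmentation reaches approximately 83% IoU by visual inspection. The dancer classified as primary executed 23% more steps, with 37% higher motion intensity and 42% more performance space, than dancers classified as secondary.
Insight: Pixel-level segmentation captures body configuration changes beyond what bounding boxes can represent, opening a direction for quantitative dance research. Systematic validation and comparison with other methods are still needed to establish robustness.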
Abstract: This paper presents a preliminary investigation into automated dance movement analysis using contemporary computer vision techniques. We propose a proof-of-concept framework that integrates YOLOv8 and v11 for dancer detection with the Segment Anything Model (SAM) for precise segmentation, enabling the tracking and quantification of dancer movements in video recordings without specialized equipment or markers. Our approach identifies dancers within video frames, counts discrete dance steps, calculates spatial coverage patterns, and measures rhythm consistency across performance sequences. Testing this framework on a single 49-second recording of Ghanaian AfroBeats dance demonstrates technical feasibility, with the system achieving approximately 94% detection precision and 89% recall on manually inspected samples. The pixel-level segmentation provided by SAM, achieving approximately 83% intersection-over-union with visual inspection, enables motion quantification that captures body configuration changes beyond what bounding-box approaches can represent. Analysis of this preliminary case study indicates that the dancer classified as primary by our system executed 23% more steps with 37% higher motion intensity and utilized 42% more performance space compared to dancers classified as secondary. However, this work represents an early-stage investigation with substantial limitations including single-video validation, absence of systematic ground truth annotations, and lack of comparison with existing pose estimation methods. We present this framework to demonstrate technical feasibility, identify promising directions for quantitative dance metrics, and establish a foundation for future systematic validation studies.
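For readers unfamiliar with the metrics quoted in this entry, the sketch below shows how detection precision, recall, and mask intersection-over-union are conventionally computed. The function names and the toy inputs are illustrative assumptions and are not taken from the paper's data or code.

```python
def precision_recall(tp, fp, fn):
    """Standard detection precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks given as equal-sized nested lists of 0/1."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0


# Toy counts and masks (illustrative only).
p, r = precision_recall(tp=8, fp=2, fn=2)
predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
iou = mask_iou(predicted, reference)  # 2 shared pixels out of 4 covered pixels
```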
[54] CSMapping: Scalable Crowdsourced Semantic Mapping and Topology Inference for Autonomous Driving
Zhijian Qiao,Zehuan Yu,Tong Li,Chih-Chung Chou,Wenchao Ding,Shaojie Shen
Main category: cs.CV
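TL;DR: CSMapping is a scalable crowdsourced semantic mapping and topology inference system for autonomous driving. It uses a latent diffusion model and optimization to cope with noise from low-cost sensors and achieves state-of-the-art semantic and topological mapping performance.
Details
Motivation: Crowdsourced mapping scales well for autonomous driving, but noise from low-cost sensors keeps map quality from improving as data volume grows. A system is needed whose quality keeps improving with more data.
Contribution: (1) The CSMapping system, which produces accurate semantic maps and topological road centerlines; (2) a latent diffusion model that learns a generative prior of map structure, combined with constrained optimization for noise robustness and completion; (3) smooth road centerlines obtained through clustering and kinematic refinement.
Method: (1) Semantic mapping: a latent diffusion model learns a generative prior of map structure, and constrained optimization in latent space provides robustness and completion; (2) topological mapping: confidence-weighted k-medoids clustering and kinematic refinement extract smooth centerlines from trajectories.
Result: Experiments on nuScenes, Argoverse 2, and a large proprietary dataset show state-of-the-art performance on both semantic and topological mapping.
Insight: (1) A latent diffusion model can learn a prior of map structure without paired crowdsourced/HD-map supervision; (2) the optimization design and global consistency are important for noise robustness and map completion.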
Abstract: Crowdsourcing enables scalable autonomous driving map construction, but low-cost sensor noise hinders quality from improving with data volume. We propose CSMapping, a system that produces accurate semantic maps and topological road centerlines whose quality consistently increases with more crowdsourced data. For semantic mapping, we train a latent diffusion model on HD maps (optionally conditioned on SD maps) to learn a generative prior of real-world map structure, without requiring paired crowdsourced/HD-map supervision. This prior is incorporated via constrained MAP optimization in latent space, ensuring robustness to severe noise and plausible completion in unobserved areas. Initialization uses a robust vectorized mapping module followed by diffusion inversion; optimization employs efficient Gaussian-basis reparameterization, projected gradient descent with multi-start, and latent-space factor-graph for global consistency. For topological mapping, we apply confidence-weighted k-medoids clustering and kinematic refinement to trajectories, yielding smooth, human-like centerlines robust to trajectory variation. Experiments on nuScenes, Argoverse 2, and a large proprietary dataset achieve state-of-the-art semantic and topological mapping performance, with thorough ablation and scalability studies.
[55] FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation
Yiyi Cai,Yuhan Wu,Kunhang Li,You Zhou,Bo Zheng,Haiyang Liu
Main category: cs.CV
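TL;DR: FloodDiffusion is a framework for text-driven streaming motion generation that tailors diffusion forcing to produce high-quality motion sequences with real-time latency.
Details
Motivation: Existing methods for streaming motion generation struggle with real-time latency and seamless continuity. FloodDiffusion addresses these problems with a tailored diffusion forcing framework.
Contribution: The FloodDiffusion framework, the first demonstration that a diffusion-forcing-based framework achieves state-of-the-art performance on streaming motion generation. The tailoring covers bidirectional attention, the time scheduler, and continuous text conditioning.
Method: The framework adopts diffusion forcing with three modifications: (1) bidirectional attention instead of causal attention; (2) a lower triangular time scheduler instead of a random one; (3) continuous, time-varying text conditioning.
Result: The model reaches an FID of 0.057 on the HumanML3D benchmark, a state-of-the-art result, with high-quality, real-time motion generation.
Insight: Diffusion forcing needs to be tailored to streaming tasks; bidirectional attention and the time scheduler are key to generation quality.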
Abstract: We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency. Unlike existing methods that rely on chunk-by-chunk generation or auto-regressive models with a diffusion head, we adopt a diffusion forcing framework to model this time-series generation task under time-varying control events. We find that a straightforward implementation of vanilla diffusion forcing (as proposed for video models) fails to model real motion distributions. We demonstrate that to guarantee modeling the output distribution, the vanilla diffusion forcing must be tailored to: (i) train with bi-directional attention instead of causal attention; (ii) implement a lower triangular time scheduler instead of a random one; (iii) utilize a continuous time-varying way to introduce text conditioning. With these improvements, we demonstrate for the first time that the diffusion forcing-based framework achieves state-of-the-art performance on the streaming motion generation task, reaching an FID of 0.057 on the HumanML3D benchmark. Models, code, and weights are available at https://shandaai.github.io/FloodDiffusion/
[56] OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
Zhishan Zhou,Siyuan Wei,Zengran Wang,Chunjie Wang,Xiaosheng Yan,Xiao Liu
Main category: cs.CV
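TL;DR: OpenTrack3D is a generalizable open-vocabulary 3D instance segmentation framework that uses a visual-spatial tracker and a multimodal large language model to overcome limitations of existing methods, achieving strong performance and generalization.
Details
Motivation: Existing open-vocabulary 3D instance segmentation methods rely on dataset-specific proposal networks or mesh-based superpoints, and CLIP-based classifiers have weak textual reasoning. Both limit their use in mesh-free and novel scenes, which OpenTrack3D aims to address.
Contribution: (1) A mesh-free visual-spatial tracker that builds cross-view consistent object proposals online; (2) a multimodal large language model (MLLM) in place of CLIP, strengthening compositional reasoning for complex user queries; (3) state-of-the-art performance and generalization on multiple benchmarks.
Method: (1) A 2D open-vocabulary segmenter generates masks, which are lifted to 3D point clouds using depth; (2) mask-guided instance features are extracted, and visual and spatial cues are fused to maintain instance consistency; (3) an optional superpoint refinement module further improves performance; (4) an MLLM strengthens textual reasoning.
Result: The framework achieves state-of-the-art performance on ScanNet200, Replica, ScanNet++, and SceneFun3D, showing strong generalization.
Insight: By combining online tracking with a multimodal language model, OpenTrack3D works well in mesh-free scenes and handles complex queries flexibly, offering a new technical route for robotics and AR/VR.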
Abstract: Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.
[57] Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
Subin Kim,Sangwoo Mo,Mamshad Nayeem Rizve,Yiran Xu,Difan Liu,Jinwoo Shin,Tobias Hinz
Main category: cs.CV
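TL;DR: PRIS is a framework that adaptively revises the prompt during inference in text-to-visual generation. It analyzes failure patterns in the generated outputs and redesigns the prompt accordingly, improving generation quality, notably under fine-grained evaluation.
Details
Motivation: Existing text-to-visual generation methods mainly improve quality by increasing sampling steps or the number of seeds, but because the prompt stays fixed, quality quickly plateaus. PRIS addresses this by revising the prompt adaptively.
Contribution: (1) The PRIS framework, which adaptively revises the prompt to improve generation quality; (2) a fine-grained, element-level factual correction verifier that gives more precise alignment feedback.
Method: During inference, PRIS reviews the generated visuals, identifies recurring failure patterns, redesigns the prompt, and regenerates, using the element-level verifier to assess alignment.
Result: PRIS improves results on both text-to-image and text-to-video benchmarks, including a 15% gain on VBench 2.0.
Insight: Jointly scaling prompts and visuals makes fuller use of scaling laws at inference time.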
Abstract: Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference-time. Visualizations are available at the website: https://subin-kim-cv.github.io/PRIS.
[58] CartoMapQA: A Fundamental Benchmark Dataset Evaluating Vision-Language Models on Cartographic Map Understanding
Huy Quang Ung,Guillaume Habault,Yasutaka Nishimura,Hao Niu,Roberto Legaspi,Tomoki Oya,Ryoichi Kojima,Masato Taya,Chihiro Ono,Atsunori Minamikawa,Yan Liu
Main category: cs.CV
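TL;DR: The paper introduces CartoMapQA, a benchmark dataset for evaluating how well vision-language models (LVLMs) understand cartographic maps. It contains over 2,000 samples covering multiple levels of map interpretation skills. The evaluation shows that LVLMs have significant weaknesses in map semantics and geospatial reasoning.
Details
Motivation: Although vision-language models perform well on multimodal tasks, their ability to interpret maps has not been studied in depth. CartoMapQA fills this gap and aims to advance LVLMs in map understanding.
Contribution: (1) The CartoMapQA benchmark dataset, focused on evaluating map understanding; (2) multi-level question-answering tasks; (3) evidence of current LVLM weaknesses in map semantics, geospatial reasoning, and OCR-related errors.
Method: The authors build a dataset of over 2,000 samples covering symbol recognition, information extraction, scale interpretation, and route-based reasoning, and evaluate both open-source and proprietary LVLMs on it.
Result: LVLMs perform poorly on map-specific semantics and geospatial reasoning and are prone to OCR-related errors.
Insight: CartoMapQA is a useful tool for guiding future improvements in LVLM map understanding, especially for real-world applications such as navigation and urban planning.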
Abstract: The rise of Visual-Language Models (LVLMs) has unlocked new possibilities for seamlessly integrating visual and textual information. However, their ability to interpret cartographic maps remains largely unexplored. In this paper, we introduce CartoMapQA, a benchmark specifically designed to evaluate LVLMs’ understanding of cartographic maps through question-answering tasks. The dataset includes over 2000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer. These tasks span key low-, mid- and high-level map interpretation skills, including symbol recognition, embedded information extraction, scale interpretation, and route-based reasoning. Our evaluation of both open-source and proprietary LVLMs reveals persistent challenges: models frequently struggle with map-specific semantics, exhibit limited geospatial reasoning, and are prone to Optical Character Recognition (OCR)-related errors. By isolating these weaknesses, CartoMapQA offers a valuable tool for guiding future improvements in LVLM architectures. Ultimately, it supports the development of models better equipped for real-world applications that depend on robust and reliable map understanding, such as navigation, geographic search, and urban planning. Our source code and data are openly available to the research community at: https://github.com/ungquanghuy-kddi/CartoMapQA.git
[59] V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention
Nan Sun,Zhenyu Zhang,Xixun Lin,Kun Wang,Yanmin Shang,Naibin Gu,Shuohuan Wang,Yu Sun,Hua Wu,Haifeng Wang,Yanan Cao
Main category: cs.CV
TL;DR: V-ITI is a lightweight framework designed to mitigate hallucinations in Multimodal Large Language Models (MLLMs) by detecting visual neglect dynamically and intervening only when necessary.
Details
Motivation: MLLMs often suffer from hallucinations due to visual neglect, where they fail to prioritize input images, leading to unreliable outputs. Existing methods focus on 'how to intervene' but ignore 'when to intervene', causing over-intervention.
Contribution: The paper introduces V-ITI, a framework combining a Visual Neglect Detector and a Visual Recall Intervenor to dynamically detect and mitigate hallucinations without unnecessary computational overhead.
Method: V-ITI detects visual neglect via head-level activation patterns and modulates activations using prestored visual information only when neglect is detected.
Result: Extensive experiments show V-ITI effectively reduces hallucinations across benchmarks while maintaining general task performance.
Insight: Focusing on ‘when to intervene’ rather than just ‘how to intervene’ is crucial for mitigating hallucinations efficiently.
Abstract: Multimodal Large Language Models (MLLMs) excel in numerous vision-language tasks yet suffer from hallucinations, producing content inconsistent with input visuals, that undermine reliability in precision-sensitive domains. This issue stems from a fundamental problem of visual neglect, where models fail to adequately prioritize input images. Existing methods typically alleviate hallucinations by intervening in the attention score or output logits, focusing on “how to intervene” but overlooking the prerequisite “when to intervene”, which leads to the “over-intervention” problem and subsequently introduces new hallucinations and unnecessary computational overhead. To address this gap, we first investigate the mechanism of visual neglect and reveal it can be accurately detected via head-level activation patterns in MLLMs. We thus propose V-ITI, a lightweight visual inference-time intervention framework integrating a Visual Neglect Detector that identifies visual neglect via head-level discriminative probes and a Visual Recall Intervenor that modulates activations with prestored visual activation information only when the visual neglect is detected. Extensive experiments across eight benchmarks and different MLLM families demonstrate that V-ITI consistently mitigates vision-related hallucinations while preserving general task performance.
[60] Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching
Wei Chee Yew,Hailun Xu,Sanjay Saha,Xiaotian Fan,Hiok Hian Ong,David Yuchen Wang,Kanchan Sarkar,Zhenheng Yang,Danhui Guan
Main category: cs.CV
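TL;DR: The paper proposes a hybrid content moderation framework that combines supervised classification with MLLM-boosted similarity matching for dynamic moderation in livestreams, reducing user views of unwanted livestreams by 6-8% in A/B tests.
Details
Motivation: Livestream moderation must be timely, multimodal, and robust to adversarial content, and traditional classifiers struggle with novel or subtle violations, so a more flexible and robust solution is needed.
Contribution: A hybrid moderation framework in which supervised classification handles known violations and MLLM-boosted similarity matching detects novel or subtle cases, processing multimodal inputs efficiently with high precision.
Method: A dual-pipeline design combines supervised classification with reference-based similarity matching, using a multimodal large language model (MLLM) to improve accuracy while keeping inference lightweight.
Result: The classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. A/B tests show a 6-8% reduction in views of unwanted livestreams.
Insight: The hybrid approach combines the efficiency of classification with the flexibility of similarity matching, and knowledge distillation from the MLLM improves overall performance, which suits dynamic and adversarial environments.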
Abstract: Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
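Reference-based similarity matching, as described in this abstract, can be pictured as a nearest-reference lookup over embeddings. The sketch below is a minimal illustration under stated assumptions: the embeddings are plain lists of floats, the comparison is cosine similarity, and the function names, reference labels, and the 0.9 threshold are all hypothetical. The production system's embedding model and decision rule are not described in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_reference(query, references, threshold=0.9):
    """Return (label, score) of the closest reference; label is None if the score is below threshold."""
    best_label, best_score = None, -1.0
    for label, ref in references.items():
        score = cosine(query, ref)
        if score > best_score:
            best_label, best_score = label, score
    return (best_label if best_score >= threshold else None), best_score


# Hypothetical reference bank of embeddings for previously confirmed cases.
references = {"ref_a": [1.0, 0.0], "ref_b": [0.0, 1.0]}

flagged, score = match_reference([0.9, 0.1], references)  # close to ref_a, so it is flagged
clean, _ = match_reference([1.0, 1.0], references)        # equally far from both, below threshold
```

A design point worth noting: new or subtle cases can be covered by adding reference embeddings to the bank, without retraining a classifier, which is consistent with the flexibility the abstract attributes to the similarity pipeline.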
[61] Optical Context Compression Is Just (Bad) Autoencoding
Ivan Yee Lee,Cheng Yang,Taylor Berg-Kirkpatrick
Main category: cs.CV
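TL;DR: The paper questions the practical value of vision-based context compression for language modeling, showing experimentally that simple alternatives match or surpass the vision encoder on text reconstruction and outperform it on language modeling.
Details
Motivation: DeepSeek-OCR shows that text can be reconstructed with high fidelity from a small number of vision tokens, but whether such compression helps language modeling has not been tested. The paper examines two assumptions: (1) vision-based compression has unique advantages for text reconstruction; (2) those reconstruction results are evidence that vision-based compression will be useful for language modeling.
Contribution: (1) A test of the actual effect of vision-based compression on language modeling; (2) the finding that simple methods (mean pooling and a hierarchical encoder) match or surpass the vision encoder on reconstruction and outperform it on language modeling; (3) the argument that expectations for vision-based compression outpace the evidence.
Method: Controlled experiments compare DeepSeek-OCR's vision encoder with parameter-free mean pooling and a learned hierarchical encoder, evaluating text reconstruction at matched compression ratios and testing the compressed representations on language modeling.
Result: (1) The alternatives match or surpass the vision encoder on text reconstruction; (2) on language modeling, vision-based compression fails to beat simple truncation and is outperformed by the simple alternatives.
Insight: (1) The advantages of vision-based compression may be overestimated; (2) simple non-visual methods are competitive; (3) more evidence is needed to support vision-based compression for language modeling.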
Abstract: DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. This finding has sparked excitement about vision-based context compression for language models. But the evaluation stops at reconstruction; whether these representations help language modeling remains untested. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for text reconstruction from compressed representations, and that DeepSeek-OCR’s reconstruction results are evidence that vision-based compression will be useful for language modeling. Comparing their vision encoder against simple alternatives–parameter-free mean pooling and a learned hierarchical encoder–we find that these simple approaches match or surpass vision for reconstruction at matched compression ratios, and outperform it for language modeling–where vision-based compression fails to beat truncation. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding
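To make the "parameter-free mean pooling" baseline concrete, the sketch below averages each run of consecutive token vectors into a single vector, which compresses a sequence by a fixed ratio without any learned weights. This is an illustrative reading of the term, not the paper's code: the function name is hypothetical, and details such as which embeddings are pooled and how a short final window is handled are assumptions.

```python
def mean_pool(tokens, ratio):
    """Average each run of `ratio` consecutive token vectors into one vector.

    `tokens` is a list of equal-length float lists. A final window shorter than
    `ratio` is averaged over the vectors it actually contains (an assumption).
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    pooled = []
    for start in range(0, len(tokens), ratio):
        window = tokens[start:start + ratio]
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return pooled


# Five toy 2-d token embeddings compressed at ratio 2 give three pooled vectors.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
compressed = mean_pool(tokens, ratio=2)
```

Because the operation has no parameters, any reconstruction ability comes entirely from the decoder that consumes the pooled vectors, which is what makes it a useful point of comparison for a learned vision encoder.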
[62] Thinking with Programming Vision: Towards a Unified View for Thinking with Images
Zirun Guo,Minjie Hong,Feng Zhang,Kai Jia,Tao Jin
Main category: cs.CV
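TL;DR: The paper proposes CodeVision, a flexible and scalable code-as-tool framework in which the model generates code as a universal interface for invoking image operations, addressing the performance degradation of existing multimodal large language models (MLLMs) under image orientation changes or natural corruptions.
Details
Motivation: Existing MLLMs rely on a narrow set of image tools with limited scalability and are not robust to simple image changes, which motivates a more general and flexible approach.
Contribution: (1) The CodeVision framework, which extends tool use by generating code; (2) a two-stage training method (SFT followed by RL) with a dense process reward function; (3) new datasets and a benchmark suite for evaluating robustness and multi-tool reasoning.
Method: (1) Supervised fine-tuning (SFT) on a high-quality dataset that supports complex multi-turn tool composition and error recovery; (2) reinforcement learning (RL) with a dense process reward function to encourage strategic and efficient tool use.
Result: Experiments on the Qwen2.5-VL and Qwen3-VL series show that CodeVision significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and error recovery from runtime feedback.
Insight: Code as a universal interface scales well, and dense process rewards in RL are important for reasoning on complex tasks.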
Abstract: Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at https://github.com/ByteDance-BandAI/CodeVision.
[63] AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Zichuan Lin,Yicheng Liu,Yang Yang,Lvfang Tao,Deheng Ye
Main category: cs.CV
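TL;DR: AdaptVision is an efficient vision-language model (VLM) paradigm that reduces computational overhead through adaptive visual token acquisition, trained with a reinforcement learning framework and the DTPO algorithm to balance accuracy and efficiency.
Details
Motivation: Existing efficient VLM methods compress visual tokens at a fixed ratio, so they cannot adapt to task requirements or adjust the number of visual tokens dynamically. AdaptVision aims to solve this problem.
Contribution: (1) An adaptive visual token acquisition method that starts from low-resolution images and selectively crops key regions; (2) the DTPO algorithm, which decouples the learning objective into tool learning and accuracy improvement for more effective optimization.
Method: (1) A reinforcement learning framework with Decoupled Turn Policy Optimization (DTPO) balances accuracy and efficiency; (2) visual information is acquired coarse to fine, invoking a bounding box tool to crop key regions when necessary; (3) advantage estimation is decoupled, with separate advantages for the tokens associated with each objective.
Result: On multiple VQA benchmarks, AdaptVision achieves superior performance while consuming substantially fewer visual tokens than existing efficient VLM methods.
Insight: Combining adaptive visual token acquisition with reinforcement learning can markedly improve VLM efficiency while maintaining accuracy, suggesting a direction for future efficient model design.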
Abstract: Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
[64] Stable Signer: Hierarchical Sign Language Generative Model
Sen Fang,Yalin Feng,Hongbin Zhong,Yanxin Zhang,Dimitris N. Metaxas
Main category: cs.CV
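TL;DR: The paper proposes Stable Signer, a hierarchical sign language generative model that simplifies and optimizes the task objective and redefines sign language production as an end-to-end hierarchical task, improving generation quality.
Details
Motivation: Progress in sign language production has been slow, mainly because errors accumulate across text conversion, pose generation, and the rendering of poses into real videos.
Contribution: (1) Stable Signer, a new sign language generative model with a simplified task structure; (2) the Sign Language Understanding Linker (SLUL) and the SLP-MoE hand gesture rendering expert block; (3) the Semantic-Aware Gloss Masking Loss (SAGM Loss).
Method: (1) The sign language production task is redefined as an end-to-end hierarchical generation task comprising only text understanding (Prompt2Gloss, Text2Gloss) and Pose2Vid; (2) SLUL performs text understanding; (3) the SLP-MoE block generates hand gestures.
Result: Performance improves by 48.6% over current state-of-the-art generation methods, and the model generates high-quality, multi-style sign language videos.
Insight: Simplifying and optimizing the task structure directly reduces error accumulation and improves generation quality; the end-to-end hierarchical design offers a new approach to sign language production.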
Abstract: Sign Language Production (SLP) is the process of converting the complex input text into a real video. Most previous works focused on the Text2Gloss, Gloss2Pose, Pose2Vid stages, and some concentrated on Prompt2Gloss and Text2Avatar stages. However, this field has made slow progress due to the inaccuracy of text conversion, pose generation, and the rendering of poses into real human videos in these stages, resulting in gradually accumulating errors. Therefore, in this paper, we streamline the traditional redundant structure, simplify and optimize the task objective, and design a new sign language generative model called Stable Signer. It redefines the SLP task as a hierarchical generation end-to-end task that only includes text understanding (Prompt2Gloss, Text2Gloss) and Pose2Vid, and executes text understanding through our proposed new Sign Language Understanding Linker called SLUL, and generates hand gestures through the named SLP-MoE hand gesture rendering expert block to end-to-end generate high-quality and multi-style sign language videos. SLUL is trained using the newly developed Semantic-Aware Gloss Masking Loss (SAGM Loss). Its performance has improved by 48.6% compared to the current SOTA generation methods.
[65] UniComp: Rethinking Video Compression Through Informational Uniqueness
Chao Yuan,Shimin Chen,Minliang Lin,Limeng Qiao,Guanglu Wan,Lin Ma
Main category: cs.CV
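TL;DR: UniComp is a video compression framework based on information uniqueness. It optimizes the information fidelity of video representations by minimizing conditional entropy (reconstruction error) and uses three modules that progressively perform semantic frame grouping, adaptive resource allocation, and fine-grained spatial compression. Experiments show that it outperforms existing compression methods.
Details
Motivation: Existing attention-based video compression methods do not fully exploit the intrinsic information redundancy in video. UniComp takes an information-theoretic view and introduces the notion of information uniqueness to address this.
Contribution: (1) The notion of information uniqueness, which measures intrinsic redundancy among tokens; (2) three modules (Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression) that progressively refine the compression process; (3) better results than existing compression methods under limited computational budgets.
Method: (1) Compression is formulated from an information-theoretic perspective as minimizing conditional entropy; (2) information uniqueness quantifies token redundancy; (3) three modules progressively perform frame grouping, resource allocation, and spatial compression.
Result: Under limited computational budgets, UniComp consistently outperforms existing compression methods and better preserves essential visual information.
Insight: Information uniqueness is an effective measure of redundancy and a useful guide for token allocation in video compression.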
Abstract: Distinct from attention-based compression methods, this paper presents an information uniqueness driven video compression framework, termed UniComp, which aims to maximize the information fidelity of video representations under constrained computational budgets. Starting from the information-theoretic perspective, we formulate the vision compression as an optimization problem that minimizes conditional entropy (reconstruction error) between retained and full tokens. To achieve this, we introduce the notion of information uniqueness to measure intrinsic redundancy among tokens to link with reconstruction error. Based on uniqueness, we design three modules (Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression) that progressively perform semantic frame grouping, adaptive resource allocation, and fine-grained spatial compression. Extensive experiments demonstrate that UniComp consistently outperforms existing compression methods in preserving essential visual tokens under limited computational budgets, highlighting the pivotal role of information uniqueness in token compression efficacy.
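To make the idea of uniqueness-driven token retention in the abstract above more concrete, here is a minimal sketch. The abstract does not give UniComp's definition of information uniqueness, so this example assumes a stand-in: a token's uniqueness is one minus its highest cosine similarity to any other token. The function names and the top-k budget rule are hypothetical and are not taken from the paper.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors (0.0 if either is all zeros).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def uniqueness_scores(tokens):
    # ASSUMPTION: uniqueness = 1 - max similarity to any other token.
    # This is an illustrative proxy, not the paper's definition.
    scores = []
    for i, token in enumerate(tokens):
        best = max(
            (cosine(token, other) for j, other in enumerate(tokens) if j != i),
            default=0.0,
        )
        scores.append(1.0 - best)
    return scores

def keep_most_unique(tokens, budget):
    # Return the indices (in original order) of the `budget` most unique tokens.
    scores = uniqueness_scores(tokens)
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:budget])

tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
kept = keep_most_unique(tokens, budget=1)
```

One limitation of this proxy is that two near-duplicate tokens both receive low scores, so both may be dropped; a merging or greedy selection step would be needed to keep one representative.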
[66] Dynamic Optical Test for Bot Identification (DOT-BI): A simple check to identify bots in surveys and online processes
Malte Bleeker,Mauro Gotsch
Main category: cs.CV
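TL;DR: The paper proposes DOT-BI, a method that uses human motion perception to quickly distinguish human respondents from bots in surveys and online processes. A hidden number is displayed dynamically against a background with the same random black-and-white pixel texture, so that only humans perceive the number, while frame-by-frame algorithmic processing yields no meaningful signal. Preliminary tests show that current video-capable multimodal models fail the task, whereas human participants perform very well.
Details
Motivation: Interference from bots and automated systems in online surveys and processes is growing, so a simple and effective way to tell humans from bots is needed to keep data authentic and reliable. Contribution: Proposes DOT-BI, a dynamic optical test based on human motion perception that identifies bots efficiently; validates that the method is friendly to human users and resistant to bots; releases the test-generation code and pre-rendered variants.
Method: DOT-BI hides a moving number in a background of random black-and-white pixel texture; humans perceive it through differences in motion and scale between the number and the background. Frame-by-frame algorithmic processing yields no meaningful signal.
Result: State-of-the-art video-capable multimodal models (GPT-5-Thinking and Gemini 2.5 Pro) failed to extract the correct number. In an online survey, 99.5% (181/182) of human participants solved the task, with an average completion time of 10.7 seconds, and a supervised lab study (n=39) found no negative effect on perceived ease of use relative to a control.
Insight: DOT-BI shows the potential of using properties of human perception against bots, offering a novel and practical approach to online human verification.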
Abstract: We propose the Dynamic Optical Test for Bot Identification (DOT-BI): a quick and easy method that uses human perception of motion to differentiate between human respondents and automated systems in surveys and online processes. In DOT-BI, a ‘hidden’ number is displayed with the same random black-and-white pixel texture as its background. Only the difference in motion and scale between the number and the background makes the number perceptible to humans across frames, while frame-by-frame algorithmic processing yields no meaningful signal. We conducted two preliminary assessments. Firstly, state-of-the-art, video-capable, multimodal models (GPT-5-Thinking and Gemini 2.5 Pro) fail to extract the correct value, even when given explicit instructions about the mechanism. Secondly, in an online survey (n=182), 99.5% (181/182) of participants solved the task, with an average end-to-end completion time of 10.7 seconds; a supervised lab study (n=39) found no negative effects on perceived ease-of-use or completion time relative to a control. We release code to generate tests and 100+ pre-rendered variants to facilitate adoption in surveys and online processes.
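The mechanism in the abstract above can be sketched as follows. This is not the authors' released generator: it is a simplified illustration that uses only a motion difference (no scale difference), a rectangular mask as a hypothetical stand-in for a digit glyph, and made-up function names. Inside the mask a random texture drifts right, outside it a second random texture drifts left, so any single frame is plain binary noise.

```python
import random

def make_texture(height, width, rng):
    # Random black-and-white (0/1) pixel texture.
    return [[rng.randint(0, 1) for _ in range(width)] for _ in range(height)]

def render_frame(mask, background, foreground, t):
    # Frame at time step t: pixels inside the mask sample the foreground
    # texture shifted right by t, pixels outside sample the background
    # texture shifted left by t (both wrap around horizontally).
    height, width = len(mask), len(mask[0])
    frame = []
    for y in range(height):
        row = []
        for x in range(width):
            if mask[y][x]:
                row.append(foreground[y][(x - t) % width])
            else:
                row.append(background[y][(x + t) % width])
        frame.append(row)
    return frame

rng = random.Random(0)
# Hypothetical stand-in for a digit glyph: a filled rectangle.
mask = [[1 if 2 <= y < 6 and 3 <= x < 7 else 0 for x in range(12)] for y in range(8)]
background = make_texture(8, 12, rng)
foreground = make_texture(8, 12, rng)
frames = [render_frame(mask, background, foreground, t) for t in range(5)]
```

Because both textures are drawn from the same distribution, the mask is not visible in any individual frame; it only emerges from the opposing motion across consecutive frames.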
[67] Beyond Boundary Frames: Audio-Visual Semantic Guidance for Context-Aware Video Interpolation
Yuchen Deng,Xiuyang Wu,Hai-Tao Zheng,Jie Wang,Feidiao Yang,Yuxing Han
Main category: cs.CV
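TL;DR: BBF proposes a context-aware video frame interpolation framework guided by audio-visual semantics. Through multimodal conditional inputs and a decoupled fusion mechanism, it improves interpolation quality and outperforms existing methods, particularly on audio-visual synchronized interpolation.
Details
Motivation: Existing video interpolation methods (optical-flow-based or diffusion-based) struggle with fast, complex, and non-linear motion, and perform poorly on fine-grained tasks such as audio-visual synchronized interpolation. BBF aims to address these problems through multimodal semantic guidance. Contribution: 1. Designs an interpolation model that supports multimodal conditional inputs;
2. Proposes a decoupled multimodal fusion mechanism that injects conditional signals sequentially;
3. Adopts a progressive multi-stage training strategy that dynamically adjusts data sampling and loss weighting.
Method: 1. An input design for multiple conditional modalities (text, audio, images, video);
2. A decoupled multimodal fusion mechanism built on a DiT backbone;
3. A progressive training strategy that uses the start-end frame difference embedding to adjust training dynamically.
Result: BBF outperforms existing methods on both generic video interpolation and audio-visual synchronized interpolation, supporting the effectiveness of its unified framework.
Insight: Multimodal semantic guidance and decoupled fusion can markedly improve the context awareness of frame interpolation, especially in scenes with complex motion.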
Abstract: Handling fast, complex, and highly non-linear motion patterns has long posed challenges for video frame interpolation. Although recent diffusion-based approaches improve upon traditional optical-flow-based methods, they still struggle to cover diverse application scenarios and often fail to produce sharp, temporally consistent frames in fine-grained motion tasks such as audio-visual synchronized interpolation. To address these limitations, we introduce BBF (Beyond Boundary Frames), a context-aware video frame interpolation framework, which could be guided by audio/visual semantics. First, we enhance the input design of the interpolation model so that it can flexibly handle multiple conditional modalities, including text, audio, images, and video. Second, we propose a decoupled multimodal fusion mechanism that sequentially injects different conditional signals into a DiT backbone. Finally, to maintain the generation abilities of the foundation model, we adopt a progressive multi-stage training paradigm, where the start-end frame difference embedding is used to dynamically adjust both the data sampling and the loss weighting. Extensive experimental results demonstrate that BBF outperforms specialized state-of-the-art methods on both generic interpolation and audio-visual synchronized interpolation tasks, establishing a unified framework for video frame interpolation under coordinated multi-channel conditioning.
[68] CloseUpAvatar: High-Fidelity Animatable Full-Body Avatars with Mixture of Multi-Scale Textures
David Svitov,Pietro Morerio,Lourdes Agapito,Alessio Del Bue
Main category: cs.CV
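TL;DR: CloseUpAvatar proposes a new representation for animatable full-body avatars. By mixing multi-scale textures, it addresses rendering quality under wide-ranging camera motion while preserving high-quality rendering for close-up views.
Details
Motivation: Existing methods struggle to maintain high rendering quality at both close and far viewpoints when the camera moves over a wide range, so a solution is needed that adjusts rendering quality dynamically based on camera distance. Contribution: 1) Proposes an avatar representation that mixes multi-scale textures; 2) Switches automatically between low- and high-frequency textures based on camera distance, adjusting rendering quality dynamically; 3) Maintains high rendering performance (high FPS) across a wide range of camera viewpoints.
Method: The core of the method is two sets of learnable textures (low-frequency and high-frequency) whose contribution is adjusted according to camera distance. The avatar is represented as a set of textured planes; high-frequency textures are enabled when the camera is close and their influence is gradually reduced as it moves away.
Result: Experiments on the ActorsHQ dataset show that CloseUpAvatar outperforms existing methods across a wide range of camera viewpoints while maintaining high rendering performance (high FPS).
Insight: Dynamically adjusting texture detail is key to high-quality rendering, especially under wide-ranging camera motion. A mixed multi-scale texture representation can balance rendering quality and computational efficiency.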
Abstract: We present CloseUpAvatar, a novel approach for articulated human avatar representation that handles more general camera motions while preserving rendering quality for close-up views. CloseUpAvatar represents an avatar as a set of textured planes with two sets of learnable textures for low and high-frequency detail. The method automatically switches to high-frequency textures only for cameras positioned close to the avatar’s surface and gradually reduces their impact as the camera moves farther away. Such parametrization of the avatar enables CloseUpAvatar to adjust rendering quality based on camera distance, ensuring realistic rendering across a wider range of camera orientations than previous approaches. We provide experiments using the ActorsHQ dataset with high-resolution input images. CloseUpAvatar demonstrates both qualitative and quantitative improvements over existing methods in rendering from a wide range of novel camera positions, while maintaining high FPS by limiting the number of required primitives.
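The distance-dependent use of high-frequency textures described above can be illustrated with a small sketch. The abstract does not state the actual weighting function, so the smoothstep falloff, the `near` and `far` thresholds, and the additive blend below are all assumptions chosen for illustration.

```python
def high_freq_weight(distance, near=0.5, far=2.0):
    # ASSUMPTION: full high-frequency detail at or below `near`, none at or
    # beyond `far`, with a smoothstep falloff in between. The paper's actual
    # schedule is not given in the abstract.
    if distance <= near:
        return 1.0
    if distance >= far:
        return 0.0
    t = (far - distance) / (far - near)
    return t * t * (3.0 - 2.0 * t)

def blend_texel(low, high, distance, near=0.5, far=2.0):
    # Low-frequency value plus a distance-weighted high-frequency residual.
    return low + high_freq_weight(distance, near, far) * high

close_up = blend_texel(low=0.4, high=0.2, distance=0.1)  # detail fully applied
far_away = blend_texel(low=0.4, high=0.2, distance=3.0)  # low-frequency only
```

A smooth falloff, rather than a hard switch, avoids a visible pop in appearance as the camera crosses the threshold distance.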
[69] HBFormer: A Hybrid-Bridge Transformer for Microtumor and Miniature Organ Segmentation
Fuchen Zheng,Xinyi Chen,Weixuan Li,Quanjun Li,Junhua Zhou,Xiaojiao Guo,Xuhang Chen,Chi-Man Pun,Shoujun Zhou
Main category: cs.CV
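TL;DR: HBFormer is a Hybrid-Bridge Transformer architecture that combines a U-shaped encoder-decoder framework with a Swin Transformer backbone and introduces a novel Multi-Scale Feature Fusion (MFF) decoder to address the insufficient fusion of local detail and global context in medical image segmentation.
Details
Motivation: Vision Transformers based on window self-attention perform well in medical image segmentation, but their localized attention mechanism struggles to fuse local details with global context, which is especially limiting for microtumor and miniature organ segmentation. Contribution: HBFormer's novelty lies in its hybrid-bridge design, which combines a U-shaped framework with a Swin Transformer and introduces the MFF decoder; the decoder fuses multi-scale features through channel and spatial attention modules, markedly improving segmentation accuracy.
Method: HBFormer extracts features with a Swin Transformer backbone and fuses multi-scale features in the MFF decoder. The MFF decoder builds channel and spatial attention modules from a series of dilated and depth-wise convolutions, explicitly modeling long-range dependencies and refining boundaries.
Result: HBFormer achieves state-of-the-art results on multi-organ, liver tumor, and bladder tumor segmentation benchmarks, demonstrating strong capability in microtumor and miniature organ segmentation.
Insight: HBFormer's results suggest that feature fusion mechanisms combining local detail with global context hold great potential for medical image segmentation, especially for fine structures.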
Abstract: Medical image segmentation is a cornerstone of modern clinical diagnostics. While Vision Transformers that leverage shifted window-based self-attention have established new benchmarks in this field, they are often hampered by a critical limitation: their localized attention mechanism struggles to effectively fuse local details with global context. This deficiency is particularly detrimental to challenging tasks such as the segmentation of microtumors and miniature organs, where both fine-grained boundary definition and broad contextual understanding are paramount. To address this gap, we propose HBFormer, a novel Hybrid-Bridge Transformer architecture. The ‘Hybrid’ design of HBFormer synergizes a classic U-shaped encoder-decoder framework with a powerful Swin Transformer backbone for robust hierarchical feature extraction. The core innovation lies in its ‘Bridge’ mechanism, a sophisticated nexus for multi-scale feature integration. This bridge is architecturally embodied by our novel Multi-Scale Feature Fusion (MFF) decoder. Departing from conventional symmetric designs, the MFF decoder is engineered to fuse multi-scale features from the encoder with global contextual information. It achieves this through a synergistic combination of channel and spatial attention modules, which are constructed from a series of dilated and depth-wise convolutions. These components work in concert to create a powerful feature bridge that explicitly captures long-range dependencies and refines object boundaries with exceptional precision. Comprehensive experiments on challenging medical image segmentation datasets, including multi-organ, liver tumor, and bladder tumor benchmarks, demonstrate that HBFormer achieves state-of-the-art results, showcasing its outstanding capabilities in microtumor and miniature organ segmentation. Code and models are available at: https://github.com/lzeeorno/HBFormer.
[70] Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding
Haoran Zhou,Gim Hee Lee
Main category: cs.CV
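TL;DR: Motion4D is a new framework that addresses 3D consistency in dynamic scene understanding by integrating 2D priors into a 4D Gaussian Splatting representation. It uses a two-part optimization scheme and a 3D confidence map, and significantly improves performance.
Details
Motivation: Existing 2D foundation models lack 3D consistency when handling dynamic scenes, which causes spatial misalignment and temporal flickering. Motion4D aims to solve these problems. Contribution: Proposes the Motion4D framework, which combines 2D priors with a 4D Gaussian Splatting representation and introduces an iterative optimization method and a 3D confidence map, improving the accuracy and consistency of dynamic scene understanding.
Method: Uses a two-part framework of sequential optimization and global optimization, introduces a 3D confidence map and adaptive resampling, and iteratively refines the semantic fields and the SAM2 prompts.
Result: Motion4D significantly outperforms existing methods on tasks including point tracking, video object segmentation, and novel view synthesis.
Insight: By dynamically adjusting 3D confidence and optimizing for semantic consistency, Motion4D shows the potential for effective dynamic scene understanding in complex 3D environments.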
Abstract: Recent advancements in foundation models for 2D vision have substantially improved the analysis of dynamic scenes from monocular videos. However, despite their strong generalization capabilities, these models often lack 3D consistency, a fundamental requirement for understanding scene geometry and motion, thereby causing severe spatial misalignment and temporal flickering in complex 3D environments. In this paper, we present Motion4D, a novel framework that addresses these challenges by integrating 2D priors from foundation models into a unified 4D Gaussian Splatting representation. Our method features a two-part iterative optimization framework: 1) Sequential optimization, which updates motion and semantic fields in consecutive stages to maintain local consistency, and 2) Global optimization, which jointly refines all attributes for long-term coherence. To enhance motion accuracy, we introduce a 3D confidence map that dynamically adjusts the motion priors, and an adaptive resampling process that inserts new Gaussians into under-represented regions based on per-pixel RGB and semantic errors. Furthermore, we enhance semantic coherence through an iterative refinement process that resolves semantic inconsistencies by alternately optimizing the semantic fields and updating prompts of SAM2. Extensive evaluations demonstrate that our Motion4D significantly outperforms both 2D foundation models and existing 3D-based approaches across diverse scene understanding tasks, including point-based tracking, video object segmentation, and novel view synthesis. Our code is available at https://hrzhou2.github.io/motion4d-web/.
[71] LAMP: Language-Assisted Motion Planning for Controllable Video Generation
Muhammed Burak Kizil,Enes Sanli,Niloy J. Mitra,Erkut Erdem,Aykut Erdem,Duygu Ceylan
Main category: cs.CV
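TL;DR: LAMP uses a large language model (LLM) as a motion planner that translates natural language descriptions into 3D trajectories for dynamic objects and cameras, improving motion controllability and alignment with user intent in video generation.
Details
Motivation: Existing motion control interfaces for video generation are limited and cannot meet the needs of complex cinematic scenes. LAMP aims to control object and camera motion directly through natural language. Contribution: 1. Proposes the LAMP framework, the first to generate 3D trajectories for both object and camera motion directly from natural language. 2. Designs a motion domain-specific language (DSL) and uses the program synthesis capabilities of LLMs to generate structured motion programs. 3. Constructs a large-scale procedural dataset pairing natural language descriptions with motion programs and 3D trajectories.
Method: 1. Defines a motion DSL inspired by cinematography conventions. 2. Uses an LLM to convert natural language into structured motion programs. 3. Deterministically maps motion programs to 3D trajectories.
Result: Experiments show that LAMP outperforms existing methods in motion controllability and alignment with user intent.
Insight: 1. LLMs can be used effectively for motion planning in video generation. 2. A programmatic approach (a DSL) offers a structured solution for complex motion control.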
Abstract: Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control - specifying object dynamics and camera trajectories - is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP that leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL), inspired by cinematography conventions. By harnessing program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate LAMP’s improved performance in motion controllability and alignment with user intent compared to state-of-the-art alternatives establishing the first framework for generating both object and camera motions directly from natural language specifications.
[72] ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation
Yaokun Li,Shuaixian Wang,Mantang Guo,Jiehui Huang,Taojun Ding,Mu Hu,Kaixuan Wang,Shaojie Shen,Guang Tan
Main category: cs.CV
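TL;DR: ReCamDriving proposes a purely vision-based, camera-controlled novel-trajectory video generation framework that achieves precise camera control through two-stage training and 3DGS renderings.
Details
Motivation: Existing repair-based methods fail to restore complex artifacts, while LiDAR-based approaches rely on sparse and incomplete cues. ReCamDriving uses dense, scene-complete 3DGS renderings as explicit geometric guidance to improve generation quality. Contribution: 1. Proposes a two-stage training paradigm that incorporates 3DGS renderings for fine-grained control; 2. Designs a 3DGS-based cross-trajectory data curation strategy that eliminates the train-test gap; 3. Constructs the ParaDrive dataset, containing over 110K parallel-trajectory video pairs.
Method: Two-stage training: the first stage uses camera poses for coarse control, and the second stage incorporates 3DGS renderings for fine-grained viewpoint and geometric guidance.
Result: Experiments show that ReCamDriving achieves state-of-the-art camera controllability and structural consistency.
Insight: The dense geometric information in 3DGS renderings markedly improves the precision and controllability of video generation; the data curation strategy helps eliminate the train-test gap in camera transformation patterns.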
Abstract: We propose ReCamDriving, a purely vision-based, camera-controlled novel-trajectory video generation framework. While repair-based methods fail to restore complex artifacts and LiDAR-based approaches rely on sparse and incomplete cues, ReCamDriving leverages dense and scene-complete 3DGS renderings for explicit geometric guidance, achieving precise camera-controllable generation. To mitigate overfitting to restoration behaviors when conditioned on 3DGS renderings, ReCamDriving adopts a two-stage training paradigm: the first stage uses camera poses for coarse control, while the second stage incorporates 3DGS renderings for fine-grained viewpoint and geometric guidance. Furthermore, we present a 3DGS-based cross-trajectory data curation strategy to eliminate the train-test gap in camera transformation patterns, enabling scalable multi-trajectory supervision from monocular videos. Based on this strategy, we construct the ParaDrive dataset, containing over 110K parallel-trajectory video pairs. Extensive experiments demonstrate that ReCamDriving achieves state-of-the-art camera controllability and structural consistency.
[73] MKSNet: Advanced Small Object Detection in Remote Sensing Imagery with Multi-Kernel and Dual Attention Mechanisms
Jiahao Zhang,Xiao Zhao,Guangyu Gao
Main category: cs.CV
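TL;DR: MKSNet proposes a novel network with multi-kernel selection and dual attention mechanisms to address the challenges of small object detection in remote sensing imagery, significantly improving detection performance.
Details
Motivation: The high resolution of remote sensing images and the small size of targets cause conventional CNNs to lose critical information in deeper layers, and complex background detail interferes with small object detection. Contribution: 1. Designs the Multi-Kernel Selection (MKS) mechanism, which adaptively selects convolutional kernel size to capture contextual information; 2. Introduces a dual attention mechanism (spatial and channel attention) to refine feature representation; 3. Substantially outperforms existing methods on the DOTA-v1.0 and HRSC2016 benchmarks.
Method: MKSNet combines multi-kernel selection with dual attention. The MKS mechanism uses large convolutional kernels to capture contextual information and adaptively selects kernel size; the spatial attention module adjusts the spatial weights of feature maps, and the channel attention module optimizes channel information selection.
Result: Experiments on the DOTA-v1.0 and HRSC2016 datasets show that MKSNet substantially outperforms existing methods on small object detection.
Insight: MKSNet's novelty lies in adapting dynamically to targets of different scales and to complex backgrounds, using multi-kernel selection and attention to improve small object detection.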
Abstract: Deep convolutional neural networks (DCNNs) have substantially advanced object detection capabilities, particularly in remote sensing imagery. However, challenges persist, especially in detecting small objects where the high resolution of these images and the small size of target objects often result in a loss of critical information in the deeper layers of conventional CNNs. Additionally, the extensive spatial redundancy and intricate background details typical in remote-sensing images tend to obscure these small targets. To address these challenges, we introduce Multi-Kernel Selection Network (MKSNet), a novel network architecture featuring a novel Multi-Kernel Selection mechanism. The MKS mechanism utilizes large convolutional kernels to effectively capture an extensive range of contextual information. This innovative design allows for adaptive kernel size selection, significantly enhancing the network’s ability to dynamically process and emphasize crucial spatial details for small object detection. Furthermore, MKSNet also incorporates a dual attention mechanism, merging spatial and channel attention modules. The spatial attention module adaptively fine-tunes the spatial weights of feature maps, focusing more intensively on relevant regions while mitigating background noise. Simultaneously, the channel attention module optimizes channel information selection, improving feature representation and detection accuracy. Empirical evaluations on the DOTA-v1.0 and HRSC2016 benchmark demonstrate that MKSNet substantially surpasses existing state-of-the-art models in detecting small objects in remote sensing images. These results highlight MKSNet’s superior ability to manage the complexities associated with multi-scale and high-resolution image data, confirming its effectiveness and innovation in remote sensing object detection.
[74] Multi-Scale Visual Prompting for Lightweight Small-Image Classification
Salim Khazem
Main category: cs.CV
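TL;DR: The paper proposes Multi-Scale Visual Prompting (MSVP), a lightweight method for small-image classification that significantly improves performance by fusing global, mid-scale, and local prompt maps.
Details
Motivation: Existing visual prompting methods mainly target large Vision Transformers and high-resolution datasets such as ImageNet, while small-image benchmarks such as MNIST, Fashion-MNIST, and CIFAR-10 are widely used in education and research yet rarely studied in this context. The paper aims to fill this gap. Contribution: Proposes Multi-Scale Visual Prompting (MSVP), a lightweight and generic method that fuses multi-scale prompt maps through a 1×1 convolution, applies to different backbones (CNN and ViT), and adds very few parameters (<0.02%).
Method: MSVP learns global, mid-scale, and local prompt maps and fuses them with the input image through a lightweight 1×1 convolution. The method is benchmarked in a unified setup across several backbones.
Result: Experiments show that MSVP significantly improves performance on the MNIST, Fashion-MNIST, and CIFAR-10 benchmarks with negligible computational overhead.
Insight: Multi-scale prompting provides an effective inductive bias for low-resolution images and can improve model performance even on small-image tasks.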
Abstract: Visual prompting has recently emerged as an efficient strategy to adapt vision models using lightweight, learnable parameters injected into the input space. However, prior work mainly targets large Vision Transformers and high-resolution datasets such as ImageNet. In contrast, small-image benchmarks like MNIST, Fashion-MNIST, and CIFAR-10 remain widely used in education, prototyping, and research, yet have received little attention in the context of prompting. In this paper, we introduce Multi-Scale Visual Prompting (MSVP), a simple and generic module that learns a set of global, mid-scale, and local prompt maps fused with the input image via a lightweight 1×1 convolution. MSVP is backbone-agnostic, adds less than 0.02% parameters, and significantly improves performance across CNN and Vision Transformer backbones. We provide a unified benchmark on MNIST, Fashion-MNIST, and CIFAR-10 using a simple CNN, ResNet-18, and a small Vision Transformer. Our method yields consistent improvements with negligible computational overhead. We further include ablations on prompt scales, fusion strategies, and backbone architectures, along with qualitative analyses using prompt visualizations and Grad-CAM. Our results demonstrate that multi-scale prompting provides an effective inductive bias even on low-resolution images.
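The fusion step described in the abstract above can be sketched in plain Python. This is a simplified illustration, not the authors' module: it assumes a single-channel image, prompt grids of 1×1 (global), 2×2 (mid-scale), and full resolution (local), nearest-neighbour upsampling, and a 1×1 convolution with one output channel, which reduces to a per-pixel weighted sum over the stacked channels. All names and sizes are hypothetical.

```python
def upsample_nearest(grid, height, width):
    # Nearest-neighbour upsampling of a small prompt grid to height x width.
    gh, gw = len(grid), len(grid[0])
    return [[grid[y * gh // height][x * gw // width] for x in range(width)]
            for y in range(height)]

def msvp_fuse(image, prompts, weights, bias=0.0):
    # Stack the image with the upsampled prompt maps, then apply a 1x1
    # convolution with a single output channel: a per-pixel weighted sum
    # across channels. In a real module the prompts, weights, and bias
    # would be learnable parameters.
    height, width = len(image), len(image[0])
    channels = [image] + [upsample_nearest(p, height, width) for p in prompts]
    return [[bias + sum(w * ch[y][x] for w, ch in zip(weights, channels))
             for x in range(width)]
            for y in range(height)]

image = [[1.0] * 4 for _ in range(4)]
global_prompt = [[0.5]]                      # one value for the whole image
mid_prompt = [[1.0, 2.0], [3.0, 4.0]]        # one value per quadrant
local_prompt = [[0.0] * 4 for _ in range(4)]  # one value per pixel
fused = msvp_fuse(image, [global_prompt, mid_prompt, local_prompt],
                  weights=[1.0, 1.0, 0.1, 1.0])
```

Setting the weights to `[1.0, 0.0, 0.0, 0.0]` returns the input image unchanged, which is one plausible way to initialize such a module so that it starts as an identity mapping.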
[75] ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos
Qi’ao Xu,Tianwen Qian,Yuqian Fu,Kailing Li,Yang Jiao,Jiacheng Zhang,Xiaoling Wang,Liang He
Main category: cs.CV
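TL;DR: ToG-Bench is the first task-oriented spatio-temporal video grounding benchmark. It focuses on localizing task-relevant objects from an egocentric perspective, emphasizing task-oriented reasoning and explicit-implicit dual grounding.
Details
Motivation: Existing spatio-temporal video grounding research is largely confined to object-centric, descriptive instructions and neglects the task-oriented reasoning needed for goal-directed interaction. Contribution: Proposes the ToG-Bench benchmark, the first to combine task-oriented grounding, explicit-implicit dual grounding, and one-to-many grounding, and designs dedicated evaluation metrics.
Method: Built on videos from ScanNet, the benchmark comprises 100 annotated clips with 2,704 task-oriented grounding instructions, constructed via a semi-automated pipeline that combines foundation model annotation with human refinement.
Result: Experiments reveal the intrinsic challenges of task-oriented STVG and substantial performance gaps across explicit-implicit and multi-object grounding.
Insight: Task-oriented grounding requires stronger contextual reasoning, and grounding difficulty differs markedly between explicit and implicit objects.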
Abstract: A core capability towards general embodied intelligence lies in localizing task-relevant objects from an egocentric perspective, formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain largely confined to object-centric and descriptive instructions, neglecting the task-oriented reasoning that is crucial for embodied agents to accomplish goal-directed interactions. To bridge this gap, we introduce ToG-Bench, the first task-oriented spatio-temporal video grounding benchmark for egocentric videos. ToG-Bench is characterized by three key features: (1) Task-oriented Grounding, which requires identifying and localizing objects based on intended tasks rather than straightforward descriptions; (2) Explicit-Implicit Dual Grounding, where target objects can be either explicitly mentioned or implicitly inferred by contextual reasoning; (3) One-to-Many Grounding, where a single instruction may correspond to multiple objects involved in task execution. Built upon videos sourced from ScanNet, ToG-Bench comprises 100 annotated clips with 2,704 task-oriented grounding instructions, constructed via a semi-automated pipeline that combines foundation model annotation and human refinement. In addition, we introduce a set of task-level evaluation metrics tailored for multi-object and explicit-implicit object grounding, and systematically benchmark seven state-of-the-art MLLMs. Extensive experiments reveal the intrinsic challenges of task-oriented STVG and substantial performance gaps across explicit-implicit and multi-object grounding, highlighting the difficulty of bridging perception and interaction in embodied scenarios. Data and code will be released at: https://github.com/qaxuDev/ToG-Bench.
[76] Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning
Ge-Peng Ji,Jingyi Liu,Deng-Ping Fan,Nick Barnes
Main category: cs.CV
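TL;DR: Colon-X is an open initiative for advancing multimodal intelligence in colonoscopy. It constructs the ColonVQA dataset (over 1.1M visual question answering entries) and the ColonReason dataset, and develops ColonR1, the first R1-styled model, moving from multimodal understanding to clinical reasoning.
Details
Motivation: Outputs of multimodal large language models in colonoscopy remain insufficiently robust and trustworthy for clinical use, and their ability to progress from multimodal understanding to clinical reasoning needs to improve. Contribution: 1) Constructs ColonVQA, the most comprehensive multimodal colonoscopy dataset to date; 2) Proposes the ColonReason dataset and the ColonR1 model, addressing the gap in clinical reasoning.
Method: 1) Systematically assesses the generalizability and reliability of 22 multimodal large language models; 2) Proposes the ColonR1 model, which incorporates task-adaptive rewarding and gradient-stable optimization.
Result: Under data-scarce conditions, ColonR1 achieves 56.61% overall accuracy, outperforming supervised fine-tuning by 25.22%, and sets a new baseline for multimodal colonoscopy analysis.
Insight: Clinical reasoning is a key step toward intelligent colonoscopy, and task-adaptive rewarding and gradient-stable optimization can effectively improve a model's reasoning ability.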
Abstract: In this study, we present Colon-X, an open initiative aimed at advancing multimodal intelligence in colonoscopy. We begin by constructing ColonVQA, the most comprehensive multimodal dataset ever built for colonoscopy, featuring over 1.1M visual question answering entries across 76 clinical findings and 18 multimodal tasks. Beyond serving as a community-wide data foundation, we further investigate a critical yet underexplored transition in colonoscopy - evolving from multimodal understanding to clinical reasoning: (a) To capture the current landscape of multimodal understanding behaviors, we systematically assess the generalizability of 22 multimodal large language models and examine their reliability under human-induced perturbations. The results reveal that clinical outputs from leading MLLMs remain far from robust and trustworthy. (b) To narrow this gap, we further explore reasoning-centric intelligence tailored for colonoscopy. Specifically, we curate ColonReason, a clinically grounded reasoning dataset annotated through a multi-expert debating pipeline, and develop ColonR1, the first R1-styled model incorporating task-adaptive rewarding and gradient-stable optimization techniques. Under data-scarce conditions, our ColonR1 achieves 56.61% overall accuracy, outperforming supervised fine-tuning by 25.22%, and sets a new reasoning-enabled baseline for multimodal colonoscopy analysis. All data and model resources are publicly available at https://github.com/ai4colonoscopy/Colon-X.
[77] GaussianBlender: Instant Stylization of 3D Gaussians with Disentangled Latent Spaces
Melis Ocal,Xiaoyan Xing,Yue Li,Ngo Anh Vien,Sezer Karaoglu,Theo Gevers
Main category: cs.CV
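TL;DR: GaussianBlender proposes a feed-forward framework for instant 3D stylization. It performs efficient text-driven edits through disentangled latent spaces, avoiding the multi-view inconsistency and time-consuming optimization of conventional methods.
Details
Motivation: 3D stylization is in high demand in game development, virtual reality, and digital arts, but existing text-based methods typically rely on time-intensive per-asset optimization and suffer from multi-view inconsistency, which limits their use at scale. Contribution: GaussianBlender is a pioneering feed-forward framework for text-driven 3D stylization that performs edits instantly at inference while preserving geometry and multi-view consistency through disentangled latent spaces.
Method: Learns disentangled latent spaces for geometry and appearance from spatially-grouped 3D Gaussians, then applies text-conditioned edits with a latent diffusion model.
Result: Experiments show that GaussianBlender delivers instant, high-fidelity stylization and surpasses methods that require per-instance test-time optimization, offering a practical solution for 3D stylization at scale.
Insight: Disentangled latent spaces are key to efficient, multi-view consistent 3D editing, and a feed-forward framework can markedly improve the speed and practicality of stylization.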
Abstract: 3D stylization is central to game development, virtual reality, and digital arts, where the demand for diverse assets calls for scalable methods that support fast, high-fidelity manipulation. Existing text-to-3D stylization methods typically distill from 2D image editors, requiring time-intensive per-asset optimization and exhibiting multi-view inconsistency due to the limitations of current text-to-image models, which makes them impractical for large-scale production. In this paper, we introduce GaussianBlender, a pioneering feed-forward framework for text-driven 3D stylization that performs edits instantly at inference. Our method learns structured, disentangled latent spaces with controlled information sharing for geometry and appearance from spatially-grouped 3D Gaussians. A latent diffusion model then applies text-conditioned edits on these learned representations. Comprehensive evaluations show that GaussianBlender not only delivers instant, high-fidelity, geometry-preserving, multi-view consistent stylization, but also surpasses methods that require per-instance test-time optimization - unlocking practical, democratized 3D stylization at scale.
[78] Active Visual Perception: Opportunities and Challenges
Yian Li,Xiaoyu Guo,Hao Zhang,Shuiwang Li,Xiaowei Dai
Main category: cs.CV
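TL;DR: This paper examines the opportunities and challenges of active visual perception, emphasizing its advantages over passive systems in dynamic environments while noting difficulties in real-time processing, decision-making, and multimodal integration.
Details
Motivation: Conventional passive vision systems struggle to acquire sufficient information in complex environments. Active visual perception can reach its goals more effectively through dynamic interaction, but its technical potential has not yet been fully realized. Contribution: The paper gives a comprehensive overview of the potential of active visual perception, current research progress, and the key challenges that remain, providing direction for future research.
Method: Analyzes the definition, application scenarios, and existing techniques of active visual perception, and summarizes its strengths and limitations.
Result: Finds that active visual perception has broad applications in areas such as robotics and autonomous driving, but that problems such as real-time processing and decision-making in dynamic environments must be solved.
Insight: The success of active visual perception depends on cross-disciplinary integration, including the joint development of computer vision, machine learning, and robotics.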
Abstract: Active visual perception refers to the ability of a system to dynamically engage with its environment through sensing and action, allowing it to modify its behavior in response to specific goals or uncertainties. Unlike passive systems that rely solely on visual data, active visual perception systems can direct attention, move sensors, or interact with objects to acquire more informative data. This approach is particularly powerful in complex environments where static sensing methods may not provide sufficient information. Active visual perception plays a critical role in numerous applications, including robotics, autonomous vehicles, human-computer interaction, and surveillance systems. However, despite its significant promise, there are several challenges that need to be addressed, including real-time processing of complex visual data, decision-making in dynamic environments, and integrating multimodal sensory inputs. This paper explores both the opportunities and challenges inherent in active visual perception, providing a comprehensive overview of its potential, current research, and the obstacles that must be overcome for broader adoption.
[79] Structured Uncertainty Similarity Score (SUSS): Learning a Probabilistic, Interpretable, Perceptual Metric Between Images
Paula Seidler,Neill D. F. Campbell,Ivor J A Simpson
Main category: cs.CV
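TL;DR: The paper proposes the Structured Uncertainty Similarity Score (SUSS), a perceptual metric based on a probabilistic generative model. It represents an image's perceptual components with structured multivariate Normal distributions and combines them with weights learned from human perceptual datasets, giving a transparent similarity assessment that aligns closely with human vision.
Details
Motivation: Existing perceptual similarity scores such as LPIPS rely on complex non-linear features and lack interpretability, while hand-crafted measures such as SSIM miss key perceptual properties. SUSS aims to fill this gap. Contribution: Proposes SUSS, which learns image-specific linear transformations and a structured probabilistic model to achieve high interpretability and alignment with human perception; also demonstrates stable optimization and competitive performance when used as a perceptual loss.
Method: SUSS models an image as perceptual components represented by structured multivariate Normal distributions, trains them in a generative, self-supervised manner to assign high likelihood to human-imperceptible augmentations, learns weights from human perceptual datasets, and computes similarity as a weighted sum of component log-probabilities.
Result: SUSS aligns closely with human perceptual judgments, supports localized, interpretable analysis, and shows stable optimization behavior and competitive performance on downstream imaging tasks.
Insight: SUSS combines the probabilistic modeling of generative models with priors from human perceptual data, providing a transparent and effective perceptual metric.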
Abstract: Perceptual similarity scores that align with human vision are critical for both training and evaluating computer vision models. Deep perceptual losses, such as LPIPS, achieve good alignment but rely on complex, highly non-linear discriminative features with unknown invariances, while hand-crafted measures like SSIM are interpretable but miss key perceptual properties. We introduce the Structured Uncertainty Similarity Score (SUSS); it models each image through a set of perceptual components, each represented by a structured multivariate Normal distribution. These are trained in a generative, self-supervised manner to assign high likelihood to human-imperceptible augmentations. The final score is a weighted sum of component log-probabilities with weights learned from human perceptual datasets. Unlike feature-based methods, SUSS learns image-specific linear transformations of residuals in pixel space, enabling transparent inspection through decorrelated residuals and sampling. SUSS aligns closely with human perceptual judgments, shows strong perceptual calibration across diverse distortion types, and provides localized, interpretable explanations of its similarity assessments. We further demonstrate stable optimization behavior and competitive performance when using SUSS as a perceptual loss for downstream imaging tasks.
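The final scoring step described above, a weighted sum of component log-probabilities, can be sketched as follows. This is only an illustration of that one step: it assumes diagonal covariances for simplicity, whereas the paper uses structured multivariate Normal distributions, and the residuals, variances, and weights below are made-up values rather than learned ones.

```python
import math

def diag_gaussian_logpdf(residual, variances):
    # Log-density of a zero-mean Normal with diagonal covariance.
    # ASSUMPTION: diagonal covariance is a simplification of the structured
    # covariance described in the abstract.
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + r * r / v)
        for r, v in zip(residual, variances)
    )

def suss_like_score(residuals, variances, weights):
    # Weighted sum of per-component log-probabilities. Each component has its
    # own residual vector, variance vector, and scalar weight.
    return sum(
        w * diag_gaussian_logpdf(r, v)
        for w, r, v in zip(weights, residuals, variances)
    )

# Two hypothetical components with illustrative residuals between two images.
residuals = [[0.1, -0.2], [0.05]]
variances = [[1.0, 0.5], [0.2]]
weights = [0.7, 0.3]
score = suss_like_score(residuals, variances, weights)
```

Larger residuals relative to the modeled variance lower the log-probability, so under this sketch a higher score corresponds to a difference the model treats as more likely to be imperceptible.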
[80] PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention
Ziwen Li,Xin Wang,Hanlue Zhang,Runnan Chen,Runqi Lin,Xiao He,Han Huang,Yandong Guo,Fakhri Karray,Tongliang Liu,Mingming Gong
Main category: cs.CV
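TL;DR: PosA-VLA improves the precision of action generation through a pose-conditioned anchor attention mechanism, addressing the redundant actions that existing VLA models produce because of their spatially uniform perception, while remaining efficient and lightweight.
Details
Motivation: Existing VLA models produce redundant and unstable motions when generating target-oriented actions, limiting their use in time-sensitive scenarios. The authors attribute this to a spatially uniform perception field, which lets the model be distracted by target-irrelevant objects. Contribution: Proposes the PosA-VLA framework, which uses a pose-conditioned anchor attention mechanism to improve the precision and efficiency of action generation, requires no auxiliary perception modules, and is lightweight.
Method: Anchors visual attention via pose-conditioned supervision, guiding the model's perception toward task-relevant regions and better aligning instruction semantics with actionable visual cues.
Result: Achieves precise and time-efficient behavior across diverse robotic manipulation benchmarks and shows robust generalization in challenging environments.
Insight: A pose-conditioned anchor attention mechanism can markedly improve the action generation of VLA models while keeping them lightweight and efficient.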
Abstract: Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions, as they often generate redundant or unstable motions along trajectories, limiting their applicability in time-sensitive scenarios. In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex environments. To address this issue, we propose an efficient PosA-VLA framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model’s perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction semantics with actionable visual cues, thereby improving action generation precision and efficiency. Moreover, our framework adopts a lightweight architecture and requires no auxiliary perception modules (e.g., segmentation or grounding networks), ensuring efficient inference. Extensive experiments verify that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and shows robust generalization in a variety of challenging environments.
[81] Dual-level Modality Debiasing Learning for Unsupervised Visible-Infrared Person Re-Identification
Jiaze Li,Yan Lu,Bin Liu,Guojun Yin,Mang Ye
Main category: cs.CV
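TL;DR: The paper proposes a Dual-level Modality Debiasing Learning (DMDL) framework that intervenes at both the model and optimization levels to address modality bias in unsupervised visible-infrared person re-identification.
Details
Motivation: The existing two-stage learning pipeline performs well in unsupervised visible-infrared person re-identification (USL-VI-ReID) but introduces modality bias that impairs identity discrimination and generalization. The authors propose DMDL to address this.
Contribution: (1) the DMDL framework; (2) a Causality-inspired Adjustment Intervention (CAI) module; (3) a Collaborative Bias-free Training (CBT) strategy.
Method: At the model level, the CAI module replaces likelihood-based modeling with causal modeling; at the optimization level, the CBT strategy integrates modality-specific augmentation, label refinement, and feature alignment.
Result: Experiments show that DMDL learns modality-invariant features and yields a more generalizable model.
Insight: Causal modeling combined with a collaborative training strategy can effectively reduce the effect of modality bias on model performance.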
Abstract: The two-stage learning pipeline has achieved promising results in unsupervised visible-infrared person re-identification (USL-VI-ReID). It first performs single-modality learning and then performs cross-modality learning to tackle the modality discrepancy. Although promising, this pipeline inevitably introduces modality bias: modality-specific cues learned in the single-modality training naturally propagate into the following cross-modality learning, impairing identity discrimination and generalization. To address this issue, we propose a Dual-level Modality Debiasing Learning (DMDL) framework that implements debiasing at both the model and optimization levels. At the model level, we propose a Causality-inspired Adjustment Intervention (CAI) module that replaces likelihood-based modeling with causal modeling, preventing modality-induced spurious patterns from being introduced and leading to a low-biased model. At the optimization level, a Collaborative Bias-free Training (CBT) strategy is introduced to interrupt the propagation of modality bias across data, labels, and features by integrating modality-specific augmentation, label refinement, and feature alignment. Extensive experiments on benchmark datasets demonstrate that DMDL enables modality-invariant feature learning and yields a more generalized model.
[82] Research on Brain Tumor Classification Method Based on Improved ResNet34 Network
Yufeng Li,Wenchao Zhao,Bo Dang,Weimin Wang
Main category: cs.CV
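TL;DR: The paper proposes a brain tumor classification method based on an improved ResNet34 network; multi-scale feature extraction and a channel attention mechanism improve classification efficiency and accuracy.
Details
Motivation: Traditional radiological image interpretation relies on manual methods that are slow and insufficiently accurate, and shallow convolutional neural networks also perform poorly, so a more efficient brain tumor classification method is needed.
Contribution: Proposes an improved ResNet34 network that combines a multi-scale input module with an Inception v2 module and adds channel attention, improving classification accuracy while reducing the parameter count.
Method: Uses ResNet34 as the backbone, adds a multi-scale input module as the first layer, uses an Inception v2 module as the residual downsampling layer, and integrates a channel attention mechanism to reweight feature channels.
Result: In five-fold cross-validation, the improved model reaches an average classification accuracy of about 98.8%, 1% higher than the original ResNet34, with only 80% of the original model's parameters.
Insight: Combining multi-scale feature extraction with channel attention improves classification performance while also making the model lighter.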
Abstract: Previously, image interpretation in radiology relied heavily on manual methods. However, manual classification of brain tumor medical images is time-consuming and labor-intensive. Even with shallow convolutional neural network models, the accuracy is not ideal. To improve the efficiency and accuracy of brain tumor image classification, this paper proposes a brain tumor classification model based on an improved ResNet34 network. This model uses the ResNet34 residual network as the backbone network and incorporates multi-scale feature extraction. It uses a multi-scale input module as the first layer of the ResNet34 network and an Inception v2 module as the residual downsampling layer. Furthermore, a channel attention mechanism module assigns different weights to different channels of the image from a channel domain perspective, obtaining more important feature information. The results of a five-fold cross-validation experiment show that the average classification accuracy of the improved network model is approximately 98.8%, which is 1% higher than ResNet34, while the model uses only 80% of the parameters of the original model. Therefore, the improved network model not only improves accuracy but also reduces clutter, achieving a classification effect with fewer parameters and higher accuracy.
[83] HieroGlyphTranslator: Automatic Recognition and Translation of Egyptian Hieroglyphs to English
Ahmed Nasser,Marwan Mohamed,Alaa Sherif,Basmala Mahmoud,Shereen Yehia,Asmaa Saad,Mariam S. El-Rahmany,Ensaf H. Mohamed
Main category: cs.CV
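TL;DR: Proposes a deep-learning method for automatically recognizing and translating Egyptian hieroglyphs in three stages (segmentation, code mapping, and translation), reaching a BLEU score of 42.2.
Details
Motivation: Egyptian hieroglyphs are hard to translate because the glyphs are visually complex and a single glyph can carry multiple meanings. Rapid progress in deep learning makes it possible to tackle this problem.
Contribution: Proposes a three-stage method combining Contour, Detectron2, and a CNN for recognizing and translating hieroglyphs.
Method: Three stages: (1) segmentation with Contour and Detectron2; (2) mapping symbols to Gardiner codes; (3) translation with a CNN model.
Result: The model achieves a BLEU score of 42.2, which the authors report as a significant result compared with previous research.
Insight: Combining a multi-stage pipeline with modern deep-learning tools such as Detectron2 can substantially improve translation of complex symbol systems.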
Abstract: Egyptian hieroglyphs, the ancient Egyptian writing system, are composed entirely of drawings. Translating these glyphs into English poses various challenges, including the fact that a single glyph can have multiple meanings. Deep learning translation applications are evolving rapidly, producing remarkable results that significantly impact our lives. In this research, we propose a method for the automatic recognition and translation of ancient Egyptian hieroglyphs from images to English. This study utilized two datasets for classification and translation: the Morris Franken dataset and the EgyptianTranslation dataset. Our approach is divided into three stages: segmentation (using Contour and Detectron2), mapping symbols to Gardiner codes, and translation (using the CNN model). The model achieved a BLEU score of 42.2, a significant result compared to previous research.
[84] A Robust Camera-based Method for Breath Rate Measurement
Alexey Protopopov
Main category: cs.CV
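TL;DR: The paper proposes a robust camera-based method for measuring breath rate that uses a combination of mathematical transforms to reach a relative error below 5%; on volunteer videos it outperforms earlier methods, with a mean absolute error of 0.57 respirations per minute.
Details
Motivation: Existing breath rate measurement methods are either tested under near-ideal conditions or not accurate enough, so a more robust method is needed to cope with the disturbances caused by subject movement in real-world settings.
Contribution: A robust breath rate measurement method based on a combination of mathematical transforms that is more resistant to distortions caused by subject movement and needs only low-cost hardware.
Method: Processes video data with mathematical transforms to extract the breathing signal, with algorithmic refinements that reduce the effect of motion.
Result: On videos from 14 volunteers totaling over 2 hours 30 minutes, the method achieves a mean absolute error of 0.57 respirations per minute and a relative error below 5%, better than previous work.
Insight: Low-cost cameras can measure breath rate accurately even under non-ideal conditions, which suggests potential applications in remote health monitoring.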
Abstract: Proliferation of cheap and accessible cameras makes it possible to measure a subject’s breath rate from video footage alone. Recent works on this topic have proposed a variety of approaches for accurately measuring human breath rate, however they are either tested in near-ideal conditions, or produce results that are not sufficiently accurate. The present study proposes a more robust method to measure breath rate in humans with minimal hardware requirements using a combination of mathematical transforms with a relative deviation from the ground truth of less than 5%. The method was tested on videos taken from 14 volunteers with a total duration of over 2 hours 30 minutes. The obtained results were compared to reference data and the average mean absolute error was found to be at 0.57 respirations per minute, which is noticeably better than the results from previous works. The breath rate measurement method proposed in the present article is more resistant to distortions caused by subject movement and thus allows one to remotely measure the subject’s breath rate without any significant limitations on the subject’s behavior.
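The two accuracy figures quoted above (mean absolute error in respirations per minute, and relative deviation from the ground truth) can be illustrated with a minimal sketch. The abstract does not spell out the exact definitions used, so the formulas below are the conventional ones and are an assumption; the readings are hypothetical and are not data from the paper.

```python
from statistics import mean


def mean_absolute_error(estimates, references):
    """Mean absolute error between paired breath-rate readings (respirations/min)."""
    if len(estimates) != len(references) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    return mean(abs(e - r) for e, r in zip(estimates, references))


def relative_deviation(estimate, reference):
    """Absolute deviation of one estimate as a fraction of its reference value."""
    return abs(estimate - reference) / reference


# Hypothetical readings for illustration only (not taken from the paper).
estimated = [15.5, 12.0, 18.5, 20.0]
reference = [15.0, 12.5, 18.0, 20.0]

mae = mean_absolute_error(estimated, reference)
worst = max(relative_deviation(e, r) for e, r in zip(estimated, reference))
print(f"MAE = {mae:.3f} respirations/min, worst relative deviation = {worst:.1%}")
```

Reporting both numbers is useful because the absolute error is easy to interpret clinically, while the relative deviation stays comparable across subjects with different resting breath rates.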
[85] Heatmap Pooling Network for Action Recognition from RGB Videos
Mengyuan Liu,Jinfu Liu,Yongkang Jiang,Bin He
Main category: cs.CV
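TL;DR: The paper proposes a novel heatmap pooling network (HP-Net) for action recognition from RGB videos; a feedback pooling module extracts information-rich, robust features, and spatial-motion co-learning and text refinement modulation modules further improve recognition.
Details
Motivation: Existing RGB-video action recognition methods suffer from information redundancy, sensitivity to noise, and high storage costs; a solution is needed that fully exploits the useful information in videos and extracts robust features.
Contribution: (1) HP-Net, which extracts robust and concise features through a feedback pooling module; (2) spatial-motion co-learning and text refinement modulation modules that integrate multimodal data to improve performance.
Method: HP-Net extracts pooled features of the human body with a feedback pooling module and fuses them with other multimodal data through the spatial-motion co-learning and text refinement modulation modules.
Result: Outperforms existing methods on several benchmarks (NTU RGB+D 60, NTU RGB+D 120, Toyota-Smarthome, and UAV-Human).
Insight: Heatmap pooling effectively extracts robust features from video, and multimodal fusion further improves action recognition accuracy.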
Abstract: Human action recognition (HAR) in videos has garnered widespread attention due to the rich information in RGB videos. Nevertheless, existing methods for extracting deep features from RGB videos face challenges such as information redundancy, susceptibility to noise and high storage costs. To address these issues and fully harness the useful information in videos, we propose a novel heatmap pooling network (HP-Net) for action recognition from videos, which extracts information-rich, robust and concise pooled features of the human body in videos through a feedback pooling module. The extracted pooled features demonstrate obvious performance advantages over the previously obtained pose data and heatmap features from videos. In addition, we design a spatial-motion co-learning module and a text refinement modulation module to integrate the extracted pooled features with other multimodal data, enabling more robust action recognition. Extensive experiments on several benchmarks namely NTU RGB+D 60, NTU RGB+D 120, Toyota-Smarthome and UAV-Human consistently verify the effectiveness of our HP-Net, which outperforms the existing human action recognition methods. Our code is publicly available at: https://github.com/liujf69/HPNet-Action.
[86] PULSE: A Unified Multi-Task Architecture for Cardiac Segmentation, Diagnosis, and Few-Shot Cross-Modality Clinical Adaptation
Hania Ghouse,Maryam Alsharqi,Farhad R. Nezami,Muzammil Behzad
Main category: cs.CV
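TL;DR: PULSE is a multi-task vision-language framework that unifies cardiac segmentation, disease diagnosis, and clinical report generation, and generalizes across modalities and datasets through self-supervised representations and a composite supervision strategy.
Details
Motivation: Cardiac image analysis tasks such as segmentation, classification, and report generation are usually handled by separate networks, with no unifying framework. PULSE addresses this with a multi-task architecture.
Contribution: Proposes PULSE, which the authors present as the first framework to unify cardiac segmentation, disease classification, and clinical report generation while supporting adaptation across modalities and datasets.
Method: Builds on self-supervised representations and a composite supervision strategy (region overlap learning, pixel-wise classification, and boundary-aware IoU refinement), combined with a multi-scale token reconstruction decoder and shared global representations.
Result: PULSE performs well across its tasks, generalizes across datasets and modalities, and adapts to new modalities with minimal supervision.
Insight: Task-invariant cardiac priors and a unified vision-language framework are key to scalable cardiac image analysis.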
Abstract: Cardiac image analysis remains fragmented across tasks: anatomical segmentation, disease classification, and grounded clinical report generation are typically handled by separate networks trained under different data regimes. No existing framework unifies these objectives within a single architecture while retaining generalization across imaging modalities and datasets. We introduce PULSE, a multi-task vision-language framework built on self-supervised representations and optimized through a composite supervision strategy that balances region overlap learning, pixel wise classification fidelity, and boundary aware IoU refinement. A multi-scale token reconstruction decoder enables anatomical segmentation, while shared global representations support disease classification and clinically grounded text output allowing the model to transition from pixels to structures and finally clinical reasoning within one architecture. Unlike prior task-specific pipelines, PULSE learns task-invariant cardiac priors, generalizes robustly across datasets, and can be adapted to new imaging modalities with minimal supervision. This moves the field closer to a scalable, foundation style cardiac analysis framework.
[87] Prostate biopsy whole slide image dataset from an underrepresented Middle Eastern population
Peshawa J. Muhammad Ali,Navin Vincent,Saman S. Abdulla,Han N. Mohammed Fadhl,Anders Blilie,Kelvin Szolnoky,Julia Anna Mielcarz,Xiaoyi Ji,Kimmo Kartasalo,Abdulbasit K. Al-Talabani,Nita Mulliqi
Main category: cs.CV
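TL;DR: The paper releases a dataset of 339 prostate biopsy whole-slide images from the Middle East (Iraq), addressing the underrepresentation of Middle Eastern populations in public datasets and supporting the development and validation of AI models across diverse populations.
Details
Motivation: Public pathology datasets come mainly from Western populations; regions such as the Middle East are underrepresented, which limits what is known about how AI models generalize. The dataset is released to promote research on model applicability across diverse populations.
Contribution: Provides 339 prostate biopsy whole-slide images from Iraq with Gleason scores and International Society of Urological Pathology grades assigned independently by three pathologists, supporting a range of analysis tasks.
Method: Slides were scanned with three scanners (Leica, Hamamatsu, and Grundium), de-identified, and kept in their native formats; the dataset is suited to grading concordance analysis, color normalization, and cross-scanner robustness evaluation.
Result: The dataset will be deposited in the Bioimage Archive (BIA) and released under a CC BY 4.0 license, giving the research community a more diverse data resource.
Insight: Highlights the importance of dataset diversity for AI model generalization and provides foundational data for pathology AI research in the Middle East.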
Abstract: Artificial intelligence (AI) is increasingly used in digital pathology. Publicly available histopathology datasets remain scarce, and those that do exist predominantly represent Western populations. Consequently, the generalizability of AI models to populations from less digitized regions, such as the Middle East, is largely unknown. This motivates the public release of our dataset to support the development and validation of pathology AI models across globally diverse populations. We present 339 whole-slide images of prostate core needle biopsies from a consecutive series of 185 patients collected in Erbil, Iraq. The slides are associated with Gleason scores and International Society of Urological Pathology grades assigned independently by three pathologists. Scanning was performed using two high-throughput scanners (Leica and Hamamatsu) and one compact scanner (Grundium). All slides were de-identified and are provided in their native formats without further conversion. The dataset enables grading concordance analyses, color normalization, and cross-scanner robustness evaluations. Data will be deposited in the Bioimage Archive (BIA) under accession code: to be announced (TBA), and released under a CC BY 4.0 license.
[88] Diminishing Returns in Self-Supervised Learning
Oli Bridge,Huey Sun,Botond Branyicskai-Nagy,Charles D’Ornano,Shomit Basu
Main category: cs.CV
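TL;DR: The paper studies the marginal benefits of pre-training, intermediate fine-tuning, and downstream training for a small Vision Transformer (ViT). Pre-training and fine-tuning help but with diminishing returns, while intermediate fine-tuning can hurt downstream performance, possibly because of dissimilar task mechanics. The authors suggest that small ViTs focus on targeted pre-training and careful data selection.
Details
Motivation: Transformer architectures excel in computer vision and NLP but usually need many parameters and much training data. The paper explores the marginal benefit of each training stage for a small ViT (only 5M parameters) in order to use resources better.
Contribution: Shows diminishing returns from pre-training and fine-tuning for small ViTs, points out that intermediate fine-tuning can harm downstream performance when tasks differ, and offers practical advice for training small models.
Method: Runs pre-training, intermediate fine-tuning, and downstream experiments on a small ViT using three distinct datasets and training objectives, and analyzes the performance change at each stage.
Result: Pre-training and fine-tuning help the small ViT but with diminishing returns; intermediate fine-tuning can harm downstream performance, potentially because of dissimilar task mechanics.
Insight: Small ViTs should prioritize targeted pre-training and curated data, and avoid indiscriminately stacking intermediate tasks, which can waste compute and degrade performance.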
Abstract: While transformer-based architectures have taken computer vision and NLP by storm, they often require a vast number of parameters and training data to attain strong performance. In this work, we experiment with three distinct pre-training, intermediate fine-tuning, and downstream datasets and training objectives to explore their marginal benefits on a small 5M-parameter vision transformer. We find that while pre-training and fine-tuning always help our model, they have diminishing returns, and intermediate fine-tuning can actually harm downstream performance, potentially due to dissimilarity in task mechanics. Taken together, our results suggest that small-scale ViTs benefit most from targeted pre-training and careful data selection, while indiscriminate stacking of intermediate tasks can waste compute and even degrade performance.
[89] Dual Cross-Attention Siamese Transformer for Rectal Tumor Regrowth Assessment in Watch-and-Wait Endoscopy
Jorge Tapias Gomez,Despoina Kanata,Aneesh Rangnekar,Christina Lee,Julio Garcia-Aguilar,Joshua Jesse Smith,Harini Veeraraghavan
Main category: cs.CV
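TL;DR: The authors propose SSDCA, a Siamese Swin Transformer with dual cross-attention, to detect local regrowth (LR) early in follow-up endoscopy images of rectal cancer patients and to distinguish it from clinical complete response (cCR). By combining longitudinal images with dual cross-attention, the model improves classification performance over the baselines.
Details
Motivation: After total neoadjuvant treatment (TNT), rectal cancer patients with clinical complete response (cCR) are often managed with watch-and-wait (WW) surveillance, which requires early and accurate detection of local regrowth (LR) to prevent distant metastases. Existing assessment lacks objectivity and accuracy, so a robust model is needed.
Contribution: (1) SSDCA, which combines a Siamese architecture with dual cross-attention to distinguish LR from cCR without spatial alignment of the images; (2) a demonstration of robustness to imaging variations; (3) classification performance better than the baselines.
Method: (1) Pretrained Swin Transformers extract domain-agnostic features; (2) dual cross-attention emphasizes features from the two scans; (3) longitudinal image pairs are used for training and evaluation.
Result: Trained on image pairs from 135 patients and tested on held-out pairs from 62 patients, SSDCA reaches 81.76% balanced accuracy, 90.07% sensitivity, and 72.86% specificity. UMAP clustering of the extracted features shows good separation between classes.
Insight: Dual cross-attention improves discrimination on longitudinal images without strict image alignment, offering a new approach for medical image analysis.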
Abstract: Increasing evidence supports watch-and-wait (WW) surveillance for patients with rectal cancer who show clinical complete response (cCR) at restaging following total neoadjuvant treatment (TNT). However, objectively accurate methods to early detect local regrowth (LR) from follow-up endoscopy images during WW are essential to manage care and prevent distant metastases. Hence, we developed a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) to combine longitudinal endoscopic images at restaging and follow-up and distinguish cCR from LR. SSDCA leverages pretrained Swin transformers to extract domain agnostic features and enhance robustness to imaging variations. Dual cross attention is implemented to emphasize features from the two scans without requiring any spatial alignment of images to predict response. SSDCA as well as Swin-based baselines were trained using image pairs from 135 patients and evaluated on a held-out set of image pairs from 62 patients. SSDCA produced the best balanced accuracy (81.76% $\pm$ 0.04), sensitivity (90.07% $\pm$ 0.08), and specificity (72.86% $\pm$ 0.05). Robustness analysis showed stable performance irrespective of artifacts including blood, stool, telangiectasia, and poor image quality. UMAP clustering of extracted features showed maximal inter-cluster separation (1.45 $\pm$ 0.18) and minimal intra-cluster dispersion (1.07 $\pm$ 0.19) with SSDCA, confirming discriminative representation learning.
[90] Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence
Shuai Yang,Junxin Lin,Yifan Zhou,Ziwei Liu,Chen Change Loy
Main category: cs.CV
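TL;DR: The paper proposes FRESCO, which combines intra-frame and inter-frame correspondence to strengthen spatial-temporal consistency, producing high-quality, coherent results in zero-shot video translation and editing.
Details
Motivation: Existing zero-shot video methods mainly integrate inter-frame correspondence through attention mechanisms, but this soft constraint cannot guarantee temporal consistency, so edited videos come out incoherent.
Contribution: Proposes the FRESCO framework, which fuses intra-frame and inter-frame correspondence into a stronger spatial-temporal constraint and explicitly optimizes features for high spatial-temporal consistency.
Method: (1) Combines intra-frame and inter-frame correspondence; (2) explicitly optimizes features rather than relying on attention guidance alone; (3) targets zero-shot video-to-video translation and text-guided video editing.
Result: Experiments show that FRESCO generates higher-quality, more coherent videos than existing zero-shot methods, supporting the value of the spatial-temporal constraint.
Insight: Explicitly optimizing for spatial-temporal consistency, rather than relying only on attention mechanisms, is key to generation quality in zero-shot video tasks.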
Abstract: The remarkable success in text-to-image diffusion models has motivated extensive investigation of their potential for video applications. Zero-shot techniques aim to adapt image diffusion models for videos without requiring further model training. Recent methods largely emphasize integrating inter-frame correspondence into attention mechanisms. However, the soft constraint applied to identify the valid features to attend is insufficient, which could lead to temporal inconsistency. In this paper, we present FRESCO, which integrates intra-frame correspondence with inter-frame correspondence to formulate a more robust spatial-temporal constraint. This enhancement ensures a consistent transformation of semantically similar content between frames. Our method goes beyond attention guidance to explicitly optimize features, achieving high spatial-temporal consistency with the input video, significantly enhancing the visual coherence of manipulated videos. We verify FRESCO adaptations on two zero-shot tasks of video-to-video translation and text-guided video editing. Comprehensive experiments demonstrate the effectiveness of our framework in generating high-quality, coherent videos, highlighting a significant advance over current zero-shot methods.
[91] UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework
Youxin Pang,Yong Zhang,Ruizhi Shao,Xiang Deng,Feng Gao,Xu Xiaoming,Xiaoming Wei,Yebin Liu
Main category: cs.CV
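TL;DR: UniMo is an autoregressive model that, for the first time according to the authors, jointly models 2D video and 3D human motion in a unified framework, enabling simultaneous generation and understanding of both modalities.
Details
Motivation: Existing methods mostly generate one modality conditioned on the other, or combine either with other modalities such as text and audio. Unifying 2D video and 3D motion for simultaneous optimization and generation remains largely unexplored and is challenging.
Contribution: An autoregressive framework unifying 2D video and 3D motion, with a new 3D motion tokenizer and a sequence modeling strategy, enabling simultaneous generation and understanding of both modalities.
Method: Models video and 3D motion as one unified token sequence, using separate embedding layers to mitigate distribution gaps; designs a 3D motion tokenizer with multiple expert decoders that preserves spatial information.
Result: Experiments show that UniMo generates corresponding videos and motions simultaneously and performs accurate motion capture, demonstrating the potential of unified modeling.
Insight: Drawing on the ability of large language models to fuse modalities, the work offers a new direction for integrating human-centric information and for broader multimodal joint modeling.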
Abstract: We propose UniMo, an innovative autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework, enabling simultaneous generation and understanding of these two modalities for the first time. Current methods predominantly focus on generating one modality given another as the condition or integrating either of them with other modalities such as text and audio. Unifying 2D videos and 3D motions for simultaneous optimization and generation remains largely unexplored, presenting significant challenges due to their substantial structural and distributional differences. Inspired by the LLM’s ability to unify different modalities, our method models videos and 3D motions as a unified token sequence, utilizing separate embedding layers to mitigate distribution gaps. Additionally, we devise a sequence modeling strategy that integrates two distinct tasks within a single framework, proving the effectiveness of unified modeling. Moreover, to efficiently align with visual tokens and preserve 3D spatial information, we design a novel 3D motion tokenizer with a temporal expansion strategy, using a single VQ-VAE to produce quantized motion tokens. It features multiple expert decoders that handle body shapes, translation, global orientation, and body poses for reliable 3D motion reconstruction. Extensive experiments demonstrate that our method simultaneously generates corresponding videos and motions while performing accurate motion capture. This work taps into the capacity of LLMs to fuse diverse data types, paving the way for integrating human-centric information into existing models and potentially enabling multimodal, controllable joint modeling of humans, objects, and scenes.
[92] Beyond the Ground Truth: Enhanced Supervision for Image Restoration
Donghun Ryou,Inju Ha,Sanghyeok Chu,Bohyung Han
Main category: cs.CV
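TL;DR: The paper proposes a framework that improves image restoration models by enhancing the supervision signal in the training data: it uses super-resolution to generate enhanced ground-truth images, working around the quality limits of the original ground truth.
Details
Motivation: In real-world image restoration, constraints in data acquisition limit the quality of ground-truth images, which in turn limits model performance. The paper enhances the ground truth to provide higher-quality supervision.
Contribution: A novel framework that enhances ground-truth images using adaptive frequency masks from a conditional frequency mask generator, fusing the original and super-resolved versions in the frequency domain to produce higher-quality supervision. It also introduces a lightweight output refinement network that integrates with existing restoration models.
Method: A conditional frequency mask generator learns adaptive frequency masks that guide the selective fusion of frequency components from the original ground truth and its super-resolved variants, yielding enhanced ground truth. These images are then used to train a lightweight output refinement network.
Result: Experiments show consistent improvement in restoration quality, and user studies further validate both supervision enhancement and output refinement.
Insight: Selectively enriching ground-truth detail in the frequency domain avoids hallucinated artifacts and preserves semantic consistency, giving image restoration a more reliable supervision signal.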
Abstract: Deep learning-based image restoration has achieved significant success. However, when addressing real-world degradations, model performance is limited by the quality of ground-truth images in datasets due to practical constraints in data acquisition. To address this limitation, we propose a novel framework that enhances existing ground truth images to provide higher-quality supervision for real-world restoration. Our framework generates perceptually enhanced ground truth images using super-resolution by incorporating adaptive frequency masks, which are learned by a conditional frequency mask generator. These masks guide the optimal fusion of frequency components from the original ground truth and its super-resolved variants, yielding enhanced ground truth images. This frequency-domain mixup preserves the semantic consistency of the original content while selectively enriching perceptual details, preventing hallucinated artifacts that could compromise fidelity. The enhanced ground truth images are used to train a lightweight output refinement network that can be seamlessly integrated with existing restoration models. Extensive experiments demonstrate that our approach consistently improves the quality of restored images. We further validate the effectiveness of both supervision enhancement and output refinement through user studies. Code is available at https://github.com/dhryougit/Beyond-the-Ground-Truth.
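The frequency-domain mixup described in this abstract can be sketched on a toy 1D signal. This is only an illustration under stated assumptions: the paper works on images with masks learned by a conditional generator and may fuse several super-resolved variants, whereas the sketch below uses a hand-set mask, a single enhanced signal, and a naive DFT written with the standard library. The function names `dft`, `idft`, and `frequency_mixup` are hypothetical.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns complex samples."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def frequency_mixup(original, enhanced, mask):
    """Blend per frequency bin: mask=1 keeps `original`, mask=0 keeps `enhanced`."""
    if not (len(original) == len(enhanced) == len(mask)):
        raise ValueError("all inputs must have the same length")
    spec_o, spec_e = dft(original), dft(enhanced)
    fused = [m * o + (1.0 - m) * e for m, o, e in zip(mask, spec_o, spec_e)]
    # A mask that is symmetric in bins k and n-k keeps the result real;
    # only a tiny imaginary rounding residue is discarded here.
    return [value.real for value in idft(fused)]


# Hypothetical 1D stand-ins for a ground-truth image and its super-resolved variant.
original = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]
enhanced = [1.0, 2.4, 2.8, 4.3, 3.9, 3.2, 1.7, 1.1]
# Hand-set mask (learned in the paper): trust the original at low frequencies.
mask = [1.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.5, 1.0]

fused = frequency_mixup(original, enhanced, mask)
print([round(v, 3) for v in fused])
```

Keeping the low-frequency bins from the original is what preserves the overall content in this toy setting, while the high-frequency bins taken from the enhanced signal carry the added detail.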
[93] MUT3R: Motion-aware Updating Transformer for Dynamic 3D Reconstruction
Guole Shen,Tianchen Deng,Xingrui Qin,Nailin Wang,Jianyu Wang,Yanbo Wang,Yongtao Chen,Hesheng Wang,Jingchuan Wang
Main category: cs.CV
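TL;DR: MUT3R is a training-free framework that uses implicit motion cues in a pretrained transformer to suppress dynamic content, improving temporal consistency and camera pose robustness in dynamic 3D reconstruction.
Details
Motivation: Existing stateful recurrent networks perform well on static 3D reconstruction but are prone to motion artifacts in dynamic scenes, where non-rigid regions corrupt attention. The authors find that the pretrained transformer already encodes implicit motion cues that it never explicitly uses.
Contribution: Proposes MUT3R, which needs no additional training and uses attention-derived motion cues to suppress dynamic content at inference time, addressing motion artifacts in dynamic 3D reconstruction.
Method: Aggregates the pretrained transformer's self-attention maps across layers to obtain an implicit motion cue, and applies an attention-level gating module that suppresses dynamic regions in the early layers, before their artifacts propagate through the feature hierarchy.
Result: Improves temporal consistency and camera pose robustness across multiple dynamic benchmarks.
Insight: Pretrained transformers already contain motion information, and a simple attention-level adjustment can improve reconstruction of dynamic scenes, pointing to a training-free route for 3D reconstruction.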
Abstract: Recent stateful recurrent neural networks have achieved remarkable progress on static 3D reconstruction but remain vulnerable to motion-induced artifacts, where non-rigid regions corrupt attention propagation between the spatial memory and image feature. By analyzing the internal behaviors of the state and image token updating mechanism, we find that aggregating self-attention maps across layers reveals a consistent pattern: dynamic regions are naturally down-weighted, exposing an implicit motion cue that the pretrained transformer already encodes but never explicitly uses. Motivated by this observation, we introduce MUT3R, a training-free framework that applies the attention-derived motion cue to suppress dynamic content in the early layers of the transformer during inference. Our attention-level gating module suppresses the influence of dynamic regions before their artifacts propagate through the feature hierarchy. Notably, we do not retrain or fine-tune the model; we let the pretrained transformer diagnose its own motion cues and correct itself. This early regulation stabilizes geometric reasoning in streaming scenarios and leads to improvements in temporal consistency and camera pose robustness across multiple dynamic benchmarks, offering a simple and training-free pathway toward motion-aware streaming reconstruction.
[94] TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning
Tao Wu,Li Yang,Gen Zhan,Yiting Liao,Junlin Li,Deliang Fu,Li Zhang,Limin Wang
Main category: cs.CV
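TL;DR: TempR1 is a temporal-aware multi-task reinforcement learning framework that improves the temporal understanding of multimodal large language models (MLLMs); with multi-task optimization and tailored reward design, it achieves state-of-the-art performance on a range of temporal tasks.
Details
Motivation: Existing reinforcement learning approaches to temporal understanding in MLLMs cover limited task types and data, which restricts generalization. TempR1 aims to strengthen temporal understanding through multi-task learning and a suitable optimization algorithm.
Contribution: The TempR1 framework, with a curated multi-task corpus and an optimization scheme built on the GRPO algorithm; temporal tasks are grouped into three correspondence types, each with a tailored reward function.
Method: Multi-task reinforcement learning built on GRPO, with localization rewards tailored to each type of temporal task to optimize cross-task performance.
Result: TempR1 achieves state-of-the-art performance on multiple benchmarks, and joint optimization over complementary tasks improves both generalization and single-task performance.
Insight: Multi-task learning with tailored rewards substantially improves temporal understanding and offers a scalable, principled paradigm for temporal reasoning.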
Abstract: Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis, enabling tasks such as temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has recently been explored for improving temporal reasoning, existing approaches are often confined to limited task types and data, restricting their generalization across diverse temporal understanding scenarios. To address this challenge, we present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs’ temporal comprehension. We curate a multi-task corpus that exposes the model to diverse temporal structures and semantics, and build upon the Group Relative Policy Optimization (GRPO) algorithm to achieve stable and effective cross-task optimization. Specifically, we categorize temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and design tailored localization rewards for each, enabling TempR1 to capture fine-grained temporal dependencies and adapt to different temporal patterns. Extensive experiments demonstrate that TempR1 attains state-of-the-art performance across multiple benchmarks. Moreover, its joint optimization over complementary tasks yields a strong synergistic effect, enhancing both generalization and single-task performance, establishing a scalable and principled paradigm for temporal reasoning in MLLMs.
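To make the reward-then-normalize idea concrete, here is a minimal sketch of a localization reward followed by GRPO-style group-relative advantages. The abstract does not give the actual reward definitions for the three correspondence types, so the temporal IoU reward is an assumption chosen for illustration, as is the use of the population standard deviation; the intervals are hypothetical.

```python
from statistics import mean, pstdev


def temporal_iou(pred, truth):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group, as in GRPO-style training."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: one ground-truth segment and four sampled predictions.
truth = (10.0, 20.0)
group = [(10.0, 20.0), (12.0, 22.0), (0.0, 5.0), (15.0, 20.0)]

rewards = [temporal_iou(p, truth) for p in group]
advantages = group_relative_advantages(rewards)
for interval, r, a in zip(group, rewards, advantages):
    print(f"{interval}: reward={r:.3f} advantage={a:+.3f}")
```

Because advantages are computed relative to the other samples in the same group, no separate value network is needed; a prediction is reinforced only to the extent that it localizes better than its siblings.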
[95] Training for Identity, Inference for Controllability: A Unified Approach to Tuning-Free Face Personalization
Lianyu Pang,Ji Zhou,Qiping Wang,Baoquan Zhao,Zhenguo Yang,Qing Li,Xudong Mao
Main category: cs.CV
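TL;DR: UniID is a unified tuning-free face personalization framework that integrates text-embedding and adapter-based approaches; identity-focused learning and a normalized rescaling mechanism give it both high identity fidelity and flexible text controllability.
Details
Motivation: Existing methods struggle to achieve high identity fidelity and flexible text controllability at the same time; UniID aims to solve this.
Contribution: Proposes the UniID framework, which combines text-embedding and adapter-based approaches and balances identity and controllability through a training-inference strategy.
Method: Uses an identity-focused learning scheme during training so that both branches capture only identity features, and a normalized rescaling mechanism at inference to recover text controllability.
Result: Compared against six state-of-the-art methods, UniID performs best in both identity preservation and text controllability.
Insight: Separating the roles of training and inference is key to combining high fidelity with flexible control, with the two paradigms complementing each other.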
Abstract: Tuning-free face personalization methods have developed along two distinct paradigms: text embedding approaches that map facial features into the text embedding space, and adapter-based methods that inject features through auxiliary cross-attention layers. While both paradigms have shown promise, existing methods struggle to simultaneously achieve high identity fidelity and flexible text controllability. We introduce UniID, a unified tuning-free framework that synergistically integrates both paradigms. Our key insight is that when merging these approaches, they should mutually reinforce only identity-relevant information while preserving the original diffusion prior for non-identity attributes. We realize this through a principled training-inference strategy: during training, we employ an identity-focused learning scheme that guides both branches to capture identity features exclusively; at inference, we introduce a normalized rescaling mechanism that recovers the text controllability of the base diffusion model while enabling complementary identity signals to enhance each other. This principled design enables UniID to achieve high-fidelity face personalization with flexible text controllability. Extensive experiments against six state-of-the-art methods demonstrate that UniID achieves superior performance in both identity preservation and text controllability. Code will be available at https://github.com/lyuPang/UniID
[96] DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment
Sheng-Hao Liao,Shang-Fu Chen,Tai-Ming Huang,Wen-Huang Cheng,Kai-Lung Hua
Main category: cs.CV
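TL;DR: DirectDrag is a drag-based image editing framework that needs neither manual masks nor text prompts; automatic soft mask generation and readout-guided feature alignment enable high-fidelity, precise edits.
Details
Motivation: Existing drag-based editing methods rely on manual masks and text prompts to preserve semantic fidelity and motion precision. DirectDrag aims to remove these requirements while avoiding the visual artifacts that arise without masks and the poor spatial control that arises without prompts.
Contribution: (1) An Auto Soft Mask Generation module that automatically infers editable regions; (2) a Readout-Guided Feature Alignment mechanism that uses intermediate diffusion activations to maintain structural consistency.
Method: DirectDrag combines automatic soft mask generation with readout-guided feature alignment to achieve high-fidelity editing without manually supplied masks or prompts.
Result: On DragBench and in real-world scenarios, it achieves better image quality than existing methods while maintaining competitive drag accuracy.
Insight: The inherent capacity and intermediate activations of generative models can be used for efficient, precise image editing without extra user input.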
Abstract: Drag-based image editing using generative models provides intuitive control over image structures. However, existing methods rely heavily on manually provided masks and textual prompts to preserve semantic fidelity and motion precision. Removing these constraints creates a fundamental trade-off: visual artifacts without masks and poor spatial control without prompts. To address these limitations, we propose DirectDrag, a novel mask- and prompt-free editing framework. DirectDrag enables precise and efficient manipulation with minimal user input while maintaining high image fidelity and accurate point alignment. DirectDrag introduces two key innovations. First, we design an Auto Soft Mask Generation module that intelligently infers editable regions from point displacement, automatically localizing deformation along movement paths while preserving contextual integrity through the generative model’s inherent capacity. Second, we develop a Readout-Guided Feature Alignment mechanism that leverages intermediate diffusion activations to maintain structural consistency during point-based edits, substantially improving visual fidelity. Despite operating without manual mask or prompt, DirectDrag achieves superior image quality compared to existing methods while maintaining competitive drag accuracy. Extensive experiments on DragBench and real-world scenarios demonstrate the effectiveness and practicality of DirectDrag for high-quality, interactive image manipulation. Project Page: https://frakw.github.io/DirectDrag/. Code is available at: https://github.com/frakw/DirectDrag.
[97] DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation
Zexin Lin,Hawen Wan,Yebin Zhong,Xiaoqiang
Main category: cs.CV
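TL;DR: DIQ-H is the first benchmark for evaluating the robustness of vision-language models (VLMs) under dynamic visual degradation, focusing on hallucination persistence and error recovery in temporal sequences.
Details
Motivation: In the real world, VLMs must handle imperfect continuous visual streams, but existing benchmarks focus on static, high-quality images and overlook the persistent hallucinations that temporal degradation can trigger.
Contribution: (1) The DIQ-H benchmark, which simulates physics-based degradations such as motion blur and noise; (2) Uncertainty-Guided Iterative Refinement (UIR), which improves pseudo-label quality.
Method: DIQ-H evaluates hallucination persistence through multi-turn question-answering tasks; UIR generates reliable pseudo-labels using lightweight VLMs with uncertainty filtering.
Result: Experiments show that GPT-4o reaches only a 78.5% recovery rate and open-source models fall below 60% on temporal consistency, highlighting robustness gaps in practical VLM deployment.
Insight: Temporal degradation has a marked effect on VLM performance; error recovery and temporal consistency need more weight as design goals.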
Abstract: Vision-Language Models (VLMs) deployed in safety-critical applications such as autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are critical failure modes where transient visual corruption induces hallucinations that persist across subsequent frames. We introduce DIQ-H, the first benchmark for evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth using lightweight VLMs with uncertainty filtering, achieving a 15.3 percent accuracy improvement. Experiments on 16 state-of-the-art VLMs reveal substantial robustness gaps: even advanced models such as GPT-4o achieve only a 78.5 percent recovery rate, while open-source models struggle with temporal consistency at less than 60 percent. DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployments.
[98] Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
Jialuo Li,Bin Li,Jiahao Li,Yan Lu
Main category: cs.CV
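TL;DR: The DIG framework selects video frames according to query type, using uniform sampling for global queries and a specialized pipeline for localized queries, which improves both the efficiency and the performance of long-form video understanding.
Details
Motivation: Long-form video understanding is constrained by limited compute and context length. Existing methods apply complex frame-selection mechanisms to every query, which is computationally expensive and often unnecessary.
Contribution: Proposes DIG, a framework that adapts its frame-selection strategy to the query type (global or localized), improving efficiency and performance.
Method: DIG has two branches: global queries use uniform sampling, while localized queries activate a specialized pipeline that extracts query-relevant frames. The framework is training-free, which avoids additional computational overhead.
Result: On three long-form video understanding benchmarks, DIG outperforms existing baselines and continues to improve performance when the input frame count is scaled to 256.
Insight: Global queries do not need complex frame-selection mechanisms, whereas localized queries need a targeted strategy; this adaptation to query type is the key idea.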
Abstract: The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, methods that often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global queries and localized queries. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically, DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.
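The routing idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the query type is assumed to be given as a string, and the per-frame `relevance` scores used for localized queries are a hypothetical stand-in for whatever the specialized pipeline computes.

```python
from typing import List, Sequence


def uniform_indices(num_frames: int, budget: int) -> List[int]:
    """Pick `budget` frame indices spread evenly over the video."""
    if num_frames <= 0 or budget <= 0:
        return []
    budget = min(budget, num_frames)
    # Take the midpoint of each of `budget` equal-length segments.
    return [int((i + 0.5) * num_frames / budget) for i in range(budget)]


def select_frames(query_type: str, num_frames: int, budget: int,
                  relevance: Sequence[float] = ()) -> List[int]:
    """Route by query type: uniform for global, top-scoring frames for localized.

    `relevance` is a hypothetical per-frame query-relevance score (assumption).
    """
    if query_type == "global":
        return uniform_indices(num_frames, budget)
    if query_type == "localized":
        if len(relevance) != num_frames:
            raise ValueError("need one relevance score per frame")
        ranked = sorted(range(num_frames), key=lambda i: relevance[i], reverse=True)
        return sorted(ranked[:budget])  # restore temporal order
    raise ValueError(f"unknown query type: {query_type!r}")
```

The global branch costs nothing beyond index arithmetic, which is the efficiency argument the abstract makes; only the localized branch pays for scoring.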
[99] Emergent Outlier View Rejection in Visual Geometry Grounded Transformers
Jisang Han,Sunghwan Hong,Jaewoo Jung,Wooseok Jang,Honggyu An,Qianqian Wang,Seungryong Kim,Chen Feng
Main category: cs.CV
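TL;DR: The paper finds that existing feed-forward 3D reconstruction models (e.g., VGGT), despite lacking explicit outlier-handling mechanisms, naturally exhibit noise-suppressing behavior in a specific layer, which can be used to reject outlier views without supervision.
Details
Motivation: Traditional 3D reconstruction pipelines handle noisy images through geometric verification and outlier rejection, whereas feed-forward models lack these mechanisms and degrade in in-the-wild scenes. The paper explores the latent noise-filtering ability of feed-forward models.
Contribution: Reveals that a specific layer in VGGT exhibits natural outlier-suppressing behavior and uses this property for outlier-view rejection without additional fine-tuning or supervision.
Method: Analyzes the layer-wise behavior of VGGT under synthetic distractor images, identifies the layer with noise-suppressing capability, and uses its internal representations for noise filtering.
Result: Experiments on controlled and in-the-wild datasets validate the effectiveness and generalization of the method.
Insight: Some layers of feed-forward models may implicitly encode a noise-filtering ability, which suggests a new way to improve the robustness of 3D reconstruction models.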
Abstract: Reliable 3D reconstruction from in-the-wild image collections is often hindered by “noisy” images, i.e., irrelevant inputs with little or no view overlap with others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that an existing feed-forward reconstruction model such as VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable an effective noise-filtering capability, which we simply leverage to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios.
[100] Ultra-lightweight Neural Video Representation Compression
Ho Man Kwan,Tianhao Peng,Ge Gao,Fan Zhang,Mike Nilsson,Andrew Gower,David Bull
Main category: cs.CV
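TL;DR: NVRC-Lite is an ultra-lightweight neural video representation compression method that combines multi-scale feature grids with an improved entropy coding technique, substantially improving compression performance and speed.
Details
Motivation: Existing video compression methods based on implicit neural representations (INRs) fall short in computational complexity and coding speed, which limits their practical use. NVRC-Lite aims to provide an efficient, low-complexity solution.
Contribution: 1. Proposes NVRC-Lite, which extends NVRC toward lightweight representations. 2. Introduces multi-scale feature grids to improve INR performance at low complexity. 3. Proposes an octree-based context model to accelerate entropy coding.
Method: 1. Uses multi-scale feature grids to improve the INR. 2. Replaces autoregressive models with an octree-based context model to speed up entropy coding.
Result: Compared with C3, NVRC-Lite achieves up to 21.03% and 23.06% BD-rate savings measured in PSNR and MS-SSIM, respectively, with 8.4x faster encoding and 2.5x faster decoding.
Insight: Combining multi-scale feature grids with a new entropy coding technique is an effective route to better lightweight video compression.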
Abstract: Recent works have demonstrated the viability of utilizing over-fitted implicit neural representations (INRs) as alternatives to autoencoder-based models for neural video compression. Among these INR-based video codecs, Neural Video Representation Compression (NVRC) was the first to adopt a fully end-to-end compression framework that compresses INRs, achieving state-of-the-art performance. Moreover, some recently proposed lightweight INRs have shown comparable performance to their baseline codecs with computational complexity lower than 10kMACs/pixel. In this work, we extend NVRC toward lightweight representations, and propose NVRC-Lite, which incorporates two key changes. Firstly, we integrated multi-scale feature grids into our lightweight neural representation, and the use of higher resolution grids significantly improves the performance of INRs at low complexity. Secondly, we address the issue that existing INRs typically leverage autoregressive models for entropy coding: these are effective but impractical due to their slow coding speed. In this work, we propose an octree-based context model for entropy coding high-dimensional feature grids, which accelerates the entropy coding module of the model. Our experimental results demonstrate that NVRC-Lite outperforms C3, one of the best lightweight INR-based video codecs, with up to 21.03% and 23.06% BD-rate savings when measured in PSNR and MS-SSIM, respectively, while achieving 8.4x encoding and 2.5x decoding speedup. The implementation of NVRC-Lite will be made available.
[101] C3G: Learning Compact 3D Representations with 2K Gaussians
Honggyu An,Jaewoo Jung,Mungyeom Kim,Sunghwan Hong,Chaehyun Kim,Kazumi Fukuda,Minkyeong Jeon,Jisang Han,Takuya Narihira,Hyuna Ko,Junsu Kim,Yuki Mitsufuji,Seungryong Kim
Main category: cs.CV
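TL;DR: C3G proposes a novel feed-forward framework that generates compact 3D Gaussians only at essential spatial locations, reducing redundancy and improving feature lifting.
Details
Motivation: Existing methods use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding, but they suffer from redundant Gaussians, high memory overhead, and poor multi-view feature aggregation.
Contribution: C3G guides Gaussian generation with learnable tokens and self-attention so that each Gaussian integrates relevant visual features across views, yielding a compact and efficient 3D representation.
Method: 1. Learnable tokens aggregate multi-view features through self-attention; 2. The learned attention patterns are reused for Gaussian decoding to lift features efficiently; 3. Compact Gaussians are generated only at essential spatial locations.
Result: On pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation, C3G performs strongly, with better memory efficiency and higher feature fidelity.
Insight: A compact yet geometrically meaningful set of Gaussians is sufficient for high-quality 3D scene reconstruction and understanding, while significantly reducing redundancy and memory overhead.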
Abstract: Reconstructing and understanding 3D scenes from unposed sparse views in a feed-forward manner remains a challenging task in 3D computer vision. Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. However, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation, leading to degraded novel view synthesis and scene understanding performance. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns for Gaussian decoding to efficiently lift features. Extensive experiments on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation demonstrate our approach’s effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction and understanding, achieving superior memory efficiency and feature fidelity compared to existing methods.
[102] PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
Xiaolong Li,Youping Gu,Xi Lin,Weijie Wang,Bohan Zhuang
Main category: cs.CV
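TL;DR: The paper proposes Pyramid Sparse Attention (PSA), which uses multi-level, dynamically assigned pooled KV representations to reduce information loss under high sparsity, improving the efficiency and quality of video understanding and generation.
Details
Motivation: Existing sparse attention mechanisms lose substantial information under high sparsity because they rely on binary masks, which hurts model performance. The paper proposes a finer-grained, multi-level sparse attention mechanism to address this.
Contribution: Proposes PSA, which dynamically assigns multi-level pooled KV representations to retain more information under high sparsity, enabling efficient video understanding and generation.
Method: PSA uses multi-level pooled KV representations: each query block assigns a pooling level to each KV block according to its importance. A hardware-friendly kernel with a decoupled block-tile design ensures efficient execution.
Result: On video understanding and generation tasks, PSA outperforms or matches existing sparse attention baselines, with a better efficiency-quality trade-off.
Insight: PSA's dynamic multi-level pooling is analogous to feature pyramid networks in computer vision; its finer mask granularity preserves more contextual information under high sparsity.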
Abstract: Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks with binary masks, resulting in substantial information loss under high sparsity. To address this limitation, we present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. Specifically, each query block dynamically allocates lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning. This design, analogous to fixed-point quantization and classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget. It works with a native, hardware-friendly kernel that leverages a decoupled block-tile design to ensure efficient execution. Across video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or matching existing sparse attention baselines with superior efficiency-quality trade-offs. Our code and model weights are publicly available at: http://ziplab.co/PSA
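A rough Python sketch of the level-assignment idea follows. It is an illustration under stated assumptions, not the PSA kernel: the per-level `quotas`, the rank-based assignment, and mean pooling with factor 2^level are all hypothetical choices made here for clarity.

```python
from typing import List, Optional

Vector = List[float]


def mean_pool(block: List[Vector], factor: int) -> List[Vector]:
    """Average consecutive groups of `factor` token vectors."""
    pooled = []
    for start in range(0, len(block), factor):
        group = block[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def assign_levels(importance: List[float], quotas: List[int]) -> List[Optional[int]]:
    """Most important blocks get level 0, the next ones level 1, and so on.

    quotas[l] is how many blocks may use level l (assumed budget scheme);
    blocks left over are pruned and marked None.
    """
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    levels: List[Optional[int]] = [None] * len(importance)
    pos = 0
    for level, quota in enumerate(quotas):
        for idx in order[pos:pos + quota]:
            levels[idx] = level
        pos += quota
    return levels


def pyramid_kv(blocks: List[List[Vector]], importance: List[float],
               quotas: List[int]) -> List[Optional[List[Vector]]]:
    """Pool each KV block by 2**level; pruned blocks become None."""
    levels = assign_levels(importance, quotas)
    return [None if lvl is None else mean_pool(blk, 2 ** lvl)
            for blk, lvl in zip(blocks, levels)]
```

Compared with a binary mask, a block at level 1 or 2 still contributes a coarse summary instead of being dropped outright, which is the interpolation between full retention and pruning that the abstract describes.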
[103] Fast & Efficient Normalizing Flows and Applications of Image Generative Models
Sandeep Nagar
Main category: cs.CV
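TL;DR: The thesis presents several contributions on improving the efficiency of generative models (especially normalizing flows) and on applying them in computer vision.
Details
Motivation: To improve the efficiency of generative models and apply them to real-world computer vision problems in agriculture, geology, autonomous driving, and art restoration.
Contribution: 1) Six key improvements to normalizing flow architectures; 2) Applications of generative models to agricultural quality assessment, geological mapping, privacy preservation, and art restoration.
Method: 1) Proposes invertible convolution layers, an efficient coupling layer, and a parallel inversion algorithm; 2) Applies conditional GANs, autoencoders, and diffusion models to real-world problems.
Result: Reports gains in efficiency and across several applications, including reduced parameter counts, improved accuracy, and better privacy preservation.
Insight: The improvements to normalizing flows and the cross-domain applications of generative models indicate their potential for solving complex problems efficiently.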
Abstract: This thesis presents novel contributions in two primary areas: advancing the efficiency of generative models, particularly normalizing flows, and applying generative models to solve real-world computer vision challenges. The first part introduces significant improvements to normalizing flow architectures through six key innovations: 1) development of invertible 3x3 convolution layers with mathematically proven necessary and sufficient conditions for invertibility; 2) introduction of a more efficient Quad-coupling layer; 3) design of a fast and efficient parallel inversion algorithm for kxk convolutional layers; 4) a fast and efficient backpropagation algorithm for the inverse of convolution; 5) use of the inverse of convolution, in Inverse-Flow, for the forward pass, trained with the proposed backpropagation algorithm; and 6) Affine-StableSR, a compact and efficient super-resolution model that leverages pre-trained weights and Normalizing Flow layers to reduce parameter count while maintaining performance. The second part presents: 1) an automated quality assessment system for agricultural produce using Conditional GANs to address class imbalance, data scarcity, and annotation challenges, achieving good accuracy in seed purity testing; 2) an unsupervised geological mapping framework utilizing stacked autoencoders for dimensionality reduction, showing improved feature extraction compared to conventional methods; 3) a privacy-preserving method for autonomous driving datasets based on face detection and image inpainting; 4) Stable Diffusion-based image inpainting that replaces detected faces and license plates, advancing privacy-preserving techniques and ethical considerations in the field; and 5) an adapted diffusion model for art restoration that effectively handles multiple types of degradation through unified fine-tuning.
[104] RELIC: Interactive Video World Model with Long-Horizon Memory
Yicong Hong,Yiqun Mei,Chongjian Ge,Yiran Xu,Yang Zhou,Sai Bi,Yannick Hold-Geoffroy,Mike Roberts,Matthew Fisher,Eli Shechtman,Kalyan Sunkavalli,Feng Liu,Zhengqi Li,Hao Tan
Main category: cs.CV
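TL;DR: RELIC is a real-time interactive video world model that implements long-horizon memory with compressed historical latent tokens in the KV cache, supports implicit 3D-consistent content retrieval, and enables long-duration generation through a new self-forcing paradigm.
Details
Motivation: Existing methods usually address only one of real-time long-horizon streaming, consistent spatial memory, or precise user control. RELIC aims to tackle all three at once to obtain a truly interactive world model.
Contribution: RELIC proposes a unified framework that supports real-time long-horizon memory, implicit 3D-consistent content retrieval, and user control, and extends the generation duration through a memory-efficient self-forcing paradigm.
Method: Builds on autoregressive video-diffusion distillation, compressing historical latent tokens and keeping them in the KV cache; distills a bidirectional teacher model into a causal student model to enable long-duration generation.
Result: RELIC generates in real time at 16 FPS and shows more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval.
Insight: Combining a compressed memory structure with efficient distillation gives RELIC a strong foundation for next-generation interactive world modeling.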
Abstract: A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles these three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, we fine-tune a bidirectional teacher video model to generate sequences beyond its original 5-second training horizon, and transform it into a causal student generator using a new memory-efficient self-forcing paradigm that enables full-context distillation over long-duration teacher as well as long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated Unreal Engine-rendered dataset, RELIC achieves real-time generation at 16 FPS while demonstrating more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.
[105] SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
Siyi Chen,Mikaela Angelina Uy,Chan Hee Song,Faisal Ladhak,Adithyavairavan Murali,Qing Qu,Stan Birchfield,Valts Blukis,Jonathan Tremblay
Main category: cs.CV
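TL;DR: SpaceTools enhances the spatial reasoning of vision-language models (VLMs) through Double Interactive Reinforcement Learning (DIRL), combining multiple tools (e.g., depth estimators and segmentation models) with multi-stage training to improve performance substantially.
Details
Motivation: Existing VLMs fall short on precise spatial reasoning, while traditional approaches rely on handcrafted prompting or fixed tool pipelines, which limits the model's ability to discover optimal tool-use patterns.
Contribution: Proposes the DIRL framework, which achieves multi-tool coordination through two-phase training (teaching and exploration) and significantly improves the spatial reasoning of VLMs.
Method: In the teaching phase, DIRL combines demonstrations from a single-tool specialist trained with interactive RL and tool-use traces from a frontier model. In the exploration phase, continued RL refines multi-tool coordination.
Result: SpaceTools achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home and others) and performs reliably in real-world robot manipulation, improving over the SFT and RL baselines by +12% and +16% on RoboSpatial, respectively.
Insight: Through interactive learning and multi-tool coordination, DIRL offers a new approach to precise spatial reasoning in VLMs and demonstrates the potential of tool augmentation.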
Abstract: Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs’ ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.
[106] PosterCopilot: Toward Layout Reasoning and Controllable Editing for Professional Graphic Design
Jiazhe Wei,Ken Li,Tianyu Lao,Haofan Wang,Liang Wang,Caifeng Shan,Chenyang Si
Main category: cs.CV
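TL;DR: PosterCopilot is a framework for professional graphic design. It introduces a progressive three-stage training strategy that improves the layout reasoning and controllable editing abilities of multimodal models, and couples the trained model with generative models for layer-controllable, iterative editing.
Details
Motivation: Existing automated design methods based on large multimodal models fall short in geometric accuracy, layout plausibility, and layer-specific editing, so they cannot meet the iterative needs of professional design workflows.
Contribution: 1. Proposes a progressive three-stage training strategy that improves the model's geometric understanding and aesthetic reasoning; 2. Develops a layer-controllable, iterative editing workflow that couples the model with generative models for precise element refinement while maintaining global consistency.
Method: 1. Perturbed Supervised Fine-Tuning; 2. Reinforcement Learning for Visual-Reality Alignment; 3. Reinforcement Learning from Aesthetic Feedback; 4. A layer-wise editing workflow coupled with generative models.
Result: Experiments show that PosterCopilot produces geometrically accurate and aesthetically superior layouts and offers unprecedented controllability.
Insight: Combining staged training with generative models can significantly improve the geometric accuracy and editing flexibility of automated design tools, bringing them closer to professional needs.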
Abstract: Graphic design forms the cornerstone of modern visual communication, serving as a vital medium for promoting cultural and commercial events. Recent advances have explored automating this process using Large Multimodal Models (LMMs), yet existing methods often produce geometrically inaccurate layouts and lack the iterative, layer-specific editing required in professional workflows. To address these limitations, we present PosterCopilot, a framework that advances layout reasoning and controllable editing for professional graphic design. Specifically, we introduce a progressive three-stage training strategy that equips LMMs with geometric understanding and aesthetic reasoning for layout design, consisting of Perturbed Supervised Fine-Tuning, Reinforcement Learning for Visual-Reality Alignment, and Reinforcement Learning from Aesthetic Feedback. Furthermore, we develop a complete workflow that couples the trained LMM-based design model with generative models, enabling layer-controllable, iterative editing for precise element refinement while maintaining global visual consistency. Extensive experiments demonstrate that PosterCopilot achieves geometrically accurate and aesthetically superior layouts, offering unprecedented controllability for professional iterative design.
[107] Unique Lives, Shared World: Learning from Single-Life Videos
Tengda Han,Sayna Ebrahimi,Dilara Gokay,Li Yang Ku,Maks Ovsjanikov,Iva Babukova,Daniel Zoran,Viorica Patraucean,Joao Carreira,Andrew Zisserman,Dima Damen
Main category: cs.CV
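TL;DR: The paper proposes a new "single-life" learning paradigm: a distinct vision model is trained on egocentric videos captured by one individual, using the multiple viewpoints captured naturally within that life to learn a visual encoder in a self-supervised manner. Experiments show that this yields highly aligned geometric understanding and that the learned representations transfer to downstream tasks.
Details
Motivation: Multi-view videos from a single life capture the shared structure of the world; the paper shows that this structure provides a consistent signal for visual representation learning.
Contribution: 1. Proposes the single-life learning paradigm; 2. Develops a cross-attention-based metric to quantify the functional alignment of internal representations; 3. Shows that single-life models transfer effectively to unseen environments.
Method: Trains a visual encoder with self-supervised learning on egocentric videos from a single life and introduces a cross-attention-based metric to assess how consistent the internal representations of different models are.
Result: 1. Models trained on different lives show highly aligned geometric understanding; 2. Single-life models transfer effectively to tasks such as depth estimation; 3. Performance is comparable to training on diverse web data.
Insight: The shared structure of the world is a powerful signal for visual representation learning; even data from a single life captures this consistency.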
Abstract: We introduce the “single-life” learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets each capturing a different life, both indoors and outdoors, as well as introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person’s life leads to comparable performance to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world both leads to consistency in models trained on individual lives and provides a powerful signal for visual representation learning.
physics.ins-det [Back]
[108] Kaleidoscopic Scintillation Event Imaging
Alex Bocchieri,John Mamish,David Appleyard,Andreas Velten
Main category: physics.ins-det
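TL;DR: The paper proposes a new scintillator design (a kaleidoscopic scintillator) that increases light collection while preserving spatial information, enabling high-resolution imaging of individual high-energy particle events.
Details
Motivation: Existing methods typically use fast single-pixel detectors to detect scintillation events. Cameras provide spatial resolution but capture only an average over many events, which makes it difficult to image individual particle events. Single-photon avalanche diode (SPAD) cameras combine speed and spatial resolution, but the very low brightness of events remains a challenge.
Contribution: 1. Proposes a kaleidoscopic scintillator design that increases light collection through mirror reflections; 2. Develops theory and an algorithm to estimate the 3D position of an event; 3. Performs high-resolution event measurements with a commercial CMOS SPAD camera.
Method: A scintillator with kaleidoscopic geometry produces mirror reflections of an event at known locations; an algorithm then estimates the 3D position of the original event from these images.
Result: Experiments show that the kaleidoscopic scintillator design provides sufficient light collection for advanced radiation imaging techniques.
Insight: Optimizing light collection through geometric design enables 3D imaging of individual high-energy particle events at very low brightness, offering a new approach to radiation detection and particle tracking.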
Abstract: Scintillators are transparent materials that interact with high-energy particles and emit visible light as a result. They are used in state of the art methods of measuring high-energy particles and radiation sources. Most existing methods use fast single-pixel detectors to detect and time scintillation events. Cameras provide spatial resolution but can only capture an average over many events, making it difficult to image the events associated with an individual particle. Emerging single-photon avalanche diode cameras combine speed and spatial resolution to enable capturing images of individual events. This allows us to use machine vision techniques to analyze events, enabling new types of detectors. The main challenge is the very low brightness of the events. Techniques have to work with a very limited number of photons. We propose a kaleidoscopic scintillator to increase light collection in a single-photon camera while preserving the event’s spatial information. The kaleidoscopic geometry creates mirror reflections of the event in known locations for a given event location that are captured by the camera. We introduce theory for imaging an event in a kaleidoscopic scintillator and an algorithm to estimate the event’s 3D position. We find that the kaleidoscopic scintillator design provides sufficient light collection to perform high-resolution event measurements for advanced radiation imaging techniques using a commercial CMOS single-photon camera. Code and data are available at https://github.com/bocchs/kaleidoscopic_scintillator.
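The statement that mirror reflections appear "in known locations for a given event location" rests on ordinary plane reflection. The Python sketch below shows only that geometric step, assuming ideal planar mirrors given as a unit normal and an offset; the mirror layout in the usage note is hypothetical and is not the scintillator geometry from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def reflect(point: Point, normal: Point, offset: float) -> Point:
    """Mirror `point` across the plane {x : normal . x = offset}.

    `normal` is assumed to be unit length.
    """
    signed_dist = sum(n * p for n, p in zip(normal, point)) - offset
    return tuple(p - 2.0 * signed_dist * n for p, n in zip(point, normal))


def first_order_images(event: Point,
                       mirrors: Sequence[Tuple[Point, float]]) -> List[Point]:
    """Virtual images of one event, one per mirror (single bounce only)."""
    return [reflect(event, normal, offset) for normal, offset in mirrors]
```

For example, with two hypothetical mirrors at x = 0 and x = 5, an event at (1, 2, 3) has virtual images at (-1, 2, 3) and (9, 2, 3). Because each image position is a deterministic function of the event position, observing several images constrains where the event occurred.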
eess.IV [Back]
[109] Tada-DIP: Input-adaptive Deep Image Prior for One-shot 3D Image Reconstruction
Evan Bell,Shijun Liang,Ismail Alkhouri,Saiprasad Ravishankar
Main category: eess.IV
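TL;DR: Proposes Tada-DIP, which combines input adaptation and denoising regularization to address overfitting in 3D image reconstruction and performs strongly on sparse-view X-ray CT reconstruction.
Details
Motivation: Deep Image Prior (DIP) has seen limited application to 3D image reconstruction and suffers from overfitting. Tada-DIP aims to address these issues and improve 3D reconstruction quality.
Contribution: Proposes Tada-DIP, which combines input adaptation and denoising regularization, improves the quality of 3D image reconstruction, and avoids the overfitting typical of DIP.
Method: Tada-DIP improves 3D image reconstruction by combining input adaptation with denoising regularization.
Result: In sparse-view X-ray CT reconstruction experiments, Tada-DIP outperforms training-data-free baselines and is on par with a supervised network trained on a large dataset of fully sampled volumes.
Insight: Input adaptation and regularization are key to making DIP work well for 3D reconstruction, particularly for avoiding overfitting.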
Abstract: Deep Image Prior (DIP) has recently emerged as a promising one-shot neural-network based image reconstruction method. However, DIP has seen limited application to 3D image reconstruction problems. In this work, we introduce Tada-DIP, a highly effective and fully 3D DIP method for solving 3D inverse problems. By combining input-adaptation and denoising regularization, Tada-DIP produces high-quality 3D reconstructions while avoiding the overfitting phenomenon that is common in DIP. Experiments on sparse-view X-ray computed tomography reconstruction validate the effectiveness of the proposed method, demonstrating that Tada-DIP produces much better reconstructions than training-data-free baselines and achieves reconstruction performance on par with a supervised network trained using a large dataset with fully-sampled volumes.
cs.CY [Back]
[110] Culture Affordance Atlas: Reconciling Object Diversity Through Functional Mapping
Joan Nwatu,Longju Bai,Oana Ignat,Rada Mihalcea
Main category: cs.CY
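TL;DR: The paper proposes a function-centric framework (the Culture Affordance Atlas) that re-annotates and categorizes objects by function to reduce cultural bias in mainstream vision-language datasets. The approach narrows the performance gap between high- and low-income groups.
Details
Motivation: Mainstream vision-language datasets exhibit cultural bias, disproportionately favoring higher-income, Western contexts. This reduces model generalizability and widens performance disparities, especially for lower-income and non-Western communities.
Contribution: 1. Proposes a function-centric framework that categorizes objects by the functions they fulfill; 2. Builds the Culture Affordance Atlas dataset, covering 46 functions and 288 objects; 3. Shows empirically that the approach reduces performance gaps.
Method: Re-annotates the Dollar Street dataset, categorizing objects by function, and conducts an empirical analysis with the CLIP model.
Result: Function-centric labels reduce the performance gap between high- and low-income groups by a median of 6 percentage points (statistically significant).
Insight: Function-centric categorization helps build more inclusive datasets and improves the fairness of AI systems across diverse cultural contexts.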
Abstract: Culture shapes the objects people use and for what purposes, yet mainstream Vision-Language (VL) datasets frequently exhibit cultural biases, disproportionately favoring higher-income, Western contexts. This imbalance reduces model generalizability and perpetuates performance disparities, especially impacting lower-income and non-Western communities. To address these disparities, we propose a novel function-centric framework that categorizes objects by the functions they fulfill, across diverse cultural and economic contexts. We implement this framework by creating the Culture Affordance Atlas, a re-annotated and culturally grounded restructuring of the Dollar Street dataset spanning 46 functions and 288 objects, publicly available at https://lit.eecs.umich.edu/CultureAffordance-Atlas/index.html. Through extensive empirical analyses using the CLIP model, we demonstrate that function-centric labels substantially reduce socioeconomic performance gaps between high- and low-income groups by a median of 6 pp (statistically significant), improving model effectiveness for lower-income contexts. Furthermore, our analyses reveal numerous culturally essential objects that are frequently overlooked in prominent VL datasets. Our contributions offer a scalable pathway toward building inclusive VL datasets and equitable AI systems.
cs.RO [Back]
[111] Multi-Agent Reinforcement Learning and Real-Time Decision-Making in Robotic Soccer for Virtual Environments
Aya Taourirte,Md Sohag Mia
Main category: cs.RO
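TL;DR: The paper proposes a unified multi-agent reinforcement learning framework for real-time decision-making in complex dynamic environments, using hierarchical reinforcement learning and mean-field theory to improve performance and stability.
Details
Motivation: Existing multi-agent reinforcement learning methods struggle with multi-granularity tasks and large-scale interactions, especially in adversarial environments such as robotic soccer, which require real-time decision-making and efficient cooperation.
Contribution: 1. Establishes a PPO-based baseline for real-time action scheduling; 2. Introduces a hierarchical reinforcement learning framework whose high-level layer is modeled as a Semi-Markov Decision Process; 3. Integrates mean-field theory to simplify multi-agent interactions.
Method: 1. Uses PPO as the baseline; 2. Designs a hierarchical reinforcement learning structure (high-level trajectory planning and low-level action execution); 3. Applies mean-field theory to simplify many-agent interactions.
Result: In 4v4 matches, the mean-field actor-critic method reaches 5.93 average goals, 89.1% ball control, and 92.3% passing accuracy.
Insight: Combining a hierarchical structure with mean-field theory effectively addresses real-time decision-making and cooperation in complex multi-agent tasks.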
Abstract: The deployment of multi-agent systems in dynamic, adversarial environments like robotic soccer necessitates real-time decision-making, sophisticated cooperation, and scalable algorithms to avoid the curse of dimensionality. While Reinforcement Learning (RL) offers a promising framework, existing methods often struggle with the multi-granularity of tasks (long-term strategy vs. instant actions) and the complexity of large-scale agent interactions. This paper presents a unified Multi-Agent Reinforcement Learning (MARL) framework that addresses these challenges. First, we establish a baseline using Proximal Policy Optimization (PPO) within a client-server architecture for real-time action scheduling, with PPO demonstrating superior performance (4.32 avg. goals, 82.9% ball control). Second, we introduce a Hierarchical RL (HRL) structure based on the options framework to decompose the problem into a high-level trajectory planning layer (modeled as a Semi-Markov Decision Process) and a low-level action execution layer, improving global strategy (avg. goals increased to 5.26). Finally, to ensure scalability, we integrate mean-field theory into the HRL framework, simplifying many-agent interactions into a single agent vs. the population average. Our mean-field actor-critic method achieves a significant performance boost (5.93 avg. goals, 89.1% ball control, 92.3% passing accuracy) and enhanced training stability. Extensive simulations of 4v4 matches in the Webots environment validate our approach, demonstrating its potential for robust, scalable, and cooperative behavior in complex multi-agent domains.
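The mean-field step, "simplifying many-agent interactions into a single agent vs. the population average", can be sketched in Python as follows. This is a minimal illustration, assuming discrete actions encoded as integers; the function name and the one-hot averaging are choices made here and are not taken from the paper.

```python
from typing import List


def mean_field_action(actions: List[int], agent: int, n_actions: int) -> List[float]:
    """Average one-hot action of every agent except `agent`.

    A mean-field critic would condition on (state, own action, this vector)
    instead of the full joint action of all other agents.
    """
    others = [a for i, a in enumerate(actions) if i != agent]
    if not others:
        return [0.0] * n_actions
    counts = [0] * n_actions
    for a in others:
        counts[a] += 1
    return [c / len(others) for c in counts]
```

The size of the returned vector depends only on the number of actions, not on the number of agents, which is why this summary scales better than enumerating joint actions.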
[112] MSG-Loc: Multi-Label Likelihood-based Semantic Graph Matching for Object-Level Global Localization
Gihyeon Lee,Jungwoo Lee,Juwon Kim,Young-Sik Shin,Younggun Cho
Main category: cs.RO
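TL;DR: MSG-Loc proposes a multi-label likelihood-based semantic graph matching framework for object-level global localization, addressing the object misclassification and incorrect associations caused by semantic ambiguity.
Details
Motivation: Robots must localize accurately in environments with unknown object classes and semantic ambiguity. High semantic ambiguity raises the risk of object misclassification and incorrect associations, which can cause severe pose estimation errors.
Contribution: Proposes a multi-label graph representation framework that improves the accuracy of semantic graph matching through context-aware likelihood propagation and applies to both closed-set and open-set detection configurations.
Method: Uses multi-label graph representations to capture the semantic context of object observations, and combines the likelihood of each node with the maximum likelihood of its neighbors through context-aware likelihood propagation to improve semantic correspondence.
Result: The method scales to large-vocabulary object categories in both real-world indoor scenes and synthetic environments and improves the accuracy of data association and pose estimation.
Insight: Multi-label graph representations and context-aware likelihood propagation effectively reduce the impact of semantic ambiguity, offering a new approach to object-level global localization in complex environments.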
Abstract: Robots are often required to localize in environments with unknown object classes and semantic ambiguity. However, when performing global localization using semantic objects, high semantic ambiguity intensifies object misclassification and increases the likelihood of incorrect associations, which in turn can cause significant errors in the estimated pose. Thus, in this letter, we propose a multi-label likelihood-based semantic graph matching framework for object-level global localization. The key idea is to exploit multi-label graph representations, rather than single-label alternatives, to capture and leverage the inherent semantic context of object observations. Based on these representations, our approach enhances semantic correspondence across graphs by combining the likelihood of each node with the maximum likelihood of its neighbors via context-aware likelihood propagation. For rigorous validation, data association and pose estimation performance are evaluated under both closed-set and open-set detection configurations. In addition, we demonstrate the scalability of our approach to large-vocabulary object categories in both real-world indoor scenes and synthetic environments.
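The propagation step, "combining the likelihood of each node with the maximum likelihood of its neighbors", is sketched below in Python. The abstract does not give the exact combination rule, so this sketch assumes one: the node's own likelihood is multiplied by the average, over its query-graph neighbors, of the best likelihood among the candidate's map-graph neighbors. The function name and this rule are hypothetical.

```python
from typing import Dict, List


def propagate(node_lik: List[List[float]],
              query_adj: Dict[int, List[int]],
              map_adj: Dict[int, List[int]]) -> List[List[float]]:
    """Context-aware score for matching query node i to map node j.

    node_lik[i][j] is the likelihood that query node i and map node j are the
    same object. The multiplicative combination below is an assumption.
    """
    scores = []
    for i, row in enumerate(node_lik):
        out_row = []
        for j, own in enumerate(row):
            support = []
            for u in query_adj.get(i, []):
                candidates = [node_lik[u][v] for v in map_adj.get(j, [])]
                if candidates:
                    # Best explanation of neighbor u among the neighbors of j.
                    support.append(max(candidates))
            context = sum(support) / len(support) if support else 1.0
            out_row.append(own * context)
        scores.append(out_row)
    return scores
```

A candidate pair is rewarded only if the surroundings of both nodes also agree, so an ambiguous label on a single object has less influence on the final association.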
[113] RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL
Yinzhou Tang,Yu Shang,Yinuo Chen,Bingwen Wei,Xin Zhang,Shu’ang Yu,Liangzhi Shi,Chao Yu,Chen Gao,Wei Wu,Yong Li
Main category: cs.RO
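TL;DR: RoboScape-R proposes a world-model-based general reward mechanism that addresses the lack of general reward signals in traditional reinforcement learning and substantially improves policy generalization.
Details
Motivation: Traditional imitation learning (IL) and reinforcement learning (RL) are limited in cross-scenario generalization: IL tends to overfit expert trajectories, and RL lacks a general reward signal. World models could serve as universal environment proxies, but they still rely on handcrafted, task-specific rewards.
Contribution: Proposes the RoboScape-R framework, which uses the world model to generate endogenous rewards and thereby provides a general training environment; designs a world-model-based general reward mechanism.
Method: Reward signals are generated by the world model and derive from its intrinsic understanding of state-transition dynamics, avoiding the limitations of handcrafted rewards.
Result: Experiments show that RoboScape-R substantially improves policy generalization, with an average 37.5% performance improvement over baselines in out-of-domain scenarios.
Insight: The world model can serve as a core tool for online policy training; its endogenous reward mechanism is closer to real environment dynamics and offers an efficient, general training approach for reinforcement learning.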
Abstract: Achieving generalizable embodied policies remains a key challenge. Traditional policy learning paradigms, including both Imitation Learning (IL) and Reinforcement Learning (RL), struggle to cultivate generalizability across diverse scenarios. While IL policies often overfit to specific expert trajectories, RL suffers from the inherent lack of a unified and general reward signal necessary for effective multi-scene generalization. We posit that the world model is uniquely capable of serving as a universal environment proxy to address this limitation. However, current world models primarily focus on their ability to predict observations and still rely on task-specific, handcrafted reward functions, thereby failing to provide a truly general training environment. To address this problem, we propose RoboScape-R, a framework leveraging the world model to serve as a versatile, general-purpose proxy for the embodied environment within the RL paradigm. We introduce a novel world model-based general reward mechanism that generates “endogenous” rewards derived from the model’s intrinsic understanding of real-world state transition dynamics. Extensive experiments demonstrate that RoboScape-R effectively addresses the limitations of traditional RL methods by providing an efficient and general training environment that substantially enhances the generalization capability of embodied policies. Our approach offers critical insights into utilizing the world model as an online training strategy and achieves an average 37.5% performance improvement over baselines under out-of-domain scenarios.
[114] Artificial Microsaccade Compensation: Stable Vision for an Ornithopter
Levi Burner,Guido de Croon,Yiannis Aloimonos
Main category: cs.RO
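TL;DR: The paper proposes “Artificial Microsaccade Compensation”, a method that stabilizes video captured by a tailless ornithopter despite high-frequency shaking, and that runs in real time while producing higher-quality results than the warp stabilizer in Adobe Premiere Pro.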
Details
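Motivation: Inspired by biological microsaccades, the paper addresses unstable video from an ornithopter that shakes at 12-20 Hz. Contribution: A real-time video stabilization method based on optimizing a 3D rotation in SO(3), which markedly reduces inter-frame motion.
Method: The method minimizes changes in image intensity by optimizing over a 3D rotation represented in SO(3), yielding distortion-free video stabilization in real time.
Result: Experiments show that the method achieves higher visual quality than Adobe Premiere Pro's warp stabilizer while also running in real time.
Insight: Engineering a counterpart to biological microsaccades shows the potential of efficient motion-compensation algorithms.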
Abstract: Animals with foveated vision, including humans, experience microsaccades, small, rapid eye movements that they are not aware of. Inspired by this phenomenon, we develop a method for “Artificial Microsaccade Compensation”. It can stabilize video captured by a tailless ornithopter that has resisted attempts to use camera-based sensing because it shakes at 12-20 Hz. Our approach minimizes changes in image intensity by optimizing over 3D rotation represented in SO(3). This results in a stabilized video, computed in real time, suitable for human viewing, and free from distortion. When adapted to hold a fixed viewing orientation, up to occasional saccades, it can dramatically reduce inter-frame motion while also benefiting from an efficient recursive update. When compared to Adobe Premiere Pro’s warp stabilizer, which is widely regarded as the best commercial video stabilization software available, our method achieves higher quality results while also running in real time.
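The intensity-minimizing rotation described in this entry can be written as a photometric least-squares problem. This is a hedged sketch of one standard formulation, not the authors' exact objective: for a camera that only rotates, pixels move under the homography $K R K^{-1}$. The intrinsics matrix $K$, the perspective projection $\pi$, the pixel set $\Omega$, the homogeneous pixel coordinate $\tilde{x}$, and the reference image $I_{\mathrm{ref}}$ are all assumed notation that does not appear in the abstract.

```latex
\hat{R}_t \;=\; \arg\min_{R \in SO(3)} \sum_{x \in \Omega}
  \Bigl( I_t\bigl(\pi(K R K^{-1} \tilde{x})\bigr) - I_{\mathrm{ref}}(x) \Bigr)^{2}
```

Under this reading, warping frame $I_t$ by the estimated $\hat{R}_t$ is a pure rotation of the viewing direction, which is consistent with the abstract's claim that the stabilized output is free from distortion.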
cs.LG [Back]
[115] SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning
Salman Rahman,Sruthi Gorantla,Arpit Gupta,Swastik Roy,Nanyun Peng,Yang Liu
Main category: cs.LG
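TL;DR: SPARK is a reference-free, three-stage reinforcement learning framework in which generator and verifier models produce dense step-level reward signals, substantially improving performance on mathematical reasoning tasks.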
Details
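Motivation: Conventional process reward models (PRMs) need expensive step-level annotations or ground-truth references, which limits their adoption. SPARK aims to remove this requirement by having a generator and a verifier work together. Contribution: Proposes the SPARK framework, which combines a generator, a verifier, and synthetic data to train PRMs, and outperforms ground-truth-based methods on mathematical reasoning.
Method: Three stages: (1) a generator produces diverse solutions and a verifier evaluates them; (2) the verification outputs are used as synthetic data to fine-tune generative PRMs; (3) the generative PRM with chain-of-thought verification (PRM-CoT) serves as the reward model in RL, with format constraints to prevent reward hacking.
Result: 67.5 F1 on ProcessBench, ahead of reference-guided training (66.4) and GPT-4o (61.9); 47.4% average accuracy across six mathematical reasoning benchmarks with Qwen2.5-Math-7B, ahead of ground-truth-based RLVR (43.9%).
Insight: PRMs trained on synthetic verification data can outperform those trained with ground-truth supervision, which opens up training options for domains that lack reference answers.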
Abstract: Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning, yet their adoption remains limited by the need for expensive step-level annotations or ground truth references. We propose SPARK: a three-stage framework where in the first stage a generator model produces diverse solutions and a verifier model evaluates them using parallel scaling (self-consistency) and sequential scaling (meta-critique). In the second stage, we use these verification outputs as synthetic training data to fine-tune generative process reward models, which subsequently serve as reward signals during training. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpass ground-truth outcome supervision, achieving 67.5 F1 on ProcessBench (a benchmark for identifying erroneous steps in mathematical reasoning) compared to 66.4 for reference-guided training and 61.9 for GPT-4o. In the final stage, we apply our generative PRM with chain-of-thought verification (PRM-CoT) as the reward model in RL experiments on mathematical reasoning, and introduce format constraints to prevent reward hacking. Using Qwen2.5-Math-7B, we achieve 47.4% average accuracy across six mathematical reasoning benchmarks, outperforming ground-truth-based RLVR (43.9%). Our work enables reference-free RL training that exceeds ground-truth methods, opening new possibilities for domains lacking verifiable answers or accessible ground truth.
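The step-level aggregation of independent verifications mentioned in this abstract can be illustrated with a small sketch. It assumes a plain majority vote per step with ties resolved to "incorrect"; the abstract does not specify the aggregation rule, and the function names `aggregate_step_labels` and `first_error_step` are hypothetical.

```python
from collections import Counter


def aggregate_step_labels(verifications):
    """Majority-vote step labels across independent verifier runs.

    verifications: list of runs; each run is a list of booleans,
    one per solution step (True means the step was judged correct).
    The voting rule is an assumption, not taken from the paper.
    """
    if not verifications:
        raise ValueError("need at least one verification run")
    num_steps = len(verifications[0])
    if any(len(run) != num_steps for run in verifications):
        raise ValueError("all runs must label the same number of steps")

    labels = []
    for step in range(num_steps):
        votes = Counter(run[step] for run in verifications)
        # Ties resolve to "incorrect": a conservative, assumed choice.
        labels.append(votes[True] > votes[False])
    return labels


def first_error_step(labels):
    """Index of the first step labelled incorrect, or -1 if none."""
    for index, is_correct in enumerate(labels):
        if not is_correct:
            return index
    return -1


runs = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
]
labels = aggregate_step_labels(runs)
print(labels, first_error_step(labels))
```

Labels produced this way could serve as synthetic training targets for a PRM; the first-error index matches the kind of output that an erroneous-step benchmark such as ProcessBench asks for.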
[116] Energy-Efficient Federated Learning via Adaptive Encoder Freezing for MRI-to-CT Conversion: A Green AI-Guided Research
Ciro Benito Raggio,Lucia Migliorelli,Nils Skupien,Mathias Krohmer Zabaleta,Oliver Blanck,Francesco Cicone,Giuseppe Lucio Cascini,Paolo Zaffino,Maria Francesca Spadea
Main category: cs.LG
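TL;DR: The paper proposes an energy-efficient federated learning strategy that adaptively freezes encoder weights, cutting training time and carbon emissions by up to 23% while maintaining MRI-to-CT conversion performance.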
Details
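Motivation: Federated learning is promising for healthcare, but its high resource requirements exclude institutions with limited computational infrastructure and widen healthcare disparities. The authors therefore propose an energy-saving method that reduces the computational load. Contribution: A Green AI-oriented adaptive layer-freezing strategy that reduces training energy consumption and carbon emissions without degrading model performance.
Method: The strategy monitors the relative difference of the encoder weights from round to round and selectively freezes the encoder; a patience-based mechanism ensures that freezing happens only when updates stay consistently small. Energy consumption and CO2eq emissions are tracked with the CodeCarbon library.
Result: Compared with non-frozen counterparts, training time, total energy consumption, and CO2eq emissions fell by up to 23%. MRI-to-CT conversion performance was maintained, with only small variations in MAE: three of the five architectures showed no statistically significant difference, and two showed statistically significant improvements.
Insight: The work offers an energy-saving approach for medical AI, advances privacy, equity, and sustainability, and lays the groundwork for evaluation frameworks for green federated learning.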
Abstract: Federated Learning (FL) holds the potential to advance equality in health by enabling diverse institutions to collaboratively train deep learning (DL) models, even with limited data. However, the significant resource requirements of FL often exclude centres with limited computational infrastructure, further widening existing healthcare disparities. To address this issue, we propose a Green AI-oriented adaptive layer-freezing strategy designed to reduce energy consumption and computational load while maintaining model performance. We tested our approach using different federated architectures for Magnetic Resonance Imaging (MRI)-to-Computed Tomography (CT) conversion. The proposed adaptive strategy optimises the federated training by selectively freezing the encoder weights based on the monitored relative difference of the encoder weights from round to round. A patience-based mechanism ensures that freezing only occurs when updates remain consistently minimal. The energy consumption and CO2eq emissions of the federation were tracked using the CodeCarbon library. Compared to equivalent non-frozen counterparts, our approach reduced training time, total energy consumption and CO2eq emissions by up to 23%. At the same time, the MRI-to-CT conversion performance was maintained, with only small variations in the Mean Absolute Error (MAE). Notably, for three out of the five evaluated architectures, no statistically significant differences were observed, while two architectures exhibited statistically significant improvements. Our work aligns with a research paradigm that promotes DL-based frameworks meeting clinical requirements while ensuring climatic, social, and economic sustainability. It lays the groundwork for novel FL evaluation frameworks, advancing privacy, equity and, more broadly, justice in AI-driven healthcare.
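The patience-based freezing rule described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the relative difference is taken as an L2-norm ratio, the threshold and patience values are placeholders, weights are flattened into a plain list of floats, and the names `relative_change` and `EncoderFreezeMonitor` are hypothetical.

```python
import math


def relative_change(prev, curr, eps=1e-12):
    """L2 norm of the weight delta divided by the L2 norm of prev.

    The choice of norm is an assumption; the abstract only mentions
    a monitored relative difference between rounds.
    """
    delta = math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))
    base = math.sqrt(sum(p * p for p in prev))
    return delta / (base + eps)


class EncoderFreezeMonitor:
    """Freeze the encoder after `patience` consecutive small updates."""

    def __init__(self, threshold=0.01, patience=3):
        self.threshold = threshold  # assumed value
        self.patience = patience    # assumed value
        self.small_rounds = 0
        self.frozen = False
        self.prev = None

    def update(self, encoder_weights):
        """Call once per federated round; returns True once frozen."""
        if self.frozen:
            return True
        if self.prev is not None:
            change = relative_change(self.prev, encoder_weights)
            if change < self.threshold:
                self.small_rounds += 1
            else:
                # A large update resets the patience counter.
                self.small_rounds = 0
            if self.small_rounds >= self.patience:
                self.frozen = True
        self.prev = list(encoder_weights)
        return self.frozen


monitor = EncoderFreezeMonitor(threshold=0.01, patience=2)
rounds = [[1.0, 1.0], [1.5, 1.5], [1.501, 1.501], [1.5011, 1.5011]]
print([monitor.update(weights) for weights in rounds])
```

Once `update` returns True, a training loop would stop computing gradients for the encoder, which is where the reported savings in time and energy would come from.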
cs.SE [Back]
[117] Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
Songwen Zhao,Danqing Wang,Kexun Zhang,Jiaxuan Luo,Zhuo Li,Lei Li
Main category: cs.SE
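TL;DR: The paper introduces SusVibes, a benchmark for evaluating the security of vibe coding (code generated by LLM agents), and finds serious security risks for security-sensitive applications.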
Details
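Motivation: Vibe coding is an emerging programming paradigm, but the security of the code it produces has not been adequately verified. Contribution: Proposes SusVibes, a benchmark of 200 real-world tasks, and provides the first systematic evaluation of vibe-coding security.
Method: Evaluates code generated by several widely used LLM coding agents on the benchmark for both functional correctness and security, and tests preliminary security strategies.
Result: With SWE-Agent and Claude 4 Sonnet, 61% of solutions are functionally correct but only 10.5% are secure; preliminary security strategies, such as adding vulnerability hints to the feature request, do not mitigate the problem.
Insight: Vibe coding carries major risks in security-sensitive settings and needs further research.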
Abstract: Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision. Although it is increasingly adopted, are vibe coding outputs really safe to deploy in production? To answer this question, we propose SusVibes, a benchmark consisting of 200 feature-request software engineering tasks from real-world open-source projects, which, when given to human programmers, led to vulnerable implementations. We evaluate multiple widely used coding agents with frontier models on this benchmark. Disturbingly, all agents perform poorly in terms of software security. Although 61% of the solutions from SWE-Agent with Claude 4 Sonnet are functionally correct, only 10.5% are secure. Further experiments demonstrate that preliminary security strategies, such as augmenting the feature request with vulnerability hints, cannot mitigate these security issues. Our findings raise serious concerns about the widespread adoption of vibe-coding, particularly in security-sensitive applications.
cs.IR [Back]
[118] M3DR: Towards Universal Multilingual Multimodal Document Retrieval
Adithya S Kolavi,Vyoman Jain
Main category: cs.IR
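TL;DR: M3DR is a multilingual multimodal document retrieval framework that addresses the English-centric bias of existing systems. It uses synthetic multilingual data and contrastive training to achieve cross-lingual and cross-modal alignment, and is validated on 22 languages.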
Details
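Motivation: Existing multimodal document retrieval systems mainly target English, which limits their effectiveness in multilingual settings. M3DR aims to close this gap and improve applicability across diverse linguistic and cultural contexts. Contribution: A general multilingual multimodal document retrieval framework that supports cross-lingual alignment and cross-modal retrieval, together with a new benchmark and support for the multi-vector retrieval paradigm.
Method: M3DR uses synthetic multilingual document data and contrastive training to learn unified representations of text and document images; the approach applies across different vision-language architectures and model sizes.
Result: Experiments on 22 languages show that M3DR adapts to linguistic and script variation; NetraEmbed and ColNetraEmbed achieve roughly 150% relative improvement on cross-lingual retrieval.
Insight: M3DR shows the value of multilingual data and flexible retrieval paradigms for multimodal retrieval, and provides a benchmark and framework for future multilingual multimodal research.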
Abstract: Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages, enabling applicability across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that captures real-world multilingual scenarios, evaluating models under monolingual, multilingual, and mixed-language settings. M3DR generalizes across both single dense vector and ColBERT-style token-level multi-vector retrieval paradigms. Our models, NetraEmbed and ColNetraEmbed, achieve state-of-the-art performance with ~150% relative improvements on cross-lingual retrieval.
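The two retrieval paradigms named in this abstract differ in how a query is scored against a document. The sketch below contrasts a single dense-vector dot product with ColBERT-style late interaction (sum over query tokens of the maximum similarity to any document token). It is a generic illustration in pure Python, not code from M3DR; the function names and toy vectors are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def dense_score(query_vec, doc_vec):
    """Single dense vector paradigm: one embedding per query and document."""
    return dot(query_vec, doc_vec)


def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction.

    For each query token embedding, take the maximum similarity
    over all document token embeddings, then sum over query tokens.
    """
    return sum(
        max(dot(q, d) for d in doc_tokens)
        for q in query_tokens
    )


# Toy embeddings, invented for illustration only.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[0.5, 0.0], [0.0, 2.0], [1.0, 1.0]]

print(dense_score([1.0, 2.0], [3.0, 4.0]))     # 11.0
print(maxsim_score(query_tokens, doc_tokens))  # 3.0
```

The multi-vector form keeps one embedding per token or image patch, so it costs more storage than a single dense vector but lets each query token match its best-fitting region of the document.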