Table of Contents
- cs.CL [Total: 21]
- cs.CV [Total: 87]
- eess.AS [Total: 1]
- cs.MM [Total: 2]
- cs.RO [Total: 3]
- eess.IV [Total: 2]
- cs.LG [Total: 5]
- cs.SD [Total: 1]
- cs.AI [Total: 3]
- cs.CY [Total: 2]
- q-bio.QM [Total: 1]
cs.CL
[1] Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability
Hen-Hsen Huang
Main category: cs.CL
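TL;DR: This paper proposes a research agenda for extending LLM efficiency methods beyond hyperscale providers to ordinary institutions, emphasizing simple, robust approaches that work under modest resources and minimal expertise.
Details
Motivation: Current efficiency methods (mixture-of-experts, speculative decoding, and complex retrieval-augmented generation) suit only resource-rich hyperscale providers, so ordinary institutions cannot benefit from them. The paper calls for simpler, more universal approaches that lower deployment cost and carbon emissions.
Contribution: A new research agenda covering efficient architectural retrofits for pretrained models, lightweight fine-tuning methods, economical inference techniques, dynamic knowledge management, and Overhead-Aware Efficiency (OAE) as a new benchmark.
Method: The proposed directions are retrofitting pretrained models with more efficient architectures without retraining, designing lightweight fine-tuning techniques, making long reasoning chains economical, and enabling dynamic knowledge management without complex pipelines.
Result: The paper reports no specific experimental results; it advocates a new research agenda to lower the barrier to LLM deployment and reduce carbon emissions.
Insight: Efficiency should not target hyperscale providers alone; it should also account for accessibility and sustainability. Redefining the efficiency standard can narrow the resource gap and reduce the carbon footprint.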
Abstract: Large language models (LLMs) have become indispensable, but the most celebrated efficiency methods – mixture-of-experts (MoE), speculative decoding, and complex retrieval-augmented generation (RAG) – were built for hyperscale providers with vast infrastructure and elite teams. Outside that context, their benefits collapse into overhead, fragility, and wasted carbon. The result is that a handful of Big Tech companies benefit, while thousands of hospitals, schools, governments, and enterprises are left without viable options. We argue that the next frontier is not greater sophistication at scale, but robust simplicity: efficiency that thrives under modest resources and minimal expertise. We propose a new research agenda: retrofitting pretrained models with more efficient architectures without retraining, inventing lightweight fine-tuning that preserves alignment, making reasoning economical despite long chains of thought, enabling dynamic knowledge management without heavy RAG pipelines, and adopting Overhead-Aware Efficiency (OAE) as a standard benchmark. By redefining efficiency to include adoption cost, sustainability, and fairness, we can democratize LLM deployment – ensuring that optimization reduces inequality and carbon waste rather than amplifying them.
[2] A centroid based framework for text classification in itsm environments
Hossein Mohanna,Ali Ait-Bachir
Main category: cs.CL
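TL;DR: The paper proposes a dual-embedding centroid-based text classification framework for hierarchical classification in IT Service Management (ITSM). It combines semantic and lexical representations and is both efficient and interpretable.
Details
Motivation: ITSM environments require support tickets to be classified into hierarchical taxonomies, and existing methods such as Support Vector Machines fall short on efficiency and interpretability.
Contribution: 1. A dual-embedding centroid framework that uses semantic and lexical representations together. 2. Efficient training and incremental updates. 3. Classification performance on par with SVM, with better interpretability.
Method: Each category is represented by two centroids (semantic and lexical), and the two resulting rankings are combined at inference time through reciprocal rank fusion.
Result: Evaluated on 8,968 ITSM tickets, the method reaches a hierarchical F1 of 0.731 (versus 0.727 for SVM), with 5.9x faster training, up to 152x faster incremental updates, and 8.6-8.8x faster batch processing when embedding computation is excluded.
Insight: Combining semantic and lexical information balances classification performance and interpretability; the dual-centroid approach is especially suited to production environments that need efficiency and transparency.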
Abstract: Text classification with hierarchical taxonomies is a fundamental requirement in IT Service Management (ITSM) systems, where support tickets must be categorized into tree-structured taxonomies. We present a dual-embedding centroid-based classification framework that maintains separate semantic and lexical centroid representations per category, combining them through reciprocal rank fusion at inference time. The framework achieves performance competitive with Support Vector Machines (hierarchical F1: 0.731 vs 0.727) while providing interpretability through centroid representations. Evaluated on 8,968 ITSM tickets across 123 categories, this method achieves 5.9 times faster training and up to 152 times faster incremental updates. With 8.6-8.8 times speedup across batch sizes (100-1000 samples) when excluding embedding computation. These results make the method suitable for production ITSM environments prioritizing interpretability and operational efficiency.
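The fusion step mentioned in this abstract can be illustrated with a minimal sketch of generic reciprocal rank fusion. This is not the authors' implementation: the function name, the category labels, and the smoothing constant `k=60` (a commonly used default) are assumptions made for illustration, and the paper's actual settings are not given in the abstract.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of labels into a single ranking.

    Each label scores 1 / (k + rank) in every list where it appears
    (rank starts at 1); labels are returned by descending total score.
    k=60 is an assumed default, not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, label in enumerate(ranking, start=1):
            scores[label] += 1.0 / (k + rank)
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(scores, key=lambda label: (-scores[label], label))


# Hypothetical category rankings from a semantic and a lexical centroid match.
semantic_ranking = ["network", "hardware", "access"]
lexical_ranking = ["network", "email", "hardware"]
fused = reciprocal_rank_fusion([semantic_ranking, lexical_ranking])
print(fused)
```

Because the fusion uses only ranks, the semantic and lexical similarity scores never need to be put on a common scale, which is one plausible reason to prefer it over score averaging.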
[3] Structured Definitions and Segmentations for Legal Reasoning in LLMs: A Study on Indian Legal Data
Mann Khatri,Mirza Yusuf,Rajiv Ratn Shah,Ponnurangam Kumaraguru
Main category: cs.CL
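TL;DR: This paper studies how to improve the legal reasoning of large language models (LLMs) in a zero-shot setting, focusing on Indian legal judgment prediction. Structuring documents, defining legal terms, and emulating the reasoning steps of courts improve model performance.
Details
Motivation: LLMs perform well in general domains but struggle in specialized ones such as law, mainly because they lack domain-specific pretraining and because legal documents are long and intricate. The paper aims to improve legal reasoning through structured inputs and added domain knowledge.
Contribution: (1) Reorganizing legal documents to structure their information, (2) defining legal rhetorical roles to improve understanding of terminology, and (3) emulating the step-by-step reasoning of courts.
Method: Three approaches are studied experimentally: (1) reorganizing documents by rhetorical role, (2) defining the rhetorical roles, and (3) emulating court reasoning steps. Experiments run in a zero-shot setting on three Indian legal judgment prediction datasets.
Result: Organizing the data or explaining key legal terms significantly improves performance, with F1 gains ranging from about 1.5% to 4.36% over the baseline.
Insight: Structured information and added domain knowledge play a key role in improving LLM performance in specialized domains, particularly on tasks involving long, complex documents.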
Abstract: Large Language Models (LLMs), trained on extensive datasets from the web, exhibit remarkable general reasoning skills. Despite this, they often struggle in specialized areas like law, mainly because they lack domain-specific pretraining. The legal field presents unique challenges, as legal documents are generally long and intricate, making it hard for models to process the full text efficiently. Previous studies have examined in-context approaches to address the knowledge gap, boosting model performance in new domains without full domain alignment. In our paper, we analyze model behavior on legal tasks by conducting experiments in three areas: (i) reorganizing documents based on rhetorical roles to assess how structured information affects long context processing and model decisions, (ii) defining rhetorical roles to familiarize the model with legal terminology, and (iii) emulating the step-by-step reasoning of courts regarding rhetorical roles to enhance model reasoning. These experiments are conducted in a zero-shot setting across three Indian legal judgment prediction datasets. Our results reveal that organizing data or explaining key legal terms significantly boosts model performance, with a minimum increase of ~1.5% and a maximum improvement of 4.36% in F1 score compared to the baseline.
[4] Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes
Matthew W. Kenaston,Umair Ayub,Mihir Parmar,Muhammad Umair Anjum,Syed Arsalan Ahmed Naqvi,Priya Kumar,Samarth Rawal,Aadel A. Chaudhuri,Yousef Zakharia,Elizabeth I. Heath,Tanios S. Bekaii-Saab,Cui Tao,Eliezer M. Van Allen,Ben Zhou,YooJung Choi,Chitta Baral,Irbaz Bin Riaz
Main category: cs.CL
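TL;DR: The paper studies cognitive bias in LLM reasoning over clinical oncology notes and finds that, despite strong benchmark performance, conclusions reached through faulty reasoning can make clinical decisions unsafe.
Details
Motivation: When LLMs are applied in clinical oncology, reasoning errors can lead to wrong clinical recommendations and compromise patient safety. Conventional accuracy-based evaluation does not capture this failure mode.
Contribution: A three-tier hierarchical taxonomy that maps language model reasoning errors to cognitive bias frameworks, with its clinical relevance validated experimentally.
Method: Using breast and pancreatic cancer notes from the CORAL dataset, the authors annotated 600 reasoning traces to define the taxonomy, then validated it on 822 responses to prostate cancer consult notes.
Result: Reasoning errors occurred in 23% of interpretations and dominated overall errors. Confirmation bias and anchoring bias were the most common, and these errors were associated with guideline-discordant and potentially harmful recommendations.
Insight: The fluent output of LLMs can mask reasoning flaws, so reasoning reliability should be evaluated and improved before clinical deployment. Automated evaluators can detect that an error is present but cannot yet reliably classify error subtypes.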
Abstract: Despite high performance on clinical benchmarks, large language models may reach correct conclusions through faulty reasoning, a failure mode with safety implications for oncology decision support that is not captured by accuracy-based evaluation. In this two-cohort retrospective study, we developed a hierarchical taxonomy of reasoning errors from GPT-4 chain-of-thought responses to real oncology notes and tested its clinical relevance. Using breast and pancreatic cancer notes from the CORAL dataset, we annotated 600 reasoning traces to define a three-tier taxonomy mapping computational failures to cognitive bias frameworks. We validated the taxonomy on 822 responses from prostate cancer consult notes spanning localized through metastatic disease, simulating extraction, analysis, and clinical recommendation tasks. Reasoning errors occurred in 23 percent of interpretations and dominated overall errors, with confirmation bias and anchoring bias most common. Reasoning failures were associated with guideline-discordant and potentially harmful recommendations, particularly in advanced disease management. Automated evaluators using state-of-the-art language models detected error presence but could not reliably classify subtypes. These findings show that large language models may provide fluent but clinically unsafe recommendations when reasoning is flawed. The taxonomy provides a generalizable framework for evaluating and improving reasoning fidelity before clinical deployment.
[5] Structured Prompting Enables More Robust, Holistic Evaluation of Language Models
Asad Aali,Muhammad Ahmed Mohsin,Vasiliki Bikia,Arnav Singhvi,Richard Gaus,Suhana Bedi,Hejie Cui,Miguel Fuentes,Alyssa Unell,Yifan Mai,Jordan Cahoon,Michael Pfeffer,Roxana Daneshjou,Sanmi Koyejo,Emily Alsentzer,Percy Liang,Christopher Potts,Nigam H. Shah,Akshay S. Chaudhari
Main category: cs.CL
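TL;DR: The paper presents a framework combining DSPy and HELM that uses structured prompting to evaluate language models more accurately. It finds that the standard HELM setup underestimates model performance, while structured prompting reduces evaluation bias and makes results more consistent.
Details
Motivation: As language models are widely adopted, conventional evaluation frameworks such as HELM rely on fixed prompts, which yields inaccurate performance estimates. A more robust method is needed to estimate what models can actually achieve.
Contribution: 1. A DSPy+HELM framework and the first large-scale evaluation of structured prompting methods. 2. Evidence that conventional evaluation underestimates model performance and gives unstable results. 3. An open-source integration and prompt optimization pipeline.
Method: DSPy's declarative prompting framework is combined with HELM benchmarks; four prompting methods (including chain-of-thought) are used to evaluate four frontier language models on seven benchmarks.
Result: Without structured prompting, 1. HELM underestimates model performance by 4% on average; 2. performance estimates vary more across benchmarks (+2% standard deviation); 3. leaderboard rankings flip on 3 of 7 benchmarks. Introducing chain-of-thought reasoning reduces sensitivity to prompt design.
Insight: Structured prompting estimates a model's performance ceiling more accurately and reduces instability in evaluation, giving benchmarks that are more useful for deployment decisions.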
Abstract: As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. While frameworks such as Holistic Evaluation of Language Models (HELM) enable broad evaluation across tasks, they often rely on fixed prompts that fail to generalize across LMs, yielding unrepresentative performance estimates. Unless we estimate each LM’s ceiling (maximum achievable via changes to the prompt), we risk underestimating performance. Declarative prompting frameworks, such as DSPy, offer a scalable alternative to manual prompt engineering by crafting structured prompts that can be optimized per task. However, such frameworks have not been systematically evaluated across established benchmarks. We present a reproducible DSPy+HELM framework that introduces structured prompting methods which elicit reasoning, enabling more accurate LM benchmarking. Using four prompting methods, we evaluate four frontier LMs across seven benchmarks (general/medical domain) against existing HELM baseline scores. We find that without structured prompting: (i) HELM underestimates LM performance (by 4% average), (ii) performance estimates vary more across benchmarks (+2% standard deviation), (iii) performance gaps are misrepresented (leaderboard rankings flip on 3/7 benchmarks), and (iv) introducing reasoning (chain-of-thought) reduces LM sensitivity to prompt design (smaller Δ across prompts). To our knowledge, this is the first large-scale benchmarking study to empirically characterize LM behavior across benchmarks and prompting methods, showing that scalable performance ceiling estimation enables more decision-useful benchmarks. We open-source (i) DSPy+HELM Integration (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).
[6] Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Tianxin Wei,Noveen Sachdeva,Benjamin Coleman,Zhankui He,Yuanchen Bei,Xuying Ning,Mengting Ai,Yunzhe Li,Jingrui He,Ed H. Chi,Chi Wang,Shuo Chen,Fernando Pereira,Wang-Cheng Kang,Derek Zhiyuan Cheng
Main category: cs.CL
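TL;DR: Evo-Memory is a streaming benchmark and framework for evaluating self-evolving memory in LLM agents, filling a gap in existing evaluations, which do not measure how memory is accumulated and reused across dynamic task streams.
Details
Motivation: Existing evaluations focus on static conversational settings and overlook the ability of LLMs to accumulate and reuse memory across continuous task streams. The paper addresses the need for LLM agents to keep learning and updating memory in dynamic environments.
Contribution: 1. Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. 2. A unified evaluation of more than ten representative memory modules, plus the baseline ExpRAG and the proposed ReMem pipeline for continual improvement.
Method: 1. Datasets are structured as sequential task streams that require the LLM to search, adapt, and evolve its memory after each interaction. 2. ExpRAG retrieves and uses prior experience. 3. ReMem tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.
Result: Memory evolution is evaluated on 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets; the summary reports that ReMem is effective on dynamic task streams.
Insight: Dynamic memory management is key to long-term planning and problem-solving in LLM agents, and current static evaluations do not reflect the needs of real-world settings. Evo-Memory provides a benchmark and methodology for this.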
Abstract: Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.
[7] Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation
Ali Jahan,Masood Ghayoomi,Annette Hautli-Janisz
Main category: cs.CL
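TL;DR: The paper studies cross-lingual approaches to argument mining for a low-resource language (Persian). Across three training scenarios, the cross-lingual model outperforms both zero-shot transfer and LLM-based augmentation.
Details
Motivation: Low-resource languages such as Persian lack data for argument mining; the paper explores how effective cross-lingual methods are at closing this gap.
Contribution: A cross-lingual model shown to be superior for a low-resource language, surpassing zero-shot transfer and LLM-based augmentation.
Method: Three training scenarios are designed (zero-shot transfer, LLM-based augmentation, and a cross-lingual blended model) and evaluated on English and Persian data.
Result: The cross-lingual model achieves an F1 of 74.8% on the Persian test set, ahead of LLM-based augmentation (69.3%) and zero-shot transfer (50.7%).
Insight: A lightweight cross-lingual blend is more effective than resource-intensive LLM augmentation and offers a practical path for argument mining in low-resource languages.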
Abstract: Argument mining is a subfield of natural language processing to identify and extract the argument components, like premises and conclusions, within a text and to recognize the relations between them. It reveals the logical structure of texts to be used in tasks like knowledge extraction. This paper aims at utilizing a cross-lingual approach to argument mining for low-resource languages, by constructing three training scenarios. We examine the models on English, as a high-resource language, and Persian, as a low-resource language. To this end, we evaluate the models based on the English Microtext corpus (Peldszus and Stede, 2015), and its parallel Persian translation. The learning scenarios are as follows: (i) zero-shot transfer, where the model is trained solely with the English data, (ii) English-only training enhanced by synthetic examples generated by Large Language Models (LLMs), and (iii) a cross-lingual model that combines the original English data with manually translated Persian sentences. The zero-shot transfer model attains F1 scores of 50.2% on the English test set and 50.7% on the Persian test set. The LLM-based augmentation model improves the performance up to 59.2% on English and 69.3% on Persian. The cross-lingual model, trained on both languages but evaluated solely on the Persian test set, surpasses the LLM-based variant, achieving an F1 of 74.8%. Results indicate that a lightweight cross-lingual blend can considerably outperform the more resource-intensive augmentation pipelines, and it offers a practical pathway for the argument mining task to overcome data resource shortage on low-resource languages.
[8] Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines
Yuhang Wang,Yanxu Zhu,Dongyuan Lu,Jitao Sang
Main category: cs.CL
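TL;DR: The paper proposes the SGASA framework, which uses internally generated safety guidelines to strengthen reasoning models against adversarial attacks while reducing unnecessary refusals of benign requests.
Details
Motivation: Because adversarial prompts are covert and deceptive, they often bypass existing safety mechanisms and lead to harmful content. An adaptive approach is needed so that models can reinforce their own defenses.
Contribution: The SGASA framework, which combines data pre-synthesis with alignment fine-tuning to improve model safety and adaptability.
Method: SGASA consists of Data Pre-synthesis (generating safety guidelines and augmented prompts) and Alignment Fine-tuning (using SFT and DPO to embed the guidelines into the model).
Result: Experiments across multiple datasets show that SGASA significantly improves model safety, supporting its adaptability and scalability.
Insight: Models can strengthen their defenses using safety guidelines they generate themselves while refusing fewer benign requests, which offers a new direction for safety alignment.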
Abstract: Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Due to the covert and deceptive nature of such prompts, they can often evade built-in safety mechanisms and lead to the generation of harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen models’ ability to enhance robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts; and Alignment Fine-tuning, which leverages Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed these guidelines into the model. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.
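The abstract names Direct Preference Optimization (DPO) as one of the fine-tuning objectives. As a reminder of what that objective computes, here is a minimal sketch of the standard per-pair DPO loss on sequence log-probabilities. It is not SGASA's training code: the function name, the example log-probabilities, and `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under a frozen reference model.
    beta=0.1 is an assumed value, not one reported in the abstract.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); stable form below.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical values: the policy has not moved from the reference model...
loss_at_init = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# ...versus a policy that now prefers the chosen (safe) response.
loss_after = dpo_loss(-0.5, -2.0, -1.0, -1.0)
print(loss_at_init, loss_after)
```

In a safety setting such as the one described, the chosen response would typically be the guideline-following answer and the rejected response the unsafe or over-refusing one, though the abstract does not spell out how the pairs are built.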
[9] Can Finetuing LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?
Steven Wang,Kyle Hunt,Shaojie Tang,Kenneth Joseph
Main category: cs.CL
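TL;DR: The paper asks whether fine-tuning large language models (LLMs) on small human samples improves heterogeneity, alignment, and belief-action coherence. Fine-tuning helps on all three, but the models still cannot fully replace human participants.
Details
Motivation: Whether LLMs can substitute for human participants in experiments and surveys is debated, particularly given their limited diversity, misalignment, and weak belief-action coherence. The paper tests whether fine-tuning on a small human sample mitigates these problems.
Contribution: Evidence that fine-tuning substantially improves heterogeneity, alignment, and belief-action coherence, together with the limitation that LLM-generated data still cannot replace human data in formal inferential analysis.
Method: Using a behavioral experiment on information disclosure, human and LLM-generated responses are compared along several dimensions: distributional divergence, subgroup alignment, belief-action coherence, and recovery of regression coefficients.
Result: Fine-tuning substantially improves heterogeneity, alignment, and belief-action coherence relative to the base model, but fails to reproduce the regression coefficients of the original study, suggesting that LLM data remain unsuitable for formal inferential analysis.
Insight: Fine-tuning on a small sample improves LLM behavior, but the simulated data remain limited and should not fully replace human participants, especially in formal statistical inference.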
Abstract: There is ongoing debate about whether large language models (LLMs) can serve as substitutes for human participants in survey and experimental research. While recent work in fields such as marketing and psychology has explored the potential of LLM-based simulation, a growing body of evidence cautions against this practice: LLMs often fail to align with real human behavior, exhibiting limited diversity, systematic misalignment for minority subgroups, insufficient within-group variance, and discrepancies between stated beliefs and actions. This study examines an important and distinct question in this domain: whether fine-tuning on a small subset of human survey data, such as that obtainable from a pilot study, can mitigate these issues and yield realistic simulated outcomes. Using a behavioral experiment on information disclosure, we compare human and LLM-generated responses across multiple dimensions, including distributional divergence, subgroup alignment, belief-action coherence, and the recovery of regression coefficients. We find that fine-tuning on small human samples substantially improves heterogeneity, alignment, and belief-action coherence relative to the base model. However, even the best-performing fine-tuned models fail to reproduce the regression coefficients of the original study, suggesting that LLM-generated data remain unsuitable for replacing human participants in formal inferential analyses.
[10] Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation
Zhifeng Hao,Qibin Song,Ruichu Cai,Boyan Xu
Main category: cs.CL
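TL;DR: DSR-SQL introduces a dual-state reasoning framework in which an adaptive context state and a progressive generation state interact, addressing limited context capacity, unreliable schema linking, and weak grounding in database semantics in Text-to-SQL over complex enterprise databases.
Details
Motivation: Existing divide-and-conquer reasoning methods, particularly those based on Chain-of-Thought, struggle to reason coherently over complex enterprise databases because of limited context capacity, unreliable schema linking, and weak grounding in database semantics.
Contribution: DSR-SQL, a dual-state reasoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state, improving performance on complex databases.
Method: The first state (adaptive context) builds a compact, semantically faithful environment by refining large schemas and selecting relevant structures. The second state (progressive generation) formalizes SQL synthesis as feedback-guided state transitions, allowing self-correction and alignment with user intent.
Result: Without post-training or in-context examples, DSR-SQL reaches 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set.
Insight: The dual-state framework addresses the limitations of existing methods and shows one way to handle Text-to-SQL effectively on complex databases.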
Abstract: Recent divide-and-conquer reasoning approaches, particularly those based on Chain-of-Thought (CoT), have substantially improved the Text-to-SQL capabilities of Large Language Models (LLMs). However, when applied to complex enterprise databases, such methods struggle to maintain coherent reasoning due to limited context capacity, unreliable schema linking, and weak grounding in database semantics. To overcome these issues, we introduce DSR-SQL, a Dual-State Reasoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state. The first constructs a compact, semantically faithful environment by refining large schemas and selecting relevant structures, while the second formalizes SQL synthesis as feedback-guided state transitions, enabling the model to self-correct and align with user intent. Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on BIRD development set. Our implementation will be open-sourced at: https://github.com/DMIRLAB-Group/DSR-SQL.
[11] Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning
Kaifeng Hong,Yinglong Zhang,Xiaoying Hong,Xuewen Xia,Xing Xu
Main category: cs.CL
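TL;DR: Odin is a new architecture that injects graph structure into selected Transformer layers through an oriented dual-module mechanism. It avoids the over-smoothing of conventional GNNs and the topology blindness of Transformers, and achieves state-of-the-art accuracy on multiple text-rich graph benchmarks.
Details
Motivation: Among existing methods, GNNs are limited by over-smoothing and hop-dependent diffusion, while Transformers overlook graph topology. Odin aims to combine the strengths of both.
Contribution: The Odin architecture, which injects graph structure into Transformer layers through an oriented dual-module mechanism and so avoids the limitations of GNNs; and Light Odin, a lightweight variant that preserves performance at lower computational cost.
Method: Odin aggregates over the global [CLS] representation, which avoids over-smoothing. Light Odin keeps the same layer-aligned structural abstraction while making training and inference faster.
Result: Odin achieves state-of-the-art accuracy on multiple text-rich graph benchmarks, and Light Odin remains competitive at significantly reduced computational cost.
Insight: Odin shows that structure can be injected into Transformers, providing a unified, hop-free framework for integrating text and structure.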
Abstract: Text-attributed graphs require models to effectively combine strong textual understanding with structurally informed reasoning. Existing approaches either rely on GNNs, which are limited by over-smoothing and hop-dependent diffusion, or employ Transformers that overlook graph topology and treat nodes as isolated sequences. We propose Odin (Oriented Dual-module INtegration), a new architecture that injects graph structure into Transformers at selected depths through an oriented dual-module mechanism. Unlike message-passing GNNs, Odin does not rely on multi-hop diffusion; instead, multi-hop structures are integrated at specific Transformer layers, yielding low-, mid-, and high-level structural abstraction aligned with the model’s semantic hierarchy. Because aggregation operates on the global [CLS] representation, Odin fundamentally avoids over-smoothing and decouples structural abstraction from neighborhood size or graph topology. We further establish that Odin’s expressive power strictly contains that of both pure Transformers and GNNs. To make the design efficient in large-scale or low-resource settings, we introduce Light Odin, a lightweight variant that preserves the same layer-aligned structural abstraction for faster training and inference. Experiments on multiple text-rich graph benchmarks show that Odin achieves state-of-the-art accuracy, while Light Odin delivers competitive performance with significantly reduced computational cost. Together, Odin and Light Odin form a unified, hop-free framework for principled structure-text integration. The source code of this model has been released at https://github.com/hongkaifeng/Odin.
[12] A Systematic Study of Model Merging Techniques in Large Language Models
Oğuz Kağan Hitit,Leander Girrbach,Zeynep Akata
Main category: cs.CL
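TL;DR: The paper systematically studies model merging techniques for large language models (LLMs). It finds that the simplest method, Task Arithmetic, is the only one that reliably yields gains on LLMs, while more complex methods whose benefits were reported on smaller models and classifiers typically degrade performance.
Details
Motivation: Model merging combines multiple fine-tuned models without additional training, but it is unclear whether existing techniques carry over to LLMs.
Contribution: A large-scale evaluation of six state-of-the-art merging methods, showing that Task Arithmetic is the only method that is reliable on LLMs.
Method: Six merging methods are evaluated across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen benchmarks.
Result: Other methods typically cause significant performance drops; only Task Arithmetic reliably improves performance.
Insight: LLM-specific merging algorithms and merging-aware fine-tuning methods are needed.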
Abstract: Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a merged model outperforms the base model and relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.
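For readers unfamiliar with Task Arithmetic, the idea is to add the weight differences between each fine-tuned checkpoint and the base model ("task vectors") back onto the base weights. The sketch below shows this on toy parameters stored as plain Python lists. It is an illustration under stated assumptions, not the paper's evaluation code: the function name, the parameter name `layer.weight`, and the `scale` values are hypothetical.

```python
def task_arithmetic(base, finetuned_models, scale=1.0):
    """Merge checkpoints as base + scale * sum_i (finetuned_i - base).

    `base` and every entry of `finetuned_models` map a parameter name to a
    flat list of floats with identical shapes. `scale` is the merging
    coefficient; its value here is an assumption, normally tuned on held-out data.
    """
    merged = {}
    for name, base_weights in base.items():
        merged[name] = list(base_weights)
        for model in finetuned_models:
            for i, weight in enumerate(model[name]):
                # Add this checkpoint's task vector, scaled, to the base weight.
                merged[name][i] += scale * (weight - base_weights[i])
    return merged


# Toy checkpoints (hypothetical values).
base = {"layer.weight": [1.0, 2.0]}
finetuned_a = {"layer.weight": [1.5, 2.0]}
finetuned_b = {"layer.weight": [1.0, 3.0]}

merged_full = task_arithmetic(base, [finetuned_a, finetuned_b], scale=1.0)
merged_half = task_arithmetic(base, [finetuned_a, finetuned_b], scale=0.5)
print(merged_full, merged_half)
```

With real models the same arithmetic would be applied tensor by tensor over the checkpoint state, and the scaling coefficient is the main knob that trades off task gains against drift from the base model.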
[13] Hierarchical Ranking Neural Network for Long Document Readability Assessment
Yurui Zheng,Yijun Chen,Shaohong Zhang
Main category: cs.CL
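TL;DR: The paper proposes a bidirectional readability assessment mechanism that uses contextual information to predict sentence-level readability labels, and models the ordinal relationship between readability levels through label subtraction, improving readability assessment for long documents.
Details
Motivation: Existing deep learning approaches to readability assessment often ignore either text length or the ordinal relationship between readability labels, which hurts performance on long documents.
Contribution: 1. A bidirectional readability assessment mechanism that combines sentence-level and document-level information. 2. A label-subtraction algorithm that models ordinal relationships. 3. Validation on Chinese and English datasets.
Method: 1. A bidirectional mechanism captures contextual information and predicts sentence-level readability. 2. A pairwise sorting algorithm models the ordinal relationship between labels. 3. Sentence-level labels assist the document-level prediction.
Result: Experiments show the model outperforms baseline methods on Chinese and English datasets.
Insight: Combining sentence-level and document-level information improves readability assessment for long documents, and modeling the ordinal relationship between labels helps model performance.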
Abstract: Readability assessment aims to evaluate the reading difficulty of a text. In recent years, while deep learning technology has been gradually applied to readability assessment, most approaches fail to consider either the length of the text or the ordinal relationship of readability labels. This paper proposes a bidirectional readability assessment mechanism that captures contextual information to identify regions with rich semantic information in the text, thereby predicting the readability level of individual sentences. These sentence-level labels are then used to assist in predicting the overall readability level of the document. Additionally, a pairwise sorting algorithm is introduced to model the ordinal relationship between readability levels through label subtraction. Experimental results on Chinese and English datasets demonstrate that the proposed model achieves competitive performance and outperforms other baseline models.
[14] Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation
Lina Conti,Dennis Fucci,Marco Gaido,Matteo Negri,Guillaume Wisniewski,Luisa Bentivogli
Main category: cs.CL
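TL;DR: The paper studies how speech translation (ST) models assign gender, examining the interaction between training data patterns, internal language model (ILM) bias, and acoustic information. Models do not simply copy gender associations from training data but learn a broader pattern of masculine prevalence, and acoustic input can override the ILM's preference. The more accurate model uses first-person pronouns to link gendered terms back to the speaker, drawing on gender information distributed across the frequency spectrum.
Details
Motivation: Gender assignment in speech translation can misgender speakers, yet the mechanism behind it is poorly understood. The study aims to show how ST models combine acoustic and linguistic information to assign gender.
Contribution: A previously unknown mechanism of gender assignment in ST models: first-person pronouns are used to link gendered terms to the speaker, with gender information taken from across the frequency spectrum rather than from pitch alone.
Method: Contrastive feature attribution on spectrograms is used to study how ILM bias, training data, and acoustic information interact.
Result: Models can override the ILM's masculine preference based on acoustic input, and the model with higher gender accuracy relies on gender information distributed across the frequency spectrum.
Insight: Gender assignment depends on more than pitch; it can also be shaped by context and first-person pronouns, which shows how complex the model's behavior is.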
Abstract: Unlike text, speech conveys information about the speaker, such as gender, through acoustic cues like pitch. This gives rise to modality-specific bias concerns. For example, in speech translation (ST), when translating from languages with notional gender, such as English, into languages where gender-ambiguous terms referring to the speaker are assigned grammatical gender, the speaker’s vocal characteristics may play a role in gender assignment. This risks misgendering speakers, whether through masculine defaults or vocal-based assumptions. Yet, how ST models make these decisions remains poorly understood. We investigate the mechanisms ST models use to assign gender to speaker-referring terms across three language pairs (en-es/fr/it), examining how training data patterns, internal language model (ILM) biases, and acoustic information interact. We find that models do not simply replicate term-specific gender associations from training data, but learn broader patterns of masculine prevalence. While the ILM exhibits strong masculine bias, models can override these preferences based on acoustic input. Using contrastive feature attribution on spectrograms, we reveal that the model with higher gender accuracy relies on a previously unknown mechanism: using first-person pronouns to link gendered terms back to the speaker, accessing gender information distributed across the frequency spectrum rather than concentrated in pitch.
[15] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions
Minjoon Choi
Main category: cs.CL
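TL;DR: RoParQ is a benchmark for evaluating the cross-paraphrase consistency of large language models (LLMs) when answering paraphrased questions. The paper also proposes XParaCon, a new metric for model robustness, and a reasoning-based supervised fine-tuning strategy that improves semantic invariance.
Details
Motivation: LLMs answer paraphrased questions inconsistently, which suggests they rely on surface patterns rather than semantic understanding. This limits their robustness and reliability.
Contribution: 1) The RoParQ benchmark for cross-paraphrase consistency. 2) The XParaCon metric for quantifying robustness. 3) A reasoning-based supervised fine-tuning strategy that significantly improves semantic consistency.
Method: 1) RoParQ is built by generating paraphrased questions with proprietary models. 2) XParaCon measures the standard deviation of accuracies across question variants. 3) A reasoning-based, paraphrase-aware supervised fine-tuning (SFT) strategy aligns models toward semantic invariance.
Result: Fine-tuned lightweight models reach consistency levels comparable to much larger pre-trained models, supporting the effectiveness of the approach.
Insight: Targeted fine-tuning and semantic alignment can reduce LLM reliance on surface patterns and improve semantic understanding and robustness.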
Abstract: Large Language Models (LLMs) often exhibit inconsistent behavior when answering paraphrased questions, suggesting a reliance on surface-level patterns rather than true semantic understanding. To address this limitation, we introduce RoParQ, a benchmark specifically constructed to evaluate cross-paraphrase consistency in closed-book multiple-choice QA. This benchmark is derived from standard datasets by generating paraphrases via proprietary models and selectively retaining examples that elicit inconsistent confidence from a judge model. We further propose XParaCon, a novel evaluation metric that quantifies a model’s robustness by measuring the standard deviation of accuracies across question variants. Additionally, we implement a reasoning-based, paraphrase-aware Supervised Fine-Tuning (SFT) strategy designed to align models toward semantic invariance. Our experiments demonstrate that this targeted alignment significantly enhances robustness. Notably, fine-tuned lightweight models achieved consistency levels comparable to much larger pre-trained models. These results highlight the efficacy of our approach in mitigating superficial memorization and fostering more robust, reliable LLMs.
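The abstract describes XParaCon only as measuring the standard deviation of accuracies across question variants. The sketch below illustrates that underlying quantity and nothing more: it is not the paper's exact formula, and the function names, the use of population standard deviation, and the toy correctness matrix are all assumptions.

```python
from statistics import pstdev


def variant_accuracies(correct):
    """Accuracy of each paraphrase variant over all questions.

    `correct[q][v]` is True if the model answered variant v of question q
    correctly; every question is assumed to have the same number of variants.
    """
    n_questions = len(correct)
    n_variants = len(correct[0])
    return [sum(row[v] for row in correct) / n_questions
            for v in range(n_variants)]


def accuracy_std_across_variants(correct):
    """Population standard deviation of per-variant accuracies (assumed form)."""
    return pstdev(variant_accuracies(correct))


# Hypothetical results: 4 questions, 3 paraphrase variants each.
results = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [True, True, False],
]
accuracies = variant_accuracies(results)
spread = accuracy_std_across_variants(results)
print(accuracies, spread)
```

A model whose accuracy is identical on every variant has a spread of zero under this sketch; a larger spread indicates that performance depends on surface phrasing.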
[16] Auxiliary Metrics Help Decoding Skill Neurons in the Wild
Yixiu Zhao,Xiaozhi Wang,Zijun Yao,Lei Hou,Juanzi Li
Main category: cs.CL
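TL;DR: The paper proposes a lightweight method that uses auxiliary metrics (such as external labels and model confidence) to decode "skill neurons" in large language models, revealing task-specific behavior, and validates it across diverse tasks.
Details
Motivation: Large language models (LLMs) are capable across many tasks, yet their internal mechanisms remain opaque. The study aims to improve interpretability by identifying neurons that encode specific skills (skill neurons).
Contribution: A new method that decodes skill neurons using auxiliary metrics such as external labels and model confidence, and that reveals previously unidentified shortcut behavior in complex tasks.
Method: Prior soft-prompt-training-based work is extended to multi-skill settings; neuron activations are correlated with auxiliary metrics to identify task-specific behavior without manual token aggregation.
Result: The method is validated on open-ended text generation and natural language inference, and it uncovers previously unidentified shortcuts in arithmetic reasoning on BigBench.
Insight: Auxiliary metrics can efficiently decode skill neurons and expose hidden behaviors and shortcut strategies, offering a new tool for model interpretability.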
Abstract: Large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, yet their internal mechanisms remain largely opaque. In this paper, we introduce a simple, lightweight, and broadly applicable method with a focus on isolating neurons that encode specific skills. Building upon prior work that identified “skill neurons” via soft prompt training on classification tasks, our approach extends the analysis to complex scenarios involving multiple skills. We correlate neuron activations with auxiliary metrics – such as external labels and the model’s own confidence score – thereby uncovering interpretable and task-specific behaviors without the need for manual token aggregation. We empirically validate our method on tasks spanning open-ended text generation and natural language inference, demonstrating its ability to detect neurons that not only drive known skills but also reveal previously unidentified shortcuts in arithmetic reasoning on BigBench.
[17] Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining
Dongyang Fan,Diba Hashemi,Sai Praneeth Karimireddy,Martin Jaggi
Main category: cs.CL
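TL;DR: The paper studies how diverse metadata can make LLM pretraining more efficient. Beyond URLs, finer-grained metadata such as document-quality indicators also accelerates training. Metadata appending and learnable meta-tokens offer further ways to gain efficiency, and an analysis of latent representations shows how metadata shapes learning.
Details
Motivation: Prior work on metadata in LLM pretraining highlighted only one useful signal, URLs, leaving open whether other forms of metadata could bring larger gains. The study explores a wider range of metadata and its effect on pretraining efficiency and effectiveness.
Contribution: (1) Identifies several effective metadata types, such as document-quality indicators; (2) proposes metadata appending and learnable meta-tokens; (3) uses representation analysis to show how metadata shapes learning.
Method: (1) Experiments with multiple metadata types (URLs, document quality, and others); (2) metadata appending, where predicting the metadata serves as an auxiliary task; (3) learnable meta-tokens trained with a masked loss; (4) probing of latent representations.
Result: Fine-grained metadata such as document quality accelerates pretraining when prepended. Metadata appending also speeds up pretraining, and learnable meta-tokens recover part of the speedup. Probing indicates that metadata induces quality-aware latent structure.
Insight: Effective metadata types share one feature: they encode information at a finer granularity. Predicting metadata as an auxiliary task can improve training efficiency, and the latent-representation analysis helps explain why metadata integration works.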
Abstract: Incorporating metadata in Large Language Model (LLM) pretraining has recently emerged as a promising approach to accelerate training. However, prior work highlighted only one useful signal, URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find that other types of metadata, such as fine-grained indicators of document quality, can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.
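The difference between prepending and appending metadata can be illustrated with a small sequence-building sketch. It assumes that prepended metadata is excluded from the language-modeling loss while appended metadata is a prediction target; the abstract does not spell out the masking details, so the mask convention and the `<q=high>` token are hypothetical.

```python
def build_example(doc_tokens, meta_tokens, mode):
    """Return (tokens, loss_mask); loss_mask[i] == 1 means token i contributes to the LM loss."""
    if mode == "prepend":
        # Metadata conditions the document; assumed here not to be a prediction target.
        tokens = meta_tokens + doc_tokens
        mask = [0] * len(meta_tokens) + [1] * len(doc_tokens)
    elif mode == "append":
        # Metadata follows the document and is predicted as an auxiliary target.
        tokens = doc_tokens + meta_tokens
        mask = [1] * len(tokens)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return tokens, mask

doc = ["the", "cat", "sat"]
meta = ["<q=high>"]  # hypothetical document-quality token
prepended = build_example(doc, meta, "prepend")
appended = build_example(doc, meta, "append")
```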
[18] The author is dead, but what if they never lived? A reception experiment on Czech AI- and human-authored poetry
Anna Marklová,Ondřej Vinš,Martina Vokáčová,Jiří Milička
Main category: cs.CL
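TL;DR: This paper studies how readers perceive AI-generated versus human-written Czech poetry. Native Czech speakers could not reliably tell the two apart, and their aesthetic judgments were biased against poems they believed to be AI-generated.
Details
Motivation: The creative abilities of large language models outside English are understudied, especially in morphologically complex, low-resource languages such as Czech. The paper aims to fill this gap.
Contribution: (1) Shows that AI can generate Czech poetry that is hard to distinguish from human poetry; (2) reveals the gap between readers’ bias against presumed AI authorship and their actual ratings of AI poems.
Method: Native Czech speakers guessed the authorship of AI- and human-written poems and rated them aesthetically; the responses were analyzed with a logistic regression model.
Result: Participants guessed authorship correctly only 45.8% of the time on average. AI poems were rated equally or more favorably than human ones on average, but poems that participants believed to be AI-generated received lower ratings.
Insight: Readers’ beliefs about authorship and their aesthetic evaluations are interconnected, and AI can produce convincing poetry even in a low-resource language.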
Abstract: Large language models are increasingly capable of producing creative texts, yet most studies on AI-generated poetry focus on English, a language that dominates training data. In this paper, we examine the perception of AI- and human-written Czech poetry. We ask whether Czech native speakers are able to identify it and how they aesthetically judge it. Participants performed at chance level when guessing authorship (45.8% correct on average), indicating that Czech AI-generated poems were largely indistinguishable from human-written ones. Aesthetic evaluations revealed a strong authorship bias: when participants believed a poem was AI-generated, they rated it less favorably, even though AI poems were in fact rated equally or more favorably than human ones on average. The logistic regression model showed that the more participants liked a poem, the less likely they were to assign its authorship accurately. Familiarity with poetry or literary background had no effect on recognition accuracy. Our findings show that AI can convincingly produce poetry even in a morphologically complex, low-resource (with respect to the training data of AI models) Slavic language such as Czech. The results suggest that readers’ beliefs about authorship and the aesthetic evaluation of the poem are interconnected.
[19] Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework
Dong Wang,Yang Li,Ansong Ni,Ching-Feng Yeh,Youssef Emad,Xinjie Lei,Liam Robbins,Karthik Padthe,Hu Xu,Xian Li,Asli Celikyilmaz,Ramya Raghavendra,Lifei Huang,Carole-Jean Wu,Shang-Wen Li
Main category: cs.CL
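TL;DR: Matrix is a decentralized multi-agent framework for synthetic data generation. Lightweight agents cooperate through distributed queues, giving higher throughput and more flexibility than existing approaches.
Details
Motivation: Existing multi-agent synthetic data frameworks often depend on a centralized orchestrator, which creates scalability bottlenecks, or are hardcoded for specific domains, which limits flexibility.
Contribution: Proposes Matrix, a decentralized framework that passes serialized messages through distributed queues and eliminates the central orchestrator.
Method: Built on Ray, the framework represents both control and data flow as serialized messages passed through distributed queues. Lightweight agents advance each task independently, while compute-intensive operations are handled by distributed services.
Result: Across diverse synthesis scenarios, Matrix achieves 2–15× higher data generation throughput under identical hardware resources, without compromising output quality.
Insight: A decentralized, modular design can substantially improve the throughput of multi-agent collaboration and the flexibility of data generation.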
Abstract: Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present Matrix, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves 2–15× higher data generation throughput under identical hardware resources, without compromising output quality.
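The idea of carrying both control and data flow inside the message can be sketched in a few lines. This is a single-process toy using only the standard library, not Ray and not the Matrix API; the agent names, message fields, and the polling loop that stands in for independent workers are all assumptions made for illustration.

```python
from collections import deque

def make_message(task_id, route):
    # The message carries data (history) and control flow (remaining route) together.
    return {"task_id": task_id, "route": list(route), "history": []}

def step(agent_name, handler, queues):
    """Process one message from this agent's queue and hand it to the next hop."""
    msg = queues[agent_name].popleft()
    msg["history"].append(handler(msg))
    msg["route"].pop(0)                      # this hop is done
    if msg["route"]:
        queues[msg["route"][0]].append(msg)  # forwarded directly, no orchestrator involved
        return None
    return msg                               # route exhausted: the task is finished

handlers = {
    "writer": lambda m: f"draft-{m['task_id']}",
    "critic": lambda m: f"review-of-{m['history'][-1]}",
}
queues = {name: deque() for name in handlers}
for task_id in range(3):
    queues["writer"].append(make_message(task_id, ["writer", "critic"]))

finished = []
# Stand-in for independent workers: each agent only ever drains its own queue.
while any(queues.values()):
    for name, handler in handlers.items():
        if queues[name]:
            done = step(name, handler, queues)
            if done is not None:
                finished.append(done)
```

Because the next hop is read from the message itself, adding an agent to a workflow only means extending the route, which is the property the peer-to-peer design relies on.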
[20] ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration
Hongjin Su,Shizhe Diao,Ximing Lu,Mingjie Liu,Jiacheng Xu,Xin Dong,Yonggan Fu,Peter Belcak,Hanrong Ye,Hongxu Yin,Yi Dong,Evelina Bakhturina,Tao Yu,Yejin Choi,Jan Kautz,Pavlo Molchanov
Main category: cs.CL
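TL;DR: ToolOrchestra trains small orchestrators that manage other models and tools, raising the upper bound of intelligence while solving tasks more efficiently.
Details
Motivation: Large language models are powerful generalists, but solving deep and complex problems remains conceptually challenging and computationally expensive.
Contribution: Proposes ToolOrchestra, which uses reinforcement learning to train a small orchestrator (an 8B model) that balances performance, cost, and user tool preferences.
Method: ToolOrchestra applies reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards to coordinate intelligent tools.
Result: On HLE, Orchestrator scores 37.1% versus 35.1% for GPT-5 while being 2.5× more efficient. On tau2-Bench and FRAMES it surpasses GPT-5 while using only about 30% of the cost, and it generalizes robustly to unseen tools.
Insight: Composing diverse tools with a lightweight orchestration model is more efficient and more effective than existing methods, pointing toward practical and scalable tool-augmented reasoning systems.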
Abstract: Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity’s Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.
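A reward that is outcome-, efficiency-, and preference-aware could be combined as in the sketch below. The linear form, the weights, the cost budget, and the tool names are all assumptions for illustration; the abstract does not give the actual reward definition.

```python
def orchestration_reward(correct, cost, used_tools, preferred_tools,
                         cost_budget=1.0, w_cost=0.3, w_pref=0.2):
    """Toy scalar reward: task outcome, minus a cost penalty, plus a preference bonus."""
    outcome = 1.0 if correct else 0.0
    # Cost penalty saturates once the (hypothetical) budget is exceeded.
    efficiency_penalty = min(cost / cost_budget, 1.0)
    if used_tools:
        # Fraction of tool calls that went to tools the user prefers.
        preference = sum(t in preferred_tools for t in used_tools) / len(used_tools)
    else:
        preference = 0.0
    return outcome - w_cost * efficiency_penalty + w_pref * preference

cheap = orchestration_reward(True, 0.2, ["search"], {"search"})
costly = orchestration_reward(True, 0.9, ["big_llm"], {"search"})
```

Under this sketch two correct trajectories are separated only by cost and preference terms, which is the kind of signal that would push a policy toward cheaper tools when they suffice.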
[21] Revisiting Generalization Across Difficulty Levels: It’s Not So Easy
Yeganeh Kordi,Nihal V. Nayak,Max Zuo,Ilana Nguyen,Stephen H. Bach
Main category: cs.CL
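TL;DR: The paper studies how well large language models (LLMs) generalize across task difficulty levels. Cross-difficulty generalization is often limited: training on either easy or hard data does not yield consistent improvements across the full range of difficulties.
Details
Motivation: Existing research is mixed on whether training on easier or harder data leads to better results, and on whether the gains appear on easier or harder test data. The authors address the question with a more objective, larger-scale, and finer-grained analysis.
Contribution: Derives difficulty ratings from the outputs of many LLMs combined with Item Response Theory (IRT), excluding human opinions of difficulty, and systematically evaluates the limits of cross-difficulty generalization.
Method: Ranks examples in six datasets using the outputs of thousands of different LLMs and IRT, then measures how the difficulty of the training data affects performance across fine-grained difficulty groups.
Result: Training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties; cross-difficulty generalization is often limited.
Insight: Training and evaluation data for LLMs should both cover a range of difficulties; relying on a single difficulty level is risky.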
Abstract: We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs’ generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.
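For readers unfamiliar with IRT, the simplest variant (the one-parameter Rasch model) gives the probability that a model with ability `theta` answers an item of difficulty `b` correctly. The sketch below shows that formula together with a crude difficulty proxy based on the smoothed fraction of models answering correctly. The proxy is an assumption for illustration only; it is not a fitted IRT model and not necessarily the variant the authors use.

```python
import math

def rasch_p(theta, b):
    """Rasch (1PL) model: P(correct) = sigmoid(theta - b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def crude_difficulty(responses):
    """responses: 0/1 outcomes from many models on one example; higher return value = harder."""
    # Add-half smoothing keeps the logit finite when all models agree.
    p = (sum(responses) + 0.5) / (len(responses) + 1.0)
    return -math.log(p / (1.0 - p))

easy_item = crude_difficulty([1, 1, 1, 0])
hard_item = crude_difficulty([0, 0, 0, 1])
```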
cs.CV [Back]
[22] Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?
David Amebley,Sayanton Dibbo
Main category: cs.CV
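TL;DR: The paper studies a black-box privacy attack, the membership inference attack (MIA), on multi-modal vision-language models (VLMs), and introduces a neuroscience-inspired topological regularization (tau) framework to analyze their resilience. Experiments show lower attack success with similar model utility.
Details
Motivation: As multi-modal models are deployed more widely, the risk of privacy leakage grows. Existing research focuses mainly on privacy attacks against unimodal systems, and it is unexplored whether neuro-inspired multi-modal models are resilient to such attacks.
Contribution: 1. Introduces a neuroscience-inspired topological regularization (tau) framework for analyzing and improving the privacy resilience of multi-modal models. 2. Shows experimentally that it lowers membership inference attack success without significantly compromising model utility.
Method: Trains three VLMs (BLIP, PaliGemma 2, and ViT-GPT2) with topological regularization, where tau > 0 defines the NEURO variant, and evaluates resilience on COCO, CC3M, and NoCaps.
Result: For BLIP on COCO, MIA attack success in NEURO VLMs drops by 24% mean ROC-AUC, while utility measured by MPNet and ROUGE-2 stays similar. Results on the other models and datasets are consistent.
Insight: Neuroscience-inspired topological regularization can make multi-modal models more resilient to privacy attacks and offers a new angle on their privacy and security.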
Abstract: In the age of agentic AI, the growing deployment of multi-modal models (MMs) has introduced new attack vectors that can leak sensitive training data in MMs, causing privacy leakage. This paper investigates a black-box privacy attack, i.e., membership inference attack (MIA), on multi-modal vision-language models (VLMs). State-of-the-art research analyzes privacy attacks primarily on unimodal AI-ML systems, while recent studies indicate MMs can also be vulnerable to privacy attacks. While researchers have demonstrated that biologically inspired neural network representations can improve unimodal model resilience against adversarial attacks, it remains unexplored whether neuro-inspired MMs are resilient against privacy attacks. In this work, we introduce a systematic neuroscience-inspired topological regularization (tau) framework to analyze the resilience of multi-modal VLMs against image-text-based inference privacy attacks. We examine this phenomenon using three VLMs: BLIP, PaliGemma 2, and ViT-GPT2, across three benchmark datasets: COCO, CC3M, and NoCaps. Our experiments compare the resilience of baseline and neuro VLMs (with topological regularization), where the tau > 0 configuration defines the NEURO variant of a VLM. Our results on the BLIP model using the COCO dataset illustrate that MIA attack success in NEURO VLMs drops by 24% mean ROC-AUC, while achieving similar model utility (similarities between generated and reference captions) in terms of MPNet and ROUGE-2 metrics. This shows neuro VLMs are comparatively more resilient against privacy attacks, while not significantly compromising model utility. Our extensive evaluation with PaliGemma 2 and ViT-GPT2 models, on two additional datasets, CC3M and NoCaps, further validates the consistency of the findings. This work contributes to the growing understanding of privacy risks in MMs and provides evidence on the resilience of neuro VLMs to privacy threats.
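Attack success in the abstract is reported as ROC-AUC. As a reminder of what that number measures, the sketch below computes AUC as the probability that a randomly chosen training member receives a higher attack score than a randomly chosen non-member (ties count as half). The scores are invented, and what the attack actually scores is an assumption left open here.

```python
def roc_auc(member_scores, nonmember_scores):
    """AUC via pairwise comparison: P(member score > non-member score), ties count 0.5."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Hypothetical attack scores: higher means "more likely seen during training".
baseline_auc = roc_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
chance_auc = roc_auc([0.5, 0.5], [0.5, 0.5])
```

An AUC of 0.5 means the attacker does no better than guessing, so a defense is judged by how far it pulls the attack AUC back toward that value.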
[23] Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
Inferix Team,Tianyu Feng,Yizeng Han,Jiahao He,Yuanyu He,Xi Lin,Teng Liu,Hanfeng Lu,Jiasheng Tang,Wei Wang,Zhiyuan Wang,Jichao Wu,Mingyang Yang,Yinghao Yu,Zeyu Zhang,Bohan Zhuang
Main category: cs.CV
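TL;DR: Inferix is a next-generation inference engine based on block diffusion that targets high-quality, immersive world simulation. It optimizes semi-autoregressive decoding to overcome limitations of standard video diffusion, and supports interactive video streaming and fine-grained evaluation.
Details
Motivation: World models serve as core simulators for agentic AI, embodied AI, and gaming, but standard video diffusion models struggle with coherence and efficiency on long sequences. Inferix addresses this through semi-autoregressive decoding.
Contribution: Presents Inferix, a block-diffusion inference engine with optimized semi-autoregressive decoding for efficient long-video generation, and integrates LV-Bench, a fine-grained benchmark for minute-long video generation.
Method: Semi-autoregressive (block-diffusion) decoding combines the strengths of diffusion and autoregressive methods: video tokens are generated in blocks, with diffusion applied within each block while conditioning on previous blocks, and LLM-style KV cache management improves efficiency.
Result: Inferix is designed to produce more coherent and stable long video sequences and supports real-time interaction and fine-grained evaluation, providing an efficient tool for world models.
Insight: Block diffusion may become a core technique for future world models; combining it with interactive design and fine-grained evaluation could advance AI simulation of the real world.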
Abstract: World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in blocks, applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV Cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.
[24] Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection?
Kun Guo,Yun Shen,Xijun Wang,Chaoqun You,Yun Rui,Tony Q. S. Quek
Main category: cs.CV
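TL;DR: The paper examines two strategies for video object recognition in mobile edge networks: local tracking and edge detection. It formulates long-term optimization problems and proposes LTED-Ada, a deep reinforcement learning algorithm that adaptively chooses between the two, and demonstrates its advantage in single-device and multi-device scenarios.
Details
Motivation: Resource-constrained devices such as traffic cameras struggle to perform fast and accurate video object recognition. Mobile edge computing allows computation-intensive detection to be offloaded to edge servers, but the device must then decide dynamically between local tracking and edge detection.
Contribution: 1) Formulates video object recognition as two long-term optimization problems, for the single-device and multi-device scenarios; 2) proposes LTED-Ada, which selects the strategy adaptively through deep reinforcement learning; 3) enhances the algorithm with federated learning in the multi-device scenario.
Method: 1) In the single-device scenario, LTED-Ada uses deep reinforcement learning to choose between local tracking and edge detection; 2) in the multi-device scenario, federated learning enables collaborative policy training to improve generalization.
Result: Hardware-in-the-loop experiments show that LTED-Ada outperforms other methods across frame rates and across recognition accuracy and delay requirements.
Insight: Combining the speed of local tracking with the accuracy of edge detection, and switching between them dynamically, can substantially improve the efficiency and performance of video object recognition.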
Abstract: Fast and accurate video object recognition, which relies on frame-by-frame video analytics, remains a challenge for resource-constrained devices such as traffic cameras. Recent advances in mobile edge computing have made it possible to offload computation-intensive object detection to edge servers equipped with high-accuracy neural networks, while lightweight and fast object tracking algorithms run locally on devices. This hybrid approach offers a promising solution but introduces a new challenge: deciding when to perform edge detection versus local tracking. To address this, we formulate two long-term optimization problems for both single-device and multi-device scenarios, taking into account the temporal correlation of consecutive frames and the dynamic conditions of mobile edge networks. Based on the formulation, we propose LTED-Ada for the single-device setting, a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection according to the frame rate as well as the recognition accuracy and delay requirements. In the multi-device setting, we further enhance LTED-Ada using federated learning to enable collaborative policy training across devices, thereby improving its generalization to unseen frame rates and performance requirements. Finally, we conduct extensive hardware-in-the-loop experiments using multiple Raspberry Pi 4B devices and a personal computer as the edge server, demonstrating the superiority of LTED-Ada.
[25] DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving
Haibo HU,Lianming Huang,Nan Guan,Chun Jason Xue
Main category: cs.CV
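TL;DR: The paper proposes DeeAD, a framework that speeds up inference of Vision-Language Action (VLA) models through dynamic early exit, reducing latency while preserving planning quality and safety.
Details
Motivation: VLA models unify perception, reasoning, and trajectory generation for autonomous driving, but their deep transformer stacks cause significant inference latency, which limits practical use.
Contribution: Proposes DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories.
Method: DeeAD terminates inference early when the predicted trajectory aligns with a lightweight planning prior (for example, navigation or low-precision planning) within a deviation of less than 2 m, and uses a multi-hop controller to adaptively skip redundant layers.
Result: On the Bench2Drive benchmark, DeeAD achieves up to 28% transformer-layer sparsity and a 29% latency reduction while preserving planning quality and safety.
Insight: Judging trajectory feasibility dynamically, rather than relying on confidence scores, can make VLA inference noticeably more efficient without sacrificing performance.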
Abstract: Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Instead of relying on confidence scores, DeeAD terminates inference when predicted trajectories align with lightweight planning priors (e.g., Navigation or Low-precision Planning) within a tolerable deviation (<2m). To improve efficiency, we introduce a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. DeeAD integrates into existing VLA models, such as ORION, without requiring retraining. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.
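The exit rule described above can be sketched as a loop over per-layer intermediate trajectories. This assumes the deviation is the maximum pointwise Euclidean distance to the prior trajectory, in meters; the abstract only states the 2 m tolerance, and the multi-hop layer skipping is not modeled. All names and numbers are hypothetical.

```python
import math

def max_deviation(traj, prior):
    """Largest pointwise Euclidean distance between two equal-length 2D trajectories."""
    return max(math.dist(p, q) for p, q in zip(traj, prior))

def early_exit_layer(layer_trajs, prior, tol_m=2.0):
    """Index of the first layer whose trajectory is within tol_m of the prior, else the last layer."""
    for idx, traj in enumerate(layer_trajs):
        if max_deviation(traj, prior) < tol_m:
            return idx
    return len(layer_trajs) - 1

prior = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # e.g. a coarse navigation path
layer_trajs = [
    [(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)],  # layer 0: 3.0 m off
    [(0.0, 2.5), (1.0, 2.2), (2.0, 2.1)],  # layer 1: still at least 2 m off
    [(0.0, 1.0), (1.0, 1.5), (2.0, 1.9)],  # layer 2: within tolerance
    [(0.0, 0.1), (1.0, 0.2), (2.0, 0.1)],  # layer 3: never evaluated
]
exit_at = early_exit_layer(layer_trajs, prior)
```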
[26] Foundry: Distilling 3D Foundation Models for the Edge
Guillaume Letellier,Siddharth Srivastava,Frédéric Jurie,Gaurav Sharma
Main category: cs.CV
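TL;DR: The paper proposes Foundation Model Distillation (FMD), which compresses large models into compact, efficient ones that retain general-purpose representational power, and presents Foundry, an implementation suited to edge devices.
Details
Motivation: Foundation models trained with large-scale self-supervised learning are hard to deploy on edge devices, and existing compression techniques sacrifice their generality. A method is needed that compresses the model while keeping its general-purpose capability.
Contribution: Introduces the FMD paradigm and Foundry, its first implementation for 3D point clouds; SuperTokens reconstruct the teacher’s representations, preserving both generality and efficiency.
Method: A student is trained to learn a compressed set of SuperTokens that reconstruct the teacher’s token-level representations, capturing a compact basis of its latent space.
Result: On classification, part segmentation, and few-shot tasks, the distilled model approaches the performance of the full foundation model while using significantly fewer tokens and FLOPs.
Insight: Compressing into SuperTokens can cut compute substantially while keeping the model general, which makes it practical for edge deployment.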
Abstract: Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient ‘specialist’ models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach, Foundry, trains a student to learn a compressed set of SuperTokens that reconstruct the teacher’s token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks (classification, part segmentation, and few-shot scenarios), approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resource-constrained hardware.
[27] DinoLizer: Learning from the Best for Generative Inpainting Localization
Minh Thong Doi,Jan Butora,Vincent Itier,Jérémie Boulanger,Patrick Bas
Main category: cs.CV
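TL;DR: DinoLizer is a DINOv2-based model for localizing manipulated regions in generatively inpainted images. It combines Vision Transformer patch embeddings with a sliding-window strategy and outperforms existing methods on datasets derived from several generative models.
Details
Motivation: Image manipulations produced by generative inpainting are increasingly hard to detect, and existing localization methods struggle with complex semantic changes and non-semantic edits. An efficient method for detecting manipulated regions is therefore needed.
Contribution: Proposes DinoLizer, a DINOv2-based model that adds a linear classification head on top of the Vision Transformer’s patch embeddings to localize manipulated regions, and that surpasses existing methods on multiple datasets.
Method: Starts from a DINOv2 model pretrained to detect synthetic images and adds a linear classification head that predicts manipulations at a 14×14 patch resolution. A sliding-window strategy handles larger images, and the resulting heatmaps are post-processed into refined binary masks.
Result: On a range of generative inpainting datasets, DinoLizer achieves on average a 12% higher IoU than the next best model and remains robust to common post-processing operations such as JPEG compression.
Insight: Vision Transformer representations are well suited to manipulation localization, and ablation studies comparing DINOv2 with its successor DINOv3 further support DinoLizer’s advantage.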
Abstract: We introduce DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. Our method builds on a DINOv2 model pretrained to detect synthetic images on the B-Free dataset. We add a linear classification head on top of the Vision Transformer’s patch embeddings to predict manipulations at a $14\times 14$ patch resolution. The head is trained to focus on semantically altered regions, treating non-semantic edits as part of the original content. Because the ViT accepts only fixed-size inputs, we use a sliding-window strategy to aggregate predictions over larger images; the resulting heatmaps are post-processed to refine the estimated binary manipulation masks. Empirical results show that DinoLizer surpasses state-of-the-art local manipulation detectors on a range of inpainting datasets derived from different generative models. It remains robust to common post-processing operations such as resizing, noise addition, and JPEG (double) compression. On average, DinoLizer achieves a 12% higher Intersection-over-Union (IoU) than the next best model, with even greater gains after post-processing. Our experiments with off-the-shelf DINOv2 demonstrate the strong representational power of Vision Transformers for this task. Finally, extensive ablation studies comparing DINOv2 and its successor, DINOv3, in deepfake localization confirm DinoLizer’s superiority. The code will be publicly available upon acceptance of the paper.
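The sliding-window aggregation mentioned above can be sketched as follows. It assumes overlapping window predictions are averaged and then thresholded at 0.5; the abstract does not state how predictions are aggregated or post-processed, and `toy_predict` is a hypothetical stand-in for the model.

```python
def window_starts(size, window, stride):
    """Start offsets covering [0, size) with windows of length `window` (requires size >= window)."""
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] != size - window:
        starts.append(size - window)  # extra window so the border is covered
    return starts

def sliding_window_heatmap(height, width, window, stride, predict):
    """Average overlapping per-window score maps into one full-size heatmap."""
    sums = [[0.0] * width for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for top in window_starts(height, window, stride):
        for left in window_starts(width, window, stride):
            scores = predict(top, left)  # window x window grid of scores
            for r in range(window):
                for c in range(window):
                    sums[top + r][left + c] += scores[r][c]
                    counts[top + r][left + c] += 1
    return [[s / n for s, n in zip(srow, nrow)] for srow, nrow in zip(sums, counts)]

def toy_predict(top, left, window=4):
    # Hypothetical model: flags every cell whose absolute column index is >= 5.
    return [[1.0 if left + c >= 5 else 0.0 for c in range(window)] for _ in range(window)]

heatmap = sliding_window_heatmap(6, 10, window=4, stride=3, predict=toy_predict)
mask = [[v >= 0.5 for v in row] for row in heatmap]
```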
[28] CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design
Daeheon Jeong,Seoyeon Byun,Kihoon Son,Dae Hyun Kim,Juho Kim
Main category: cs.CV
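TL;DR: CANVAS is a new benchmark for evaluating vision-language models (VLMs) on tool-based user interface (UI) design. It contains 598 tasks across 30 function-based categories and tests iterative design ability through tool invocations in design software.
Details
Motivation: UI design is an iterative process, but no existing benchmark evaluates how well VLMs invoke tools inside design software. This leaves their potential to collaborate with designers unknown.
Contribution: Introduces the CANVAS benchmark for tool-based UI design, with a rich dataset and an in-depth analysis of VLM performance.
Method: CANVAS includes two task types: design replication (reproducing a whole UI screen) and design modification (changing a specific part of an existing screen). Tasks are carried out step by step through tool invocations.
Result: Leading models exhibit more strategic tool invocations, which improves design quality. The study also identifies common error patterns that can guide future work.
Insight: Tool invocation inside design software is key to VLM collaboration with designers, and CANVAS provides a basis for evaluating and improving it.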
Abstract: User interface (UI) design is an iterative process in which designers progressively refine their work with design software such as Figma or Sketch. Recent advances in vision language models (VLMs) with tool invocation suggest these models can operate design software to edit a UI design through iteration. Understanding and enhancing this capacity is important, as it highlights VLMs’ potential to collaborate with designers within conventional software. However, as no existing benchmark evaluates tool-based design performance, the capacity remains unknown. To address this, we introduce CANVAS, a benchmark for VLMs on tool-based user interface design. Our benchmark contains 598 tool-based design tasks paired with ground-truth references sampled from 3.3K mobile UI designs across 30 function-based categories (e.g., onboarding, messaging). In each task, a VLM updates the design step-by-step through context-based tool invocations (e.g., create a rectangle as a button background), linked to design software. Specifically, CANVAS incorporates two task types: (i) design replication evaluates the ability to reproduce a whole UI screen; (ii) design modification evaluates the ability to modify a specific part of an existing screen. Results suggest that leading models exhibit more strategic tool invocations, improving design quality. Furthermore, we identify common error patterns models exhibit, guiding future work in enhancing tool-based design capabilities.
[29] Text-Guided Semantic Image Encoder
Raghuveer Thirukovalluru,Xiaochuang Han,Bhuwan Dhingra,Emily Dinan,Maha Elbayad
Main category: cs.CV
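TL;DR: TIE is a text-guided image encoder that conditions image representations on the input text query, improving both the performance and the efficiency of vision-language models.
Details
Motivation: Image encoders in current vision-language models are usually pretrained independently and ignore the specific downstream task and text query, which limits performance.
Contribution: Proposes TIE (Text-Guided Semantic Image Encoder), which generates image representations conditioned on the input text query and improves both VLM performance and inference efficiency.
Method: TIE trains the image encoder conditioned on the text query so that it attends to query-relevant regions, while requiring fewer image tiles (tokens).
Result: Across nine image-to-text benchmarks, TIE improves over conventional encoders by 1.5 points (1B scale) and 1.3 points (3B scale) on average, with gains of up to 6 points on tasks such as DocVQA and InfoVQA.
Insight: Text-conditioned training optimizes the encoder to capture key visual features and improves interpretability and query-specific grounding.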
Abstract: Image encoders, a fundamental component of vision-language models (VLMs), are typically pretrained independently before being aligned with a language model. This standard paradigm results in encoders that process images agnostically, without regard to the specific downstream task or text query. To address this limitation, we propose the Text-Guided Semantic Image Encoder (TIE), which generates image representations conditioned on the input text query. VLMs equipped with TIE outperform their conventional counterparts by +1.5 and +1.3 points on average across nine image-to-text benchmarks at the 1B and 3B scales, respectively, with gains reaching up to 6 points on tasks such as DocVQA and InfoVQA. Moreover, TIE-based VLMs attain superior performance while utilizing only half as many image tiles (tokens), resulting in notably improved inference efficiency. TIE also generalizes well with generic queries, indicating that text-conditioned training effectively optimizes the encoder to capture key visual features. Qualitative analysis confirms that TIE consistently attends to query-relevant regions, enhancing both interpretability and query-specific grounding.
[30] One Patch is All You Need: Joint Surface Material Reconstruction and Classification from Minimal Visual Cues
Sindhuja Penchala,Gavin Money,Gabriel Marques,Samuel Wood,Jessica Kirschman,Travis Atkison,Shahram Rahimi,Noorbakhsh Amiri Golilarz
Main category: cs.CV
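TL;DR: SMARC is a unified model for joint surface material reconstruction and classification that needs only a single contiguous patch covering 10% of the image. It combines a Partial Convolutional U-Net with a classification head and performs well under extremely sparse observations.
Details
Motivation: Existing methods rely on dense or full-scene observations and work poorly in constrained or partial-view settings. SMARC aims to solve the task from minimal visual input.
Contribution: Proposes SMARC, which jointly reconstructs and classifies materials from minimal input and reports state-of-the-art PSNR and classification accuracy.
Method: A Partial Convolutional U-Net combined with a classification head, supporting both spatial inpainting and semantic understanding.
Result: On the Touch and Go dataset, SMARC reaches a PSNR of 17.55 dB and a classification accuracy of 85.10%, outperforming five other models.
Insight: Partial convolution is advantageous for spatial reasoning under missing data and lays a foundation for minimal-vision surface understanding.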
Abstract: Understanding material surfaces from sparse visual cues is critical for applications in robotics, simulation, and material perception. However, most existing methods rely on dense or full-scene observations, limiting their effectiveness in constrained or partial-view environments. To address this challenge, we introduce SMARC, a unified model for Surface MAterial Reconstruction and Classification from minimal visual input. Given only a single contiguous patch covering 10% of the image, SMARC recognizes and reconstructs the full RGB surface while simultaneously classifying the material category. Our architecture combines a Partial Convolutional U-Net with a classification head, enabling both spatial inpainting and semantic understanding under extreme observation sparsity. We compared SMARC against five models, including convolutional autoencoders [17], Vision Transformer (ViT) [13], Masked Autoencoder (MAE) [5], Swin Transformer [9], and DETR [2], using the Touch and Go dataset [16] of real-world surface textures. SMARC achieves state-of-the-art results with a PSNR of 17.55 dB and a material classification accuracy of 85.10%. Our findings highlight the advantages of partial convolution in spatial reasoning under missing data and establish a strong foundation for minimal-vision surface understanding.
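Reconstruction quality above is reported as PSNR in dB. For reference, a minimal sketch of the standard definition follows, assuming flattened pixel lists and an 8-bit peak value of 255; the value range used in the paper is not stated in the abstract.

```python
import math

def psnr(reference, reconstruction, max_val=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

reference = [100.0, 120.0, 140.0, 160.0]
off_by_tenth_of_peak = [v + 25.5 for v in reference]  # uniform error of 10% of the peak value
value_db = psnr(reference, off_by_tenth_of_peak)
```

Because the scale is logarithmic, each tenfold reduction in mean squared error adds 10 dB.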
[31] LongVT: Incentivizing “Thinking with Long Videos” via Native Tool Calling
Zuhao Yang,Sudong Wang,Kaichen Zhang,Keming Wu,Sicong Leng,Yifan Zhang,Chengwei Qin,Shijian Lu,Xingxuan Li,Lidong Bing
Main category: cs.CV
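TL;DR: LongVT introduces a framework that encourages long-video reasoning through an interleaved multimodal chain of tool calls, addressing hallucination in long-video understanding, and releases the VideoSIAH data suite for training and evaluation.
Details
Motivation: Large multimodal models tend to hallucinate on long videos because the evidence is sparse and temporally dispersed. Inspired by how humans watch long videos (skimming globally, then examining relevant clips), the authors propose LongVT.
Contribution: 1. Proposes the LongVT framework, which performs long-video reasoning through a multimodal chain of tool calls; 2. releases the VideoSIAH data suite for training and evaluation; 3. designs a three-stage training strategy that improves long-video understanding.
Method: LongVT uses the model’s inherent temporal grounding ability as a native video cropping tool and reasons in a global-to-local loop until the answer is grounded in visual evidence. Training has three stages: tool-integrated cold-start supervised fine-tuning, agentic reinforcement learning, and agentic reinforcement fine-tuning.
Result: LongVT outperforms existing strong baselines on four long-video understanding and reasoning benchmarks.
Insight: Long-video understanding depends on combining global and local information, and a well-designed chain of tool calls can reduce hallucination.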
Abstract: Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables “Thinking with Long Videos” via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs’ inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our codes, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .
[32] Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models
Souradeep Dutta,Keshav Bulia,Neena S Nair
Main category: cs.CV
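TL;DR: This paper revisits KRISP, a knowledge-enhanced vision-language model, with a lightweight reproduction. It uncovers design flaws and implicit problems in the original model and examines the scalability and efficacy of knowledge-enhanced VQA architectures under resource constraints.
Details
Motivation: The original KRISP model is effective, but it was developed for industrial-scale training and is computationally demanding, which makes it hard to deploy on resource-constrained devices. The paper offers a lightweight reproduction to address this.
Contribution: 1. A lightweight KRISP reproduction with significantly fewer parameters; 2. identification of design flaws and implicit problems in the original model; 3. systematic ablation studies on the scalability and efficacy of knowledge-enhanced VQA architectures.
Method: KRISP is reimplemented with a simplified structure and fewer parameters, then evaluated on synthetic VQA data and the DAQUAR dataset.
Result: The reproduction reaches about 75% of the original model’s performance but can run on resource-constrained devices such as smartphones and AR-VR devices.
Insight: A lightweight, low-parameter design makes the model more usable on edge devices, and constraining outputs to the external knowledge graph domain is reported to prevent hallucinated answers.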
Abstract: Facebook AI Research introduced KRISP [4], which integrates structured external knowledge into pipelines for vision-language reasoning. Despite its effectiveness, the original model was developed for industrial-scale training, is computationally demanding, and is tightly coupled to a large backbone. In this work, we reexamine KRISP from a different angle and offer a lightweight reproduction with significantly fewer parameters. Even though our replicated model reaches only about 75% of the original’s performance, the replication process uncovers a number of design flaws, real-world pitfalls, and implicit problems that were not fully covered in the original paper. We offer insights into the scalability and efficacy of knowledge-enhanced VQA architectures under resource constraints through systematic ablation studies, which include a proof-of-concept on synthetic VQA data and evaluation on the DAQUAR dataset. Our model, configured with a low parameter setup and constrained by the external knowledge graph domain, prevents AI hallucinations and generates outputs solely within that domain. Minimal parameters allow it to run on edge devices like smartphones and AR-VR, further improving offline visual reasoning.
[33] Intriguing Properties of Dynamic Sampling Networks
Dario Morle,Reid Zaffino
Main category: cs.CV
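TL;DR: This paper introduces a new operator called “warping” that unifies the analysis of dynamic sampling mechanisms. It uncovers an asymmetry between the forward and backward passes during training and argues that dynamic sampling networks are orthogonal to traditional convolutional operators.
Details
Motivation: Dynamic sampling mechanisms are widely used in deep learning, but their theoretical analysis has not been unified. This paper aims to unify existing methods through the “warping” operator and to analyze its statistical properties and training behavior. Contribution: 1. Proposes the “warping” operator, which unifies deformable convolutions, active convolutional units, and spatial transformer networks; 2. Reveals an asymmetry between the forward and backward passes; 3. Demonstrates that dynamic sampling operators are orthogonal to traditional convolutions; 4. Introduces a new loss landscape visualization method.
Method: 1. Develops and analyzes the “warping” operator to unify dynamic sampling methods; 2. Models the inputs as both IID variables and homogeneous random fields for statistical analysis; 3. Combines theoretical analysis with experiments to find the conditions for stable training of dynamic sampling networks; 4. Visualizes the loss landscape using gradient update information directly.
Result: Theoretical analysis and experiments indicate that the “warping” operator unifies dynamic sampling methods, exposes the training asymmetry, and yields conditions for stable training.
Insight: Dynamic sampling operators are orthogonal to traditional convolutions and form a distinct class of operators. The loss landscape visualization gives a more direct view of learning behavior.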
Abstract: Dynamic sampling mechanisms in deep learning architectures have demonstrated utility across many computer vision models, though the theoretical analysis of these structures has not yet been unified. In this paper we connect the various dynamic sampling methods by developing and analyzing a novel operator which generalizes existing methods, which we term “warping”. Warping provides a minimal implementation of dynamic sampling which is amenable to analysis, and can be used to reconstruct existing architectures including deformable convolutions, active convolutional units, and spatial transformer networks. Using our formalism, we provide statistical analysis of the operator by modeling the inputs as both IID variables and homogeneous random fields. Extending this analysis, we discover a unique asymmetry between the forward and backward pass of the model training. We demonstrate that these mechanisms represent an entirely different class of orthogonal operators to the traditional translationally invariant operators defined by convolutions. With a combination of theoretical analysis and empirical investigation, we find the conditions necessary to ensure stable training of dynamic sampling networks. In addition, statistical analysis of discretization effects are studied. Finally, we introduce a novel loss landscape visualization which utilizes gradient update information directly, to better understand learning behavior.
[34] Layer-Aware Video Composition via Split-then-Merge
Ozgur Kara,Yujia Chen,Ming-Hsuan Yang,James M. Rehg,Wen-Sheng Chu,Du Tran
Main category: cs.CV
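TL;DR: The paper proposes the Split-then-Merge (StM) framework, which splits videos into dynamic foreground and background layers and learns by self-composing them. This addresses data scarcity in generative video composition and leads to more realistic video generation.
Details
Motivation: Conventional methods rely on annotated datasets or handcrafted rules and struggle to learn complex compositional dynamics. StM addresses this by learning from unlabeled videos, improving control over the generated video. Contribution: 1. The StM framework, which learns dynamic video composition from separated layers; 2. A transformation-aware training pipeline with multi-layer fusion and augmentation; 3. An identity-preservation loss that maintains foreground fidelity.
Method: StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how subjects interact with scenes. Multi-layer fusion and augmentation, together with the identity-preservation loss, refine the generated results.
Result: Experiments show that StM outperforms existing methods on quantitative benchmarks and in human and VLLM-based qualitative evaluations.
Insight: Layered learning from unlabeled video offers a data-efficient route to generative video composition while also improving the realism of the results.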
Abstract: We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes a multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show StM outperforms SoTA methods in both quantitative benchmarks and in humans/VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io
[35] SPHINX: A Synthetic Environment for Visual Perception and Reasoning
Md Tanvirul Alam,Saksham Aggarwal,Justin Yang Chae,Nidhi Rastogi
Main category: cs.CV
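TL;DR: The paper introduces Sphinx, a synthetic environment that targets core cognitive primitives of visual perception and reasoning. It generates a benchmark spanning many task types, shows a gap between recent vision-language models and human performance, and shows that reinforcement learning improves model accuracy.
Details
Motivation: Existing datasets and evaluation environments for visual reasoning lack diversity and verifiability, which limits progress in multimodal reasoning. Contribution: 1) The Sphinx synthetic environment, which supports diverse visual reasoning tasks; 2) A benchmark covering 25 task types; 3) Evidence that reinforcement learning with verifiable rewards (RLVR) is effective.
Method: 1) Procedural generation of datasets covering multiple task types; 2) Reinforcement learning with verifiable rewards (RLVR) to improve model performance.
Result: GPT-5 reaches only 51.1% accuracy, well below human performance. RLVR substantially improves accuracy on Sphinx tasks and yields gains on external visual reasoning benchmarks.
Insight: Synthetic environments with verifiable rewards are a promising way to improve multimodal reasoning. Current LVLMs still show clear weaknesses on complex visual reasoning tasks.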
Abstract: We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
[36] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion
Samuele Dell’Erba,Andrew D. Bagdanov
Main category: cs.CV
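TL;DR: This paper proposes a training-free approach to diffusion priors: Optimization-based Visual Inversion (OVI) replaces the conventional trained diffusion prior network, substantially reducing compute cost and data requirements.
Details
Motivation: Text-to-image generation conventionally relies on a training-intensive diffusion prior network, which is computationally expensive and needs massive datasets. This paper explores a training-free, data-free alternative. Contribution: 1. Proposes Optimization-based Visual Inversion (OVI) as a training-free alternative to conventional diffusion priors; 2. Proposes two new constraints, a Mahalanobis-based loss and a Nearest-Neighbor loss, to regularize the optimization; 3. Reveals a flaw in the current evaluation benchmark T2I-CompBench++.
Method: A latent visual representation is initialized from random pseudo-tokens and iteratively optimized to maximize cosine similarity with the text embedding, with the Mahalanobis-based and Nearest-Neighbor losses constraining the optimization.
Result: Experiments on Kandinsky 2.2 show that constrained OVI, especially the Nearest-Neighbor variant, improves visual fidelity over the text-embedding baseline, with quantitative scores comparable to or higher than the state-of-the-art data-efficient prior.
Insight: Current text-to-image benchmarks may undervalue perceptual quality, and optimization-based visual inversion merits further investigation.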
Abstract: Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
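The optimization loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the latent is a plain vector rather than a set of pseudo-tokens, the text embedding is a toy vector, the gradient is derived by hand, and the Mahalanobis-based and Nearest-Neighbor constraints are omitted. The function names, step count, and learning rate are assumptions chosen for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_grad(v, t):
    """Gradient of cos(v, t) with respect to v: t/(|v||t|) - cos * v/|v|^2."""
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_t = math.sqrt(sum(b * b for b in t))
    c = cosine(v, t)
    return [b / (norm_v * norm_t) - c * a / (norm_v * norm_v) for a, b in zip(v, t)]


def optimize_latent(text_emb, steps=300, lr=0.5, seed=0):
    """Hypothetical stand-in for OVI: random init, then gradient ascent on cosine similarity."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in text_emb]  # random initialization
    for _ in range(steps):
        grad = cosine_grad(latent, text_emb)
        latent = [a + lr * g for a, g in zip(latent, grad)]  # ascent step
    return latent
```

In a real pipeline the optimized latent would be passed to the image decoder; here the only thing that can be checked is that `cosine(optimize_latent(t), t)` approaches 1 for a toy embedding `t`.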
[37] RefTr: Recurrent Refinement of Confluent Trajectories for 3D Vascular Tree Centerline Graphs
Roman Naeem,David Hagerman,Jennifer Alvén,Fredrik Kahl
Main category: cs.CV
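TL;DR: RefTr is a 3D image-to-graph model that generates centerline graphs of vascular trees by recurrently refining confluent trajectories, achieving high recall with fewer parameters.
Details
Motivation: In medical imaging, accurately detecting the centerlines of vascular trees with the correct topology is critical for diagnosis and treatment, and missing small branches can have serious consequences. Contribution: Proposes RefTr, which uses a Producer-Refiner architecture and recurrent trajectory refinement to substantially reduce the parameter count while improving recall.
Method: A Producer-Refiner architecture based on a Transformer decoder: the Producer proposes initial confluent trajectories, and the Refiner recurrently refines them to produce the final centerline graph.
Result: Across multiple public datasets, RefTr achieves higher recall than prior work, with a 2.4x reduction in decoder parameters and faster inference.
Insight: The confluent trajectory representation and recurrent refinement scheme preserve a valid tree topology while reusing the same Refiner block across steps, which suits vascular tree analysis in medical imaging.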
Abstract: Tubular trees, such as blood vessels and lung airways, are essential for material transport within the human body. Accurately detecting their centerlines with correct tree topology is critical for clinical tasks such as diagnosis, treatment planning, and surgical navigation. In these applications, maintaining high recall is crucial, as missing small branches can result in fatal mistakes caused by incomplete assessments or undetected abnormalities. We present RefTr, a 3D image-to-graph model for centerline generation of vascular trees via recurrent refinement of confluent trajectories. RefTr uses a Producer-Refiner architecture based on a Transformer decoder, where the Producer proposes a set of initial confluent trajectories that are recurrently refined by the Refiner to produce final trajectories, which forms the centerline graph. The confluent trajectory representation enables refinement of complete trajectories while explicitly enforcing a valid tree topology. The recurrent refinement scheme improves precision and reuses the same Refiner block across multiple steps, yielding a 2.4x reduction in decoder parameters compared to previous SOTA. We also introduce an efficient non-maximum suppression algorithm for spatial tree graphs to merge duplicate branches and boost precision. Across multiple public centerline datasets, RefTr achieves superior recall and comparable precision to previous SOTA, while offering faster inference and substantially fewer parameters, demonstrating its potential as a new state-of-the-art framework for vascular tree analysis in 3D medical imaging.
[38] MODEST: Multi-Optics Depth-of-Field Stereo Dataset
Nisarg K. Trivedi,Vinayak A. Belludi,Li-Yun Wang,Pardis Taghavi,Dante Lok
Main category: cs.CV
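TL;DR: The paper presents MODEST, the first high-resolution (5472×3648 px) real stereo DSLR dataset. Its 18,000 images systematically cover optical configurations across focal lengths and apertures, for depth estimation and depth-of-field rendering tasks.
Details
Motivation: Research on depth estimation and depth-of-field rendering is constrained by the lack of large-scale, high-fidelity real stereo datasets, which limits how well models generalize to real scenes. Contribution: The main contribution is a high-quality real stereo dataset covering 50 optical configurations, supporting research on depth estimation, depth-of-field rendering, and related tasks.
Method: Complex scenes are captured with two identical camera assemblies across focal lengths (28-70mm) and apertures (f/2.8-f/22) to build the dataset systematically.
Result: The dataset exposes the difficulties that current monocular and stereo depth estimation methods face under real optical conditions.
Insight: MODEST helps bridge the gap between synthetic data and real optics, and can support progress on tasks such as 3D reconstruction and novel view synthesis.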
Abstract: Reliable depth estimation under real optical conditions remains a core challenge for camera vision in systems such as autonomous robotics and augmented reality. Despite recent progress in depth estimation and depth-of-field rendering, research remains constrained by the lack of large-scale, high-fidelity, real stereo DSLR datasets, limiting real-world generalization and evaluation of models trained on synthetic data as shown extensively in literature. We present the first high-resolution (5472×3648 px) stereo DSLR dataset with 18000 images, systematically varying focal length and aperture across complex real scenes and capturing the optical realism and complexity of professional camera systems. For 9 scenes with varying scene complexity, lighting and background, images are captured with two identical camera assemblies at 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), spanning 50 optical configurations in 2000 images per scene. This full-range optics coverage enables controlled analysis of geometric and optical effects for monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D scene reconstruction and novel view synthesis. Each focal configuration has a dedicated calibration image set, supporting evaluation of classical and learning based methods for intrinsic and extrinsic calibration. The dataset features challenging visual elements such as multi-scale optical illusions, reflective surfaces, mirrors, transparent glass walls, fine-grained details, and natural / artificial ambient light variations. This work attempts to bridge the realism gap between synthetic training data and real camera optics, and demonstrates challenges with the current state-of-the-art monocular, stereo depth and depth-of-field methods. We release the dataset, calibration files, and evaluation code to support reproducible research on real-world optical generalization.
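The dataset figures quoted in this abstract are mutually consistent, which a few lines of arithmetic confirm. The variable names are illustrative. The per-configuration count is only what the quoted totals imply, since the abstract does not say how images are divided between the two cameras or whether calibration images are counted.

```python
focal_lengths = 10        # 28-70mm, per the abstract
apertures = 5             # f/2.8-f/22, per the abstract
scenes = 9
images_per_scene = 2000

optical_configs = focal_lengths * apertures                        # 50
total_images = scenes * images_per_scene                           # 18000
images_per_config_per_scene = images_per_scene // optical_configs  # 40 (implied)

print(optical_configs, total_images, images_per_config_per_scene)
```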
[39] Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries
Sree Bhattacharyya,Yaman Kumar Singla,Sudhir Yarram,Somesh Kumar Singh,Harini S,James Z. Wang
Main category: cs.CV
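TL;DR: The paper proposes an unsupervised approach that uses tip-of-the-tongue (ToT) retrieval queries to build a large-scale dataset for visual content memorability. Large vision-language models fine-tuned on it perform strongly on recall generation and ToT retrieval.
Details
Motivation: Research on visual content memorability faces high annotation costs and limited dataset diversity. Most datasets provide only aggregate memorability scores and miss the nuanced signals found in natural, open-ended recall descriptions. Contribution: 1) Introduces the first large-scale unsupervised visual memorability dataset (over 82,000 videos); 2) Proposes modeling memorability signals from ToT retrieval queries; 3) Develops models that outperform existing ones on recall generation and ToT retrieval.
Method: The dataset is built from ToT retrieval queries on platforms such as Reddit. Large vision-language models are fine-tuned for recall generation, and a contrastive training strategy is used for multimodal ToT retrieval.
Result: The fine-tuned models outperform GPT-4o at generating open-ended memorability descriptions, and the work delivers the first model capable of multimodal ToT retrieval.
Insight: Unsupervised data such as ToT queries can capture memorability signals for visual content, offering a new data source and modeling direction for memorability research.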
Abstract: Visual content memorability has intrigued the scientific community for decades, with applications ranging widely, from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the expensive process of collecting memorability annotations from humans. This limits the diversity and scalability of datasets for modeling visual content memorability. Most existing datasets are limited to collecting aggregate memorability scores for visual content, not capturing the nuanced memorability signals present in natural, open-ended recall descriptions. In this work, we introduce the first large-scale unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos, accompanied by descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit. We demonstrate that our unsupervised dataset provides rich signals for two memorability-related tasks: recall generation and ToT retrieval. Large vision-language models fine-tuned on our dataset outperform state-of-the-art models such as GPT-4o in generating open-ended memorability descriptions for visual content. We also employ a contrastive training strategy to create the first model capable of performing multimodal ToT retrieval. Our dataset and models present a novel direction, facilitating progress in visual content memorability research.
[40] Estimating Fog Parameters from a Sequence of Stereo Images
Yining Ding,João F. C. Mota,Andrew M. Wallace,Sen Wang
Main category: cs.CV
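TL;DR: This paper proposes a method that dynamically estimates fog parameters from a sequence of stereo foggy images, estimating all parameters simultaneously to avoid error propagation. It also introduces SDIRF, a first-of-its-kind real-fog dataset that includes lab-calibrated photometric camera parameters.
Details
Motivation: Existing methods usually estimate fog parameters sequentially, which is prone to error propagation. Real-world fog is also often globally inhomogeneous, which calls for a more robust estimation method. Contribution: 1. A method that estimates all fog parameters simultaneously; 2. SDIRF, a high-quality dataset of real foggy road scenes; 3. Public release of the code and dataset.
Method: A novel optimization problem estimates all fog parameters simultaneously under the assumption that fog is only locally homogeneous, and the parameters are updated dynamically over the stereo image sequence.
Result: Experiments on synthetic data and SDIRF sequences show that the method outperforms prior methods and adapts better to real, inhomogeneous fog.
Insight: The local-homogeneity assumption and simultaneous optimization make fog parameter estimation more robust, and SDIRF is a useful resource for research on visual perception in fog.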
Abstract: We propose a method which, given a sequence of stereo foggy images, estimates the parameters of a fog model and updates them dynamically. In contrast with previous approaches, which estimate the parameters sequentially and thus are prone to error propagation, our algorithm estimates all the parameters simultaneously by solving a novel optimisation problem. By assuming that fog is only locally homogeneous, our method effectively handles real-world fog, which is often globally inhomogeneous. The proposed algorithm can be easily used as an add-on module in existing visual Simultaneous Localisation and Mapping (SLAM) or odometry systems in the presence of fog. In order to assess our method, we also created a new dataset, the Stereo Driving In Real Fog (SDIRF), consisting of high-quality, consecutive stereo frames of real, foggy road scenes under a variety of visibility conditions, totalling over 40 minutes and 34k frames. As a first-of-its-kind, SDIRF contains the camera’s photometric parameters calibrated in a lab environment, which is a prerequisite for correctly applying the atmospheric scattering model to foggy images. The dataset also includes the counterpart clear data of the same routes recorded in overcast weather, which is useful for companion work in image defogging and depth reconstruction. We conducted extensive experiments using both synthetic foggy data and real foggy sequences from SDIRF to demonstrate the superiority of the proposed algorithm over prior methods. Our method not only produces the most accurate estimates on synthetic data, but also adapts better to real fog. We make our code and SDIRF publicly available to the community at https://github.com/SenseRoboticsLab/estimating-fog-parameters with the aim of advancing the research on visual perception in fog.
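The abstract refers to the atmospheric scattering model without writing it out. A minimal sketch of the commonly used form (often attributed to Koschmieder) is given below, assuming a single scattering coefficient `beta` and a constant atmospheric light `airlight`; the paper's exact parameterization may differ. The stereo depth relation is the standard pinhole one, and all names are illustrative.

```python
import math


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Standard rectified-stereo relation: depth = focal length * baseline / disparity."""
    return focal_px * baseline_m / disparity_px


def foggy_intensity(clear, depth_m, beta, airlight):
    """Assumed scattering model: I = J * t + A * (1 - t), with transmission t = exp(-beta * d)."""
    transmission = math.exp(-beta * depth_m)
    return clear * transmission + airlight * (1.0 - transmission)
```

With `beta = 0` the observed intensity equals the clear intensity, and as depth grows the intensity tends to the atmospheric light. This is why pairs of (stereo depth, observed intensity) carry information about the fog parameters.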
[41] V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
Jiancheng Pan,Runze Wang,Tianwen Qian,Mohammad Mahdi,Yanwei Fu,Xiangyang Xue,Xiaomeng Huang,Luc Van Gool,Danda Pani Paudel,Yuqian Fu
Main category: cs.CV
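TL;DR: V^2-SAM is a unified cross-view object correspondence framework that combines SAM2 with multi-prompt experts to handle the viewpoint and appearance variations of cross-view correspondence.
Details
Motivation: Cross-view object correspondence, such as ego-exo object correspondence, is highly challenging because of drastic viewpoint and appearance variations, and existing segmentation models such as SAM2 are hard to apply directly. Contribution: 1) Proposes V^2-SAM, which adapts SAM2 from single-view segmentation to cross-view correspondence; 2) Designs two complementary prompt generators, V^2-Anchor and V^2-Visual; 3) Introduces a multi-expert design and a Post-hoc Cyclic Consistency Selector (PCCS) that picks the most reliable expert.
Method: 1) V^2-Anchor builds geometry-aware correspondences on DINOv3 features; 2) V^2-Visual aligns ego-exo representations through a novel visual prompt matcher; 3) The multi-expert design with PCCS selects the most reliable expert.
Result: State-of-the-art performance on the Ego-Exo4D, DAVIS-2017, and HANDAL-X datasets.
Insight: Combining complementary geometry-aware and visual prompts with post-hoc cyclic consistency selection makes cross-view object correspondence more robust.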
Abstract: Cross-view object correspondence, exemplified by the representative task of ego-exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, non-trivial to apply directly. To address this, we present V^2-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V^2-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios, while the Cross-View Visual Prompt Generator (V^2-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego-exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V^2-SAM, achieving new state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence).
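The idea behind selecting an expert by cyclic consistency can be sketched as follows. This is a toy illustration under stated assumptions, not the PCCS described in the paper: masks are sets of pixel coordinates, each expert is a hypothetical pair of callables that map a mask to the other view and back, and the consistency score is the IoU between the original mask and its round-trip reconstruction.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 1.0


def select_expert(query_mask, experts):
    """Pick the expert whose forward-then-backward prediction best recovers the query mask.

    `experts` maps a name to a (forward, backward) pair of callables (hypothetical interface).
    Returns (name, cycle_score, forward_prediction).
    """
    best_name, best_score, best_pred = None, -1.0, None
    for name, (forward, backward) in experts.items():
        pred = forward(query_mask)        # source view -> target view
        recon = backward(pred)            # target view -> source view
        score = iou(query_mask, recon)    # cyclic consistency
        if score > best_score:
            best_name, best_score, best_pred = name, score, pred
    return best_name, best_score, best_pred
```

The selection needs no ground truth in the target view, which is what makes a post-hoc selector usable at inference time.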
[42] Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation
Taehoon Kim,Henry Gouk,Timothy Hospedales
Main category: cs.CV
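TL;DR: The paper proposes Null-Text Test-Time Alignment (Null-TTA), which aligns diffusion models by optimizing the unconditional embedding during inference. Because it avoids manipulating latent or noise variables, it prevents reward hacking and preserves semantic consistency.
Details
Motivation: Existing test-time alignment methods tend to either under-optimize or over-optimize (reward hack) the target reward function, which leads to semantic inconsistency or the exploitation of non-semantic noise patterns. Contribution: Proposes Null-TTA, which aligns diffusion models semantically by optimizing the unconditional embedding, avoids reward hacking, and steers the generative distribution directly without updating model parameters.
Method: Null-TTA optimizes the unconditional embedding in classifier-free guidance, exploiting the structured semantic nature of the text embedding space to achieve semantic alignment.
Result: Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalization.
Insight: Semantic-space optimization is an effective and principled new paradigm for test-time alignment, because it keeps the alignment on a semantically coherent manifold.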
Abstract: Test-time alignment (TTA) aims to adapt models to specific rewards during inference. However, existing methods tend to either under-optimise or over-optimise (reward hack) the target reward function. We propose Null-Text Test-Time Alignment (Null-TTA), which aligns diffusion models by optimising the unconditional embedding in classifier-free guidance, rather than manipulating latent or noise variables. Due to the structured semantic nature of the text embedding space, this ensures alignment occurs on a semantically coherent manifold and prevents reward hacking (exploiting non-semantic noise patterns to improve the reward). Since the unconditional embedding in classifier-free guidance serves as the anchor for the model’s generative distribution, Null-TTA directly steers model’s generative distribution towards the target reward rather than just adjusting the samples, even without updating model parameters. Thanks to these desirable properties, we show that Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalisation. This establishes semantic-space optimisation as an effective and principled novel paradigm for TTA.
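For context, the standard classifier-free guidance combination shows where the unconditional (null-text) prediction enters. The sketch below covers only that standard formula on toy vectors. It does not reproduce Null-TTA's optimization of the embedding, and the function name is illustrative.

```python
def cfg_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Because `eps_uncond` is the anchor that every guided prediction is measured from, changing the embedding that produces it shifts all guided outputs. This is consistent with the abstract's point that optimizing the unconditional embedding steers the generative distribution rather than individual samples.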
[43] GaINeR: Geometry-Aware Implicit Network Representation
Weronika Jakubowska,Mikołaj Zieliński,Rafał Tobiasz,Krzysztof Byrski,Maciej Zięba,Dominik Belter,Przemysław Spurek
Main category: cs.CV
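TL;DR: GaINeR is a geometry-aware implicit network representation that combines trainable Gaussian distributions with a neural network, addressing the lack of explicit geometric structure and local editing ability in traditional INRs.
Details
Motivation: Traditional implicit neural representations (INRs) lack explicit geometric structure when modeling continuous 2D images, which limits their use in dynamic or interactive settings. GaINeR aims to improve the geometric awareness and local editing flexibility of INRs. Contribution: Proposes the GaINeR framework, which combines Gaussian distributions with a neural network to give a continuous image representation with explicit geometric structure, supporting local editing and integration with physical simulation.
Method: For each image coordinate, the model retrieves the K nearest Gaussians, aggregates distance-weighted embeddings, and predicts the RGB value with a neural network.
Result: GaINeR maintains high-fidelity image reconstruction while providing interpretable geometric structure and flexible local editing.
Insight: Introducing Gaussian distributions strengthens the geometric expressiveness of INRs and opens a new direction for dynamic and interactive image processing.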
Abstract: Implicit Neural Representations (INRs) have become an essential tool for modeling continuous 2D images, enabling high-fidelity reconstruction, super-resolution, and compression. Popular architectures such as SIREN, WIRE, and FINER demonstrate the potential of INR for capturing fine-grained image details. However, traditional INRs often lack explicit geometric structure and have limited capabilities for local editing or integration with physical simulation, restricting their applicability in dynamic or interactive settings. To address these limitations, we propose GaINeR: Geometry-Aware Implicit Network Representation, a novel framework for 2D images that combines trainable Gaussian distributions with a neural network-based INR. For a given image coordinate, the model retrieves the K nearest Gaussians, aggregates distance-weighted embeddings, and predicts the RGB value via a neural network. This design enables continuous image representation, interpretable geometric structure, and flexible local editing, providing a foundation for physically aware and interactive image manipulation. The official implementation of our method is publicly available at https://github.com/WJakubowska/GaINeR.
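The per-coordinate query described in this abstract (retrieve the K nearest Gaussians, aggregate distance-weighted embeddings, predict RGB) can be sketched in a few lines. The sketch rests on assumptions the abstract does not confirm: each Gaussian is reduced to a 2D mean and an embedding, weights come from an isotropic Gaussian kernel with a shared `sigma`, and the network is replaced by a single linear layer with a sigmoid. All names are hypothetical.

```python
import math


def k_nearest(point, gaussians, k):
    """Return the k Gaussians whose means are closest to the query point."""
    return sorted(gaussians, key=lambda g: math.dist(point, g["mean"]))[:k]


def aggregate_embedding(point, gaussians, k=2, sigma=1.0):
    """Distance-weighted average of the k nearest embeddings (assumed kernel weighting)."""
    neighbors = k_nearest(point, gaussians, k)
    weights = [math.exp(-math.dist(point, g["mean"]) ** 2 / (2 * sigma ** 2))
               for g in neighbors]
    total = sum(weights)  # can underflow to 0 for very distant points; not handled here
    dim = len(neighbors[0]["embedding"])
    return [sum(w * g["embedding"][i] for w, g in zip(weights, neighbors)) / total
            for i in range(dim)]


def predict_rgb(embedding, weight_rows, bias):
    """Stand-in for the neural network: one linear layer followed by a sigmoid per channel."""
    rgb = []
    for row, b in zip(weight_rows, bias):
        z = sum(w * e for w, e in zip(row, embedding)) + b
        rgb.append(1.0 / (1.0 + math.exp(-z)))
    return rgb
```

Because the color at a coordinate depends only on nearby Gaussians, moving or editing one Gaussian changes the image locally, which is the property the abstract highlights.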
[44] Smooth regularization for efficient video recognition
Gil Goldman,Raja Giryes,Mahadev Satyanarayanan
Main category: cs.CV
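TL;DR: This paper proposes a smooth regularization technique that models changes in intermediate-layer embeddings as a Gaussian random walk, giving lightweight video recognition models a strong temporal inductive bias and improving recognition accuracy.
Details
Motivation: Video recognition models need to stay temporally coherent across consecutive frames, but existing lightweight architectures struggle to capture complex spatiotemporal dynamics. This paper introduces smooth regularization to reduce abrupt shifts between frame representations and thereby improve performance. Contribution: The main contribution is a smooth regularization method driven by a Gaussian Random Walk (GRW), which improves the accuracy of lightweight video recognition models, with gains of 3.8% to 6.4% on Kinetics-600.
Method: The method models the changes in intermediate-layer embeddings of consecutive frames as a Gaussian random walk, which penalizes abrupt representational shifts and promotes smooth, low-acceleration solutions. The technique is particularly suited to lightweight architectures and helps them capture the temporal dynamics of video.
Result: On Kinetics-600, the MoViNets model family improves on the prior state of the art by 3.8% to 6.1% within their respective FLOP constraints, while MobileNetV3 and the MoViNets-Stream family gain 4.9% to 6.4% over prior state-of-the-art models with comparable memory footprints.
Insight: By modeling temporal coherence explicitly, smooth regularization substantially improves lightweight models, which suggests that capturing temporal dynamics efficiently is central to video tasks.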
Abstract: We propose a smooth regularization technique that instills a strong temporal inductive bias in video recognition models, particularly benefiting lightweight architectures. Our method encourages smoothness in the intermediate-layer embeddings of consecutive frames by modeling their changes as a Gaussian Random Walk (GRW). This penalizes abrupt representational shifts, thereby promoting low-acceleration solutions that better align with the natural temporal coherence inherent in videos. By leveraging this enforced smoothness, lightweight models can more effectively capture complex temporal dynamics. Applied to such models, our technique yields a 3.8% to 6.4% accuracy improvement on Kinetics-600. Notably, the MoViNets model family trained with our smooth regularization improves the current state of the art by 3.8% to 6.1% within their respective FLOP constraints, while MobileNetV3 and the MoViNets-Stream family achieve gains of 4.9% to 6.4% over prior state-of-the-art models with comparable memory footprints. Our code and models are available at https://github.com/gilgoldm/grw-smoothing.
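One way to read the "low-acceleration" description is as a penalty on second-order differences of the frame embeddings. If the frame-to-frame changes follow a zero-mean, isotropic Gaussian random walk, their increments are Gaussian, and the negative log-likelihood is proportional to the sum of squared second differences. The sketch below implements that reading only. It is an interpretation, not the authors' loss, and the function name is hypothetical.

```python
def smoothness_penalty(embeddings):
    """Mean squared second difference e[t+1] - 2*e[t] + e[t-1] over a sequence of frame embeddings."""
    if len(embeddings) < 3:
        return 0.0
    total, count = 0.0, 0
    for prev, cur, nxt in zip(embeddings, embeddings[1:], embeddings[2:]):
        for p, c, n in zip(prev, cur, nxt):
            accel = n - 2.0 * c + p   # discrete acceleration of one embedding coordinate
            total += accel * accel
            count += 1
    return total / count
```

A trajectory that moves at constant velocity in embedding space incurs zero penalty, while one that jumps back and forth is penalized. In training, a term like this would be added to the classification loss with a weighting coefficient.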
[45] Open Vocabulary Compositional Explanations for Neuron Alignment
Biagio La Rosa,Leilani H. Gilpin
Main category: cs.CV
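TL;DR: The paper proposes an open-vocabulary compositional explanation framework for the vision domain that captures the alignment between neurons and human knowledge. It removes the dependence on human-annotated data, improving flexibility and applicability.
Details
Motivation: Conventional compositional explanation methods rely on human-annotated datasets, which restricts their scope and flexibility. The paper uses open-vocabulary segmentation models to enable neuron alignment analysis for arbitrary concepts and datasets. Contribution: 1. Proposes an open-vocabulary compositional explanation framework that supports queries for arbitrary concepts; 2. Uses masks generated by open-vocabulary semantic segmentation models to compute explanations; 3. Compares the framework with existing methods on quantitative metrics and human interpretability.
Method: 1. Users specify arbitrary concepts; 2. Open-vocabulary semantic segmentation models generate masks; 3. Compositional explanations are derived from these masks.
Result: The paper shows the framework's advantages in flexibility and task adaptability, and compares it with previous methods both quantitatively and qualitatively.
Insight: Replacing human-annotated data with model-generated masks widens the range of applicable concepts and datasets and makes explanations more flexible, allowing neurons to be probed for alignment with open-vocabulary concepts.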
Abstract: Neurons are the fundamental building blocks of deep neural networks, and their interconnections allow AI to achieve unprecedented results. Motivated by the goal of understanding how neurons encode information, compositional explanations leverage logical relationships between concepts to express the spatial alignment between neuron activations and human knowledge. However, these explanations rely on human-annotated datasets, restricting their applicability to specific domains and predefined concepts. This paper addresses this limitation by introducing a framework for the vision domain that allows users to probe neurons for arbitrary concepts and datasets. Specifically, the framework leverages masks generated by open vocabulary semantic segmentation to compute open vocabulary compositional explanations. The proposed framework consists of three steps: specifying arbitrary concepts, generating semantic segmentation masks using open vocabulary models, and deriving compositional explanations from these masks. The paper compares the proposed framework with previous methods for computing compositional explanations both in terms of quantitative metrics and human interpretability, analyzes the differences in explanations when shifting from human-annotated data to model-annotated data, and showcases the additional capabilities provided by the framework in terms of flexibility of the explanations with respect to the tasks and properties of interest.
[46] UruDendro4: A Benchmark Dataset for Automatic Tree-Ring Detection in Cross-Section Images of Pinus taeda L
Henry Marichal,Joaquin Blanco,Diego Passarella,Gregory Randall
Main category: cs.CV
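TL;DR: This paper introduces UruDendro4, a new benchmark dataset for automatic tree-ring detection in cross-section images of Pinus taeda L. The dataset contains 102 manually annotated images, includes samples taken at multiple stem heights to enable volumetric modeling of annual growth, and comes with a performance baseline for automatic detection.
Details
Motivation: Manual tree-ring measurement is time-consuming and imprecise, which has motivated the development of automated algorithms and datasets. Public datasets remain scarce, however, and do not meet research needs. Contribution: 1. Provides the UruDendro4 dataset, with multi-height samples that support volumetric modeling of annual growth; 2. Gives a performance baseline for automatic detection using existing methods; 3. Shows that training with this dataset improves model generalization.
Method: 1. Dataset construction: 102 cross-section images of Pinus taeda L. are collected and manually annotated; 2. Performance evaluation: ring detection is run with methods including DeepCS-TRD, reporting metrics such as mean Average Precision and mean Average Recall; 3. Ablation experiments: these validate the final parameter configuration.
Result: DeepCS-TRD performs best, with a mean Average Precision of 0.838, a mean Average Recall of 0.782, and an Adapted Rand Error score of 0.084. Experiments show that training with this dataset improves model generalization.
Insight: 1. A dataset with multi-height samples supports volumetric modeling of annual growth; 2. State-of-the-art methods such as DeepCS-TRD provide a strong baseline for automatic detection; 3. Dataset diversity matters for model generalization.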
Abstract: Tree-ring growth represents the annual wood increment for a tree, and quantifying it allows researchers to assess which silvicultural practices are best suited for each species. Manual measurement of this growth is time-consuming and often imprecise, as it is typically performed along 4 to 8 radial directions on a cross-sectional disc. In recent years, automated algorithms and datasets have emerged to enhance accuracy and automate the delineation of annual rings in cross-sectional images. To address the scarcity of wood cross-section data, we introduce the UruDendro4 dataset, a collection of 102 image samples of Pinus taeda L., each manually annotated with annual growth rings. Unlike existing public datasets, UruDendro4 includes samples extracted at multiple heights along the stem, allowing for the volumetric modeling of annual growth using manually delineated rings. This dataset (images and annotations) allows the development of volumetric models for annual wood estimation based on cross-sectional imagery. Additionally, we provide a performance baseline for automatic ring detection on this dataset using state-of-the-art methods. The highest performance was achieved by the DeepCS-TRD method, with a mean Average Precision of 0.838, a mean Average Recall of 0.782, and an Adapted Rand Error score of 0.084. A series of ablation experiments were conducted to empirically validate the final parameter configuration. Furthermore, we empirically demonstrate that training a learning model including this dataset improves the model’s generalization in the tree-ring detection task.
[47] BUSTR: Breast Ultrasound Text Reporting with a Descriptor-Aware Vision-Language Model
Rawa Mohammed,Mina Attin,Bryar Shareef
Main category: cs.CV
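TL;DR: BUSTR is a descriptor-aware multitask vision-language model that generates breast ultrasound reports without paired image-report data. By combining structured descriptors with a multitask loss, it improves report generation metrics and clinical efficacy across multiple datasets.
Details
Motivation: Automated report generation for breast ultrasound is limited by the scarcity of paired image-report data and by the risk of hallucinations from large language models. Contribution: BUSTR proposes a multitask framework that needs no paired supervision and generates reports through structured descriptors and aligned visual representations.
Method: The model learns descriptor-aware visual representations with a multi-head Swin encoder trained using a multitask loss, and aligns visual and textual tokens through a dual-level objective.
Result: On the BrEaST and BUS-BRA datasets, BUSTR improves both report generation metrics and clinical efficacy metrics.
Insight: Combining structured descriptors with vision-text alignment can improve report quality while removing the dependence on paired data.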
Abstract: Automated radiology report generation (RRG) for breast ultrasound (BUS) is limited by the lack of paired image-report datasets and the risk of hallucinations from large language models. We propose BUSTR, a multitask vision-language framework that generates BUS reports without requiring paired image-report supervision. BUSTR constructs reports from structured descriptors (e.g., BI-RADS, pathology, histology) and radiomics features, learns descriptor-aware visual representations with a multi-head Swin encoder trained using a multitask loss over dataset-specific descriptor sets, and aligns visual and textual tokens via a dual-level objective that combines token-level cross-entropy with a cosine-similarity alignment loss between input and output representations. We evaluate BUSTR on two public BUS datasets, BrEaST and BUS-BRA, which differ in size and available descriptors. Across both datasets, BUSTR consistently improves standard natural language generation metrics and clinical efficacy metrics, particularly for key targets such as BI-RADS category and pathology. Our results show that this descriptor-aware vision model, trained with a combined token-level and alignment loss, improves both automatic report metrics and clinical efficacy without requiring paired image-report data. The source code can be found at https://github.com/AAR-UNLV/BUSTR
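The dual-level objective mentioned in this abstract combines a token-level cross-entropy with a cosine-similarity alignment loss. A minimal sketch of how two such terms can be combined follows. The weighting factor `lam`, the use of `1 - cosine` as the alignment term, and all function names are assumptions, since the abstract does not give the exact formulation.

```python
import math


def token_cross_entropy(logits_per_token, target_ids):
    """Average cross-entropy over tokens, computed from raw logits with a stable log-sum-exp."""
    total = 0.0
    for logits, target in zip(logits_per_token, target_ids):
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[target]
    return total / len(target_ids)


def cosine_alignment_loss(u, v):
    """1 - cosine similarity: 0 when the two representations point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def dual_level_loss(logits_per_token, target_ids, in_repr, out_repr, lam=1.0):
    """Token-level term plus a weighted alignment term (the weighting is an assumption)."""
    return (token_cross_entropy(logits_per_token, target_ids)
            + lam * cosine_alignment_loss(in_repr, out_repr))
```

The first term supervises each generated token, while the second pulls the input and output representations toward the same direction regardless of their magnitude.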
[48] Beyond Realism: Learning the Art of Expressive Composition with StickerNet
Haoming Lu,David Kocharian,Humphrey Shi
Main category: cs.CV
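TL;DR: The paper proposes a new image composition task, expressive composition, which values artistic expression over realism. It presents the two-stage framework StickerNet, trained on a dataset of real editing actions, which performs well in user studies.
Details
Motivation: Image composition research has focused on visual realism and semantic plausibility, but in real content creation users often prefer results that are artistic and socially engaging. The paper defines the expressive composition task based on observed user behavior. Contribution: Defines the expressive composition task, proposes the StickerNet framework, which learns directly from real editing data, and validates it through user studies and quantitative evaluations.
Method: StickerNet has two stages: it first determines the composition type, then predicts placement parameters such as opacity, mask, location, and scale. The dataset comes from 1.8 million real editing actions.
Result: StickerNet outperforms common baselines in user studies and quantitative evaluations and closely matches human placement behavior.
Insight: The paper argues that visual understanding should attend to expressiveness and user intent, not only to the traditional goal of realism.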
Abstract: As a widely used operation in image editing workflows, image composition has traditionally been studied with a focus on achieving visual realism and semantic plausibility. However, in practical editing scenarios of the modern content creation landscape, many compositions are not intended to preserve realism. Instead, users of online platforms motivated by gaining community recognition often aim to create content that is more artistic, playful, or socially engaging. Taking inspiration from this observation, we define the expressive composition task, a new formulation of image composition that embraces stylistic diversity and looser placement logic, reflecting how users edit images on real-world creative platforms. To address this underexplored problem, we present StickerNet, a two-stage framework that first determines the composition type, then predicts placement parameters such as opacity, mask, location, and scale accordingly. Unlike prior work that constructs datasets by simulating object placements on real images, we directly build our dataset from 1.8 million editing actions collected on an anonymous online visual creation and editing platform, each reflecting user-community validated placement decisions. This grounding in authentic editing behavior ensures strong alignment between task definition and training supervision. User studies and quantitative evaluations show that StickerNet outperforms common baselines and closely matches human placement behavior, demonstrating the effectiveness of learning from real-world editing patterns despite the inherent ambiguity of the task. This work introduces a new direction in visual understanding that emphasizes expressiveness and user intent over realism.
[49] TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs
Md Adnan Arefeen,Biplob Debnath,Srimat Chakradhar
Main category: cs.CV
TL;DR: TrafficLens is an algorithm tailored to multi-camera traffic intersections. It processes cameras sequentially and exploits their overlapping coverage: VLMs are applied iteratively with varying token limits, and each camera's output serves as the prompt for the next, which yields detailed textual descriptions quickly while reducing processing time.
Details
Motivation: Despite advances in LLMs and RAG systems, analyzing large volumes of multi-camera traffic video remains slow, because the video must first be converted to text by a VLM. Contribution: The paper proposes TrafficLens, a tailored algorithm that applies VLMs sequentially across cameras to generate detailed textual descriptions quickly, and an object-level similarity detector that bypasses redundant VLM invocations.
Method: TrafficLens employs a sequential approach that uses the overlapping coverage areas of multiple cameras, iteratively applying VLMs with varying token limits and using previous outputs as prompts for subsequent cameras.
Result: Experiments on real-world datasets show that TrafficLens reduces video-to-text conversion time by up to $4\times$ while maintaining information accuracy.
Insight: Combining LLM-based analysis with a video-to-text pipeline that avoids duplicate work across overlapping cameras reduces processing time and redundant VLM invocations.
Abstract: Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. Analyzing such huge video data requires advanced analytical tools. While Large Language Models (LLMs) like ChatGPT, equipped with retrieval-augmented generation (RAG) systems, excel in text-based tasks, integrating them into traffic video analysis demands converting video data into text using a Vision-Language Model (VLM), which is time-consuming and delays the timely utilization of traffic videos for generating insights and investigating incidents. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. TrafficLens employs a sequential approach, utilizing overlapping coverage areas of cameras. It iteratively applies VLMs with varying token limits, using previous outputs as prompts for subsequent cameras, enabling rapid generation of detailed textual descriptions while reducing processing time. Additionally, TrafficLens intelligently bypasses redundant VLM invocations through an object-level similarity detector. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to $4\times$ while maintaining information accuracy.
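The sequential scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `stub_vlm`, `describe_intersection`, the Jaccard similarity over detected object labels, the threshold, and both token limits are all hypothetical choices made for the example.

```python
def jaccard(a, b):
    """Set overlap between two lists of object labels (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stub_vlm(objects, prompt, max_tokens):
    """Hypothetical stand-in for a VLM call: returns a word-limited description."""
    text = " ".join(sorted(objects))
    if prompt:
        text = prompt + " ; " + text
    return " ".join(text.split()[:max_tokens])


def describe_intersection(cameras, first_limit=64, later_limit=16, threshold=0.8):
    """Process cameras in order, reusing text when object sets barely change."""
    descriptions, calls = {}, 0
    prev_objects, prev_text = None, ""
    for name, objects in cameras:
        if prev_objects is not None and jaccard(objects, prev_objects) >= threshold:
            descriptions[name] = prev_text  # skip the redundant VLM invocation
            continue
        # First camera gets a large token budget; later ones a smaller budget
        # plus the previous description as their prompt.
        limit = first_limit if prev_objects is None else later_limit
        prev_text = stub_vlm(objects, prev_text, limit)
        prev_objects = objects
        descriptions[name] = prev_text
        calls += 1
    return descriptions, calls


cameras = [
    ("north", ["car", "bus", "person"]),
    ("east", ["car", "bus", "person"]),
    ("south", ["truck", "bike"]),
]
descriptions, vlm_calls = describe_intersection(cameras)
```

In this toy run the east camera sees the same objects as the north camera, so its description is reused and only two of the three cameras trigger a (stubbed) VLM call.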
[50] Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI
Al Amin,Kamrul Hasan,Liang Hong,Sharif Ullah
Main category: cs.CV
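TL;DR: The paper proposes a privacy-preserving federated learning framework that combines Vision Transformers (ViT) with lightweight homomorphic encryption (HE) for multi-institutional collaborative diagnosis in medical AI. Encrypting the CLS token improves communication efficiency while protecting privacy.
Details
Motivation: Privacy regulations restrict the sharing of medical data, and gradients in conventional federated learning are vulnerable to attack, so a more secure collaborative learning method is needed. Contribution: 1) A privacy-preserving framework that combines ViT with HE; 2) use of the CLS token as a compact feature representation to reduce communication overhead; 3) inference in the encrypted domain, which prevents model inversion attacks.
Method: The ViT CLS token serves as the feature representation and is encrypted with CKKS homomorphic encryption, enabling secure aggregation and encrypted inference.
Result: Encrypting CLS tokens reduces communication 30-fold compared with encrypting gradients; classification accuracy reaches 96.12% in the unencrypted domain and 90.02% in the encrypted domain; image reconstruction attacks are effectively prevented.
Insight: Lightweight encryption combined with the feature extraction ability of ViT can protect privacy while retaining high classification performance, which suits collaborative learning in medical AI.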
Abstract: Collaborative machine learning across healthcare institutions promises improved diagnostic accuracy by leveraging diverse datasets, yet privacy regulations such as HIPAA prohibit direct patient data sharing. While federated learning (FL) enables decentralized training without raw data exchange, recent studies show that model gradients in conventional FL remain vulnerable to reconstruction attacks, potentially exposing sensitive medical information. This paper presents a privacy-preserving federated learning framework combining Vision Transformers (ViT) with homomorphic encryption (HE) for secure multi-institutional histopathology classification. The approach leverages the ViT CLS token as a compact 768-dimensional feature representation for secure aggregation, encrypting these tokens using CKKS homomorphic encryption before transmission to the server. We demonstrate that encrypting CLS tokens achieves a 30-fold communication reduction compared to gradient encryption while maintaining strong privacy guarantees. Through evaluation on a three-client federated setup for lung cancer histopathology classification, we show that gradients are highly susceptible to model inversion attacks (PSNR: 52.26 dB, SSIM: 0.999, NMI: 0.741), enabling near-perfect image reconstruction. In contrast, the proposed CLS-protected HE approach prevents such attacks while enabling encrypted inference directly on ciphertexts, requiring only 326 KB of encrypted data transmission per aggregation round. The framework achieves 96.12 percent global classification accuracy in the unencrypted domain and 90.02 percent in the encrypted domain.
[51] GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety Supervision
Yuxiao Xiang,Junchi Chen,Zhenchao Jin,Changtao Miao,Haojie Yuan,Qi Chu,Tao Gong,Nenghai Yu
Main category: cs.CV
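TL;DR: GuardTrace-VL is a vision-language safety auditor that detects unsafe content by monitoring the intermediate content of multimodal reasoning. Existing methods look only at the input question and the final answer and miss biased or policy-violating content that can arise during reasoning. Under joint image-text analysis, GuardTrace-VL improves the F1 score by 13.5% over the previous best method.
Details
Motivation: Existing multimodal safety guards evaluate only the input question and the final answer. They ignore harmful content that the intermediate reasoning process may produce, such as biased inferences or policy-violating use of visual context, so potential risks go undetected. Contribution: 1. GuardTrace-VL, described as the first safety auditor that monitors the full multimodal reasoning process; 2. the GuardTrace dataset for training and evaluation; 3. a three-stage progressive training scheme that learns safety preferences for different risk levels.
Method: The model uses joint image-text analysis to monitor the QTA (Question-Thinking-Answer) pipeline. Through data refinement and a three-stage progressive training scheme, it learns context-dependent safety preferences.
Result: On a test set covering in-domain and out-of-domain scenarios, GuardTrace-VL reaches an F1 score of 93.1%, a 13.5% improvement over the previous best method.
Insight: Intermediate steps of multimodal reasoning can hide harmful content; fine-grained monitoring and safety preference learning are needed to improve safety.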
Abstract: Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via a MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods. The codes will be made publicly available.
[52] From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition
Jingxi Chen,Yixiao Zhang,Xiaoye Qian,Zongxia Li,Cornelia Fermuller,Caren Chen,Yiannis Aloimonos
Main category: cs.CV
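TL;DR: The paper proposes using generative inpainting models for image layer decomposition. With lightweight finetuning and a multi-modal context fusion module, it achieves efficient decomposition of an image into layers.
Details
Motivation: A layered image representation allows elements to be edited independently, but decomposing a single image into layers remains challenging. The paper observes a connection between layer decomposition and inpainting and explores how inpainting models can be used for decomposition. Contribution: (1) Adapts a diffusion-based inpainting model to layer decomposition; (2) proposes a multi-modal context fusion module that preserves detail; (3) trains on a synthetic dataset built from open-source data and performs well on object removal and occlusion recovery.
Method: (1) The inpainting model is adapted through lightweight finetuning; (2) a multi-modal context fusion module with linear attention complexity is introduced; (3) the training data is a synthetic dataset constructed from open-source assets.
Result: The model performs strongly on object removal and occlusion recovery, opening new possibilities for downstream editing and creative applications.
Insight: The paper shows the potential of inpainting models for layer decomposition and the effectiveness of lightweight adaptation and multi-modal fusion.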
Abstract: Images can be viewed as layered compositions, foreground objects over background, with potential occlusions. This layered representation enables independent editing of elements, offering greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightweight finetuning. To further preserve detail in the latent space, we introduce a novel multi-modal context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.
[53] Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning
Xiaoxing You,Qiang Huang,Lingyu Li,Chi Zhang,Xiaopeng Liu,Min Zhang,Jun Yu
Main category: cs.CV
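TL;DR: MERGE is a multimodal entity-aware retrieval-augmented generation framework for news image captioning. It targets incomplete information, weak cross-modal alignment, and poor visual-entity grounding, and improves performance by building an entity-centric multimodal knowledge base and using dynamic retrieval.
Details
Motivation: Existing news image captioning methods suffer from incomplete information coverage, weak cross-modal alignment, and suboptimal visual-entity grounding. MERGE aims to address these with a multimodal knowledge base and dynamic retrieval. Contribution: Proposes MERGE, described as the first multimodal entity-aware retrieval-augmented generation framework; builds the entity-centric knowledge base EMKB; introduces a multistage hypothesis-caption strategy and dynamic retrieval for visual-entity matching.
Method: 1. Build the EMKB knowledge base, integrating textual, visual, and structured knowledge; 2. improve cross-modal alignment with a multistage hypothesis-caption strategy; 3. improve visual-entity matching through dynamic retrieval.
Result: On GoodNews and NYTimes800k, MERGE significantly outperforms baselines, with CIDEr gains of +6.84 and +1.16 and F1-score gains of +4.14 and +2.64 respectively; it also generalizes well to the Visual News dataset.
Insight: With an entity-centric multimodal knowledge base and dynamic retrieval, MERGE addresses the core challenges of news image captioning and shows strong generalization.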
Abstract: News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce MERGE, the first Multimodal Entity-aware Retrieval-augmented GEneration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.
[54] CaptionQA: Is Your Caption as Useful as the Image Itself?
Shijia Yang,Yunong Liu,Bohan Zhai,Ximeng Sun,Zicheng Liu,Emad Barsoum,Manling Li,Chenfeng Xu
Main category: cs.CV
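TL;DR: The paper proposes the CaptionQA benchmark, which evaluates caption quality by its usefulness for downstream tasks, and reveals that current methods fall short in preserving image information.
Details
Motivation: Existing evaluation methods do not check whether a caption can stand in for the image in downstream tasks, so the authors propose a new utility-based benchmark. Contribution: 1. Introduces CaptionQA, an extensible benchmark covering multiple domains; 2. builds 33,027 multiple-choice questions to probe caption utility comprehensively; 3. reveals a substantial gap in how well current models preserve image information.
Method: An LLM answers questions that require visual information using only the caption, which directly measures the caption's utility.
Result: The evaluation shows substantial gaps between image utility and caption utility: models that score nearly identically on traditional image-QA benchmarks drop by up to 32% in caption utility.
Insight: Caption generation should be judged not only by traditional metrics but also by usefulness for downstream tasks.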
Abstract: Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains (Natural, Document, E-commerce, and Embodied AI), each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the image and its caption utility. Notably, models nearly identical on traditional image-QA benchmarks drop by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.
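The evaluation protocol (answer multiple-choice questions from the caption alone, then score accuracy) can be illustrated with a toy sketch. Everything here is assumed for illustration: `answer_from_caption` is a word-overlap heuristic standing in for the downstream LLM, and the question format is invented rather than taken from the CaptionQA release.

```python
def answer_from_caption(caption, options):
    """Hypothetical stand-in for the LLM: pick the option sharing most words with the caption."""
    caption_words = set(caption.lower().split())
    scores = [len(caption_words & set(opt.lower().split())) for opt in options]
    return scores.index(max(scores))  # ties resolve to the first option


def caption_utility(caption, questions):
    """Fraction of questions answered correctly using only the caption."""
    correct = sum(
        answer_from_caption(caption, q["options"]) == q["answer"] for q in questions
    )
    return correct / len(questions)


# Invented questions about one image; "answer" is the index of the correct option.
questions = [
    {"options": ["red car", "blue truck", "green bike"], "answer": 0},
    {"options": ["two people", "three dogs", "one cat"], "answer": 1},
]

detailed_utility = caption_utility("a red car parked beside three dogs", questions)
terse_utility = caption_utility("a car on a street", questions)
```

The detailed caption keeps enough visual information to answer both questions, while the terse one supports only the first, which is the kind of gap a utility-based score is meant to expose.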
[55] LungNoduleAgent: A Collaborative Multi-Agent System for Precision Diagnosis of Lung Nodules
Cheng Yang,Hui Jin,Xinlei Yu,Zhipeng Wang,Yaoqun Liu,Fenglei Fan,Dajiang Lei,Gangyong Jia,Changmiao Wang,Ruiquan Ge
Main category: cs.CV
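TL;DR: The paper proposes LungNoduleAgent, a collaborative multi-agent system for precise diagnosis of lung nodules. The system improves diagnostic precision through three main modules and shows the benefits of multi-agent collaboration and region-level semantic alignment.
Details
Motivation: Although multimodal large language models have advanced lung CT analysis, they still struggle to describe nodule morphology accurately and to incorporate medical expertise, which limits their reliability and clinical effectiveness. Contribution: LungNoduleAgent is a collaborative multi-agent system whose modular design (nodule detection, a simulated radiologist, and a doctor agent system) improves the precision of lung nodule diagnosis.
Method: The system has three modules: 1) Nodule Spotter, which coordinates clinical detection models; 2) the Radiologist module, which integrates localized image description techniques; 3) the Doctor Agent System, which reasons about malignancy from images and CT reports.
Result: Tests on private and public datasets show that LungNoduleAgent outperforms mainstream vision-language models and multi-agent systems, highlighting its potential for clinical analysis.
Insight: Region-level semantic alignment and multi-agent collaboration matter for lung nodule diagnosis and suggest a design direction for future medical AI systems.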
Abstract: Diagnosing lung cancer typically involves physicians identifying lung nodules in Computed tomography (CT) scans and generating diagnostic reports based on their morphological features and medical expertise. Although advancements have been made in using multimodal large language models for analyzing lung CT scans, challenges remain in accurately describing nodule morphology and incorporating medical expertise. These limitations affect the reliability and effectiveness of these models in clinical settings. Collaborative multi-agent systems offer a promising strategy for achieving a balance between generality and precision in medical applications, yet their potential in pathology has not been thoroughly explored. To bridge these gaps, we introduce LungNoduleAgent, an innovative collaborative multi-agent system specifically designed for analyzing lung CT scans. LungNoduleAgent streamlines the diagnostic process into sequential components, improving precision in describing nodules and grading malignancy through three primary modules. The first module, the Nodule Spotter, coordinates clinical detection models to accurately identify nodules. The second module, the Radiologist, integrates localized image description techniques to produce comprehensive CT reports. Finally, the Doctor Agent System performs malignancy reasoning by using images and CT reports, supported by a pathology knowledge base and a multi-agent system framework. Extensive testing on two private datasets and the public LIDC-IDRI dataset indicates that LungNoduleAgent surpasses mainstream vision-language models, agent systems, and advanced expert models. These results highlight the importance of region-level semantic alignment and multi-agent collaboration in diagnosing nodules. LungNoduleAgent stands out as a promising foundational tool for supporting clinical analyses of lung nodules.
[56] MIRA: Multimodal Iterative Reasoning Agent for Image Editing
Ziyun Zeng,Hang Hua,Jiebo Luo
Main category: cs.CV
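TL;DR: MIRA is a lightweight multimodal reasoning agent that edits images through an iterative perception-reasoning-action loop, significantly improving results on complex instructions.
Details
Motivation: Existing diffusion-based image editing models interpret complex user instructions poorly, leading to semantic drift or failed edits. MIRA aims to solve this problem. Contribution: Proposes MIRA, a plug-and-play multimodal reasoning agent, together with the MIRA-Editing dataset and a two-stage training pipeline (SFT + GRPO).
Method: MIRA runs an iterative perception-reasoning-action loop, predicting atomic edit instructions step by step and adjusting dynamically based on visual feedback.
Result: When paired with open-source editing models, MIRA significantly improves semantic consistency and perceptual quality, matching or exceeding proprietary systems.
Insight: Iterative reasoning and multimodal feedback are key to better image editing under complex instructions.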
Abstract: Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.
[57] Pygmalion Effect in Vision: Image-to-Clay Translation for Reflective Geometry Reconstruction
Gayoung Lee,Junho Kim,Jin-Hwa Kim,Junmo Kim
Main category: cs.CV
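TL;DR: The paper proposes a framework called the Pygmalion Effect in Vision, which reconstructs the geometry of reflective objects from multi-view images through image-to-clay translation, addressing the difficulty that reflections pose for 3D reconstruction.
Details
Motivation: Reflections make it hard to separate an object's geometry from its appearance during reconstruction, and traditional 3D reconstruction methods struggle with complex reflections. Inspired by the myth of Pygmalion, the paper proposes suppressing specular cues to simplify geometry extraction. Contribution: 1. A dual-branch network that handles reflective signals and clay-guided geometry separately; 2. synthesized clay-like images as a neutral supervision signal that makes geometric reconstruction more robust; 3. experiments showing better normal accuracy and mesh completeness than existing methods.
Method: A dual-branch network is used: one BRDF-based branch handles reflective signals, and a clay-guided branch stabilizes geometry and refines surface normals. The two branches are trained jointly with synthesized clay-like images, achieving reflection suppression and geometric consistency.
Result: Experiments on synthetic and real datasets show that the method substantially outperforms existing methods in normal accuracy and mesh completeness.
Insight: Translating specular information into a neutral signal ("claying" the image) can markedly improve geometric reconstruction of reflective objects and offers a new inductive bias for geometry learning.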
Abstract: Understanding reflection remains a long-standing challenge in 3D reconstruction due to the entanglement of appearance and geometry under view-dependent reflections. In this work, we present the Pygmalion Effect in Vision, a novel framework that metaphorically “sculpts” reflective objects into clay-like forms through image-to-clay translation. Inspired by the myth of Pygmalion, our method learns to suppress specular cues while preserving intrinsic geometric consistency, enabling robust reconstruction from multi-view images containing complex reflections. Specifically, we introduce a dual-branch network in which a BRDF-based reflective branch is complemented by a clay-guided branch that stabilizes geometry and refines surface normals. The two branches are trained jointly using the synthesized clay-like images, which provide a neutral, reflection-free supervision signal that complements the reflective views. Experiments on both synthetic and real datasets demonstrate substantial improvement in normal accuracy and mesh completeness over existing reflection-handling methods. Beyond technical gains, our framework reveals that seeing by unshining, translating radiance into neutrality, can serve as a powerful inductive bias for reflective object geometry learning.
[58] Scaling Foundation Models for Radar Scene Understanding
Pushkal Mishra,Kshitiz Bansal,Dinesh Bharadia
Main category: cs.CV
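TL;DR: The paper proposes RadarFM, a radar foundation model that learns unified scene representations through structured spatial language supervision, addressing the fragmentation of radar perception tasks.
Details
Motivation: Radar sensors provide stable perception in adverse conditions, but existing radar methods are usually task-specific and lack a unified representation learning framework. Contribution: 1. A structured caption framework that encodes vehicle distributions in radar coordinates; 2. a hash-aware contrastive learning objective that supports fine-grained spatial reasoning.
Method: The CARLA simulator is used to generate large-scale radar datasets, and the model is trained with structured spatial language supervision and a hash-aware contrastive learning objective.
Result: Experiments show RadarFM's unified representation across multiple tasks, with spatial accuracy assessed by the proposed localization-aware metrics.
Insight: Structured language supervision and contrastive learning make it possible to build a general radar foundation model, improving the transferability and expressiveness of radar perception.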
Abstract: Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions. Recent advances in foundation models have transformed visual and language understanding, yet their integration with radar sensing remains largely underexplored. Existing radar approaches are fragmented and task-specific; each downstream task employs distinct architectures and training objectives, preventing transfer across tasks. In this work, we introduce RadarFM: a radar foundation model that learns unified scene-level representations through structured spatial language supervision. We make two key contributions: (1) a structured caption framework that encodes vehicle distributions in native radar coordinates, and (2) a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching, enabling fine-grained spatial reasoning. Leveraging the CARLA simulator, we generate large-scale, well-annotated radar datasets across diverse driving scenarios. We also propose localization-aware metrics that assess spatial accuracy beyond traditional detection measures.
[59] EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens
Ze Feng,Sen Yang,Boqiang Duan,Wankou Yang,Jingdong Wang
Main category: cs.CV
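TL;DR: The paper proposes EM-KD, a method that uses knowledge distillation to improve efficient multimodal large language models and addresses the gap in fine-grained visual understanding caused by unbalanced vision tokens.
Details
Motivation: Existing efficient multimodal LLMs compress vision tokens to reduce resource consumption, but this loses visual information and hurts comprehension. Conventional knowledge distillation methods overlook the imbalance in vision tokens between the student and teacher models. Contribution: 1) Proposes EM-KD, which aligns the vision tokens of teacher and student using the Manhattan distance and the Hungarian matching algorithm; 2) introduces two distillation strategies, Vision-Language Affinity Distillation (VLAD) and Vision Semantic Distillation (VSD); 3) validates EM-KD on multiple benchmarks.
Method: 1) Align vision tokens using the Manhattan distance and the Hungarian matching algorithm; 2) VLAD aligns the teacher and student affinity matrices by minimizing their smooth L1 distance; 3) VSD aligns the semantic distributions of the vision tokens with the reverse KL divergence.
Result: EM-KD clearly outperforms existing efficient multimodal LLMs in both accuracy and efficiency, and under a fair comparison it also outperforms other distillation methods.
Insight: Aligning vision tokens is the key step that resolves the difference in fine-grained visual understanding between student and teacher; the two distillation strategies improve the model from the perspectives of affinity and semantic distribution respectively.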
Abstract: Efficient Multimodal Large Language Models (MLLMs) compress vision tokens to reduce resource consumption, but the loss of visual information can degrade comprehension capabilities. Although some priors introduce Knowledge Distillation to enhance student models, they overlook the fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between the efficient student and vanilla teacher. In this paper, we propose EM-KD, a novel paradigm that enhances the Efficient MLLMs with Knowledge Distillation. To overcome the challenge of unbalanced vision tokens, we first calculate the Manhattan distance between the vision logits of teacher and student, and then align them in the spatial dimension with the Hungarian matching algorithm. After alignment, EM-KD introduces two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD) and 2) Vision Semantic Distillation (VSD). Specifically, VLAD calculates the affinity matrix between text tokens and aligned vision tokens, and minimizes the smooth L1 distance of the student and the teacher affinity matrices. Considering the semantic richness of vision logits in the final layer, VSD employs the reverse KL divergence to measure the discrete probability distributions of the aligned vision logits over the vocabulary space. Comprehensive evaluation on diverse benchmarks demonstrates that EM-KD trained model outperforms prior Efficient MLLMs on both accuracy and efficiency with a large margin, validating its effectiveness. Compared with previous distillation methods, which are equipped with our proposed vision token matching strategy for fair comparison, EM-KD also achieves better performance.
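The token alignment and the two loss primitives named in the abstract can be sketched on toy data. This is an illustration under stated assumptions, not the paper's code: the matching below is a brute-force search that gives the same optimal assignment the Hungarian algorithm would find (practical only for a handful of tokens), each student token is assumed to be matched to one distinct teacher token, and "reverse KL" is taken in its usual distillation sense, KL(student || teacher). All function names are hypothetical.

```python
import math
from itertools import permutations


def manhattan(a, b):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_tokens(student, teacher):
    """Assign each student token a distinct teacher token, minimizing total L1 cost.

    Brute-force stand-in for the Hungarian algorithm; fine for tiny examples only.
    """
    cost = [[manhattan(s, t) for t in teacher] for s in student]
    best_assign, best_cost = None, float("inf")
    for assign in permutations(range(len(teacher)), len(student)):
        total = sum(cost[i][j] for i, j in enumerate(assign))
        if total < best_cost:
            best_assign, best_cost = assign, total
    return list(best_assign), best_cost


def smooth_l1(x, y, beta=1.0):
    """Smooth L1 (Huber-style) distance between two scalars."""
    d = abs(x - y)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta


def reverse_kl(student_p, teacher_p, eps=1e-12):
    """KL(student || teacher) over two discrete distributions."""
    return sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(student_p, teacher_p))


# Toy "vision logits": the teacher has four tokens, the compressed student two.
teacher = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.9, 0.1]]
student = [[0.1, 0.9], [0.8, 0.2]]
assignment, total_cost = match_tokens(student, teacher)
```

Here `assignment[i]` is the index of the teacher token paired with student token `i`; after this pairing, element-wise losses such as `smooth_l1` or `reverse_kl` can be computed between matched tokens even though the two models emit different numbers of vision tokens.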
[60] FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain
YuAn Wang,Xiaofan Li,Chi Huang,Wenhao Zhang,Hao Li,Bosheng Wang,Xun Sun,Jun Wang
Main category: cs.CV
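TL;DR: FaithFusion proposes a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG), addressing the challenges of geometric fidelity and visual consistency in controllable driving-scene reconstruction and generation.
Details
Motivation: In controllable driving-scene reconstruction and 3D scene generation, a key problem is keeping geometric fidelity while synthesizing visually plausible scenes under large viewpoint shifts, yet existing methods lack pixel-wise, 3D-consistent editing criteria. Contribution: Proposes a unified EIG-based policy that guides the diffusion model as a spatial prior to refine high-uncertainty regions and distills the edits back into 3DGS through pixel-level weighting, without extra prior conditions or structural modifications.
Method: EIG drives the framework and coordinates the fusion of 3DGS and the diffusion model to achieve spatio-temporally coherent synthesis.
Result: On the Waymo dataset the method achieves SOTA performance on NTA-IoU, NTL-IoU, and FID, maintaining an FID of 107.47 even at a 6-meter lane shift.
Insight: As a pixel-level criterion, EIG balances the demands of reconstruction and generation and improves the geometric consistency and visual quality of the synthesized results.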
Abstract: In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce FaithFusion, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications. Extensive experiments on the Waymo dataset demonstrate that our approach attains SOTA performance across NTA-IoU, NTL-IoU, and FID, maintaining an FID of 107.47 even at 6 meters lane shift. Our code is available at https://github.com/wangyuanbiubiubiu/FaithFusion.
[61] Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models
Changlin Li,Jiawei Zhang,Zeyi Shi,Zongxin Yang,Zhihui Li,Xiaojun Chang
Main category: cs.CV
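TL;DR: Proposes EntPruner, an entropy-guided adaptive pruning framework for diffusion and flow models. Pruning is guided by Conditional Entropy Deviation (CED), yielding faster inference while preserving generation quality.
Details
Motivation: Large vision generative models carry redundant parameters when transferred to downstream tasks, and traditional pruning methods adapt poorly to the diversity requirements of generative models. Contribution: 1) An entropy-guided, block-level importance assessment strategy; 2) a zero-shot adaptive pruning framework; 3) CED, which quantifies distribution deviation and helps avoid mode collapse.
Method: CED measures the distribution deviation after pruning, and the framework decides dynamically when and how much to prune. Experiments on DiT and SiT models verify the efficiency gains.
Result: On ImageNet and three downstream datasets, the method achieves up to 2.22$\times$ inference speedup while generation quality remains competitive.
Insight: Pruning generative models must preserve both diversity and condition fidelity; a data-dependent entropy measure is the key.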
Abstract: Large-scale vision generative models, including diffusion and flow models, have demonstrated remarkable performance in visual generation tasks. However, transferring these pre-trained models to downstream tasks often results in significant parameter redundancy. In this paper, we propose EntPruner, an entropy-guided automatic progressive pruning framework for diffusion and flow models. First, we introduce entropy-guided pruning, a block-level importance assessment strategy specifically designed for generative models. Unlike discriminative models, generative models require preserving the diversity and condition-fidelity of the output distribution. As the importance of each module can vary significantly across downstream tasks, EntPruner prioritizes pruning of less important blocks using data-dependent Conditional Entropy Deviation (CED) as a guiding metric. CED quantifies how much the distribution diverges from the learned conditional data distribution after removing a block. Second, we propose a zero-shot adaptive pruning framework to automatically determine when and how much to prune during training. This dynamic strategy avoids the pitfalls of one-shot pruning, mitigating mode collapse, and preserving model performance. Extensive experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22$\times$ inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets.
[62] CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion
Dianbing Xi,Jiepeng Wang,Yuanzhi Liang,Xi Qiu,Jialun Liu,Hao Pan,Yuchi Huo,Rui Wang,Haibin Huang,Chi Zhang,Xuelong Li
Main category: cs.CV
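TL;DR: CtrlVDiff proposes a unified diffusion framework for video understanding and controllable video generation, using multimodal inputs to achieve precise control and high-quality output.
Details
Motivation: Methods that rely only on geometric cues (such as depth and edges) cannot support complex edits such as relighting or material swaps, and they tend to cause temporal drift. Contribution: 1) Introduces multimodal inputs (depth, normals, segmentation, edges, materials, and others) to provide complementary constraints; 2) proposes the Hybrid Modality Control Strategy (HMCS) to fuse multimodal features; 3) builds the MMVideo dataset to support training.
Method: A unified diffusion model routes and fuses multimodal features through HMCS and can re-render videos from any subset of modalities. The MMVideo dataset combines real and synthetic data.
Result: The model performs strongly on understanding and generation tasks, supports layer-wise edits (such as relighting, material adjustment, and object insertion), and remains stable when some modalities are missing.
Insight: Multimodal inputs and a unified framework are key to improving the controllability and quality of video generation.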
Abstract: We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold: geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation. However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations. We then propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions. Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
[63] DeepRFTv2: Kernel-level Learning for Image Deblurring
Xintian Mao,Haofei Song,Yin-Nian Liu,Qingli Li,Yan Wang
Main category: cs.CV
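TL;DR: The paper proposes the Fourier Kernel Estimator (FKE), which operates in Fourier space for image deblurring, enabling kernel-level learning and improving performance.
Details
Motivation: Existing deep learning models remain mostly at the pixel-level learning stage and do not capture the essence of blur. A method that learns the blur kernel directly is therefore needed. Contribution: 1. Proposes the Fourier Kernel Estimator (FKE), which learns the blur kernel in Fourier space and so learns the blur process at the kernel level; 2. applies the convolution to network-extracted features rather than the image, which improves blur kernel learning; 3. designs a decoupled multi-scale architecture that makes feature extraction more efficient.
Method: 1. Kernel estimation is performed in Fourier space, converting the convolution problem in the spatial domain into a multiplication problem in the frequency domain; 2. features replace the image as the object of convolution, improving learning; 3. a multi-scale architecture of sub-networks with a reversible strategy reduces memory use.
Result: Experiments show that the method achieves state-of-the-art performance on motion deblurring and that the estimated blur kernels are physically meaningful.
Insight: Kernel-level learning and processing in Fourier space capture the essence of blur more effectively, and feature-level convolution further improves the model's expressive power.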
Abstract: It is well-known that if a network aims to learn how to deblur, it should understand the blur process. Blurring is naturally caused by the convolution of the sharp image with the blur kernel. Thus, allowing the network to learn the blur process at the kernel level can significantly improve the image deblurring performance. But current deep networks are still at the pixel-level learning stage, either performing end-to-end pixel-level restoration or stage-wise pseudo kernel-level restoration, failing to enable the deblur model to understand the essence of the blur. To this end, we propose Fourier Kernel Estimator (FKE), which considers the activation operation in Fourier space and converts the convolution problem in the spatial domain to a multiplication problem in Fourier space. Our FKE, jointly optimized with the deblur model, enables the network to learn the kernel-level blur process with low complexity and without any additional supervision. Furthermore, we change the convolution object of the kernel from "image" to "network extracted feature", whose rich semantic and structural information is more suitable to blur process learning. With the convolution of the feature and the estimated kernel, our model can learn the essence of blur at the kernel level. To further improve the efficiency of feature extraction, we design a decoupled multi-scale architecture with multiple hierarchical sub-unets with a reversible strategy, which allows better multi-scale encoding and decoding with low training memory. Extensive experiments indicate that our method achieves state-of-the-art motion deblurring results and shows potential for handling other kernel-related problems. Analysis also shows our kernel estimator is able to learn physically meaningful kernels. The code will be available at https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur.
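The identity the abstract relies on, that convolution in the spatial domain becomes element-wise multiplication in Fourier space, can be checked numerically. The sketch below is only a 1-D illustration of that identity with a naive DFT and circular boundary handling; it is not the FKE itself, and the signal, the kernel, and all function names are invented for the example.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform of a 1-D sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def circular_conv(signal, kernel):
    """Direct circular convolution in the spatial domain."""
    n = len(signal)
    return [sum(signal[(i - j) % n] * kernel[j] for j in range(n)) for i in range(n)]


def fourier_conv(signal, kernel):
    """Same convolution via element-wise multiplication of the two spectra."""
    spectrum = [a * b for a, b in zip(dft(signal), dft(kernel))]
    return [v.real for v in dft(spectrum, inverse=True)]


sharp = [0.0, 1.0, 4.0, 2.0, 0.0, 0.0, 3.0, 1.0]
# A small normalized blur kernel, zero-padded to the signal length.
blur_kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]

blurred_spatial = circular_conv(sharp, blur_kernel)
blurred_fourier = fourier_conv(sharp, blur_kernel)
max_diff = max(abs(a - b) for a, b in zip(blurred_spatial, blurred_fourier))
```

The two results agree up to floating-point error, which is why a blur kernel can in principle be estimated and applied with cheap per-frequency multiplications instead of explicit spatial convolutions.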
[64] Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning
Changlin Li,Jiawei Zhang,Shuhao Liu,Sihao Lin,Zeyi Shi,Zhihui Li,Xiaojun Chang
Main category: cs.CV
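TL;DR: Proposes Ent-Prog, an efficient training framework for human video generation that uses entropy-guided prioritized progressive learning to cut computational cost and memory consumption while maintaining generation performance.
Details
Motivation: Current diffusion-based human video generation methods face high computational cost and heavy memory consumption when training on high-resolution, multi-frame data, so an efficient training framework is needed.
Contribution: 1. Proposes Conditional Entropy Inflation (CEI) to assess the importance of model components; 2. Designs an adaptive progressive schedule that dynamically increases training complexity; 3. Achieves substantial training speedup and memory savings.
Method: 1. CEI is used to prioritize training of the most critical components; 2. The progressive schedule adjusts dynamically based on convergence efficiency; 3. The Ent-Prog framework combines both to improve training efficiency.
Result: Experiments on three datasets show that Ent-Prog achieves up to 2.2x training speedup and up to 2.4x GPU memory reduction without compromising generative performance.
Insight: Adaptive prioritized training and dynamic scheduling can markedly improve the training efficiency of diffusion models on complex tasks, which suits high-resolution video generation.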
Abstract: Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption associated with training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components on the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that adaptively increases computational complexity during training by measuring the convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model performance. Extensive experiments across three datasets demonstrate the effectiveness of Ent-Prog, achieving up to 2.2$\times$ training speedup and 2.4$\times$ GPU memory reduction without compromising generative performance.
[65] Referring Video Object Segmentation with Cross-Modality Proxy Queries
Baoli Sun,Xinzhu Ma,Ning Wang,Zhihui Wang,Zhiyong Wang
Main category: cs.CV
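TL;DR: ProxyFormer introduces proxy queries that integrate visual and textual semantics, and progressively updates and propagates them across multiple stages to improve cross-modality alignment and target tracking accuracy.
Details
Motivation: Existing RVOS methods lack inter-frame dependency and variation modeling in cross-modality alignment and introduce textual constraints late, which makes target tracking difficult and lets non-referred objects interfere.
Contribution: Proposes ProxyFormer, which uses proxy queries to dynamically integrate visual and textual semantics, strengthening inter-frame dependencies and target tracking; designs a Joint Semantic Consistency (JSC) training strategy to improve semantic alignment.
Method: Proxy queries are progressively updated and propagated; cross-modality interactions are decoupled into temporal and spatial dimensions; JSC training aligns semantic consensus between the proxy queries and the video-text pairs.
Result: Outperforms existing methods on four RVOS benchmarks.
Insight: Dynamically updating proxy queries and decoupling spatio-temporal interactions are key to improving performance on cross-modality tasks.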
Abstract: Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism built upon transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features to focus on non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of the video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs. Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of our ProxyFormer over the state-of-the-art methods.
[66] TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models
Jiaming He,Guanyu Hou,Hongwei Li,Zhicong Huang,Kangjie Chen,Yi Yu,Wenbo Jiang,Guowen Xu,Tianwei Zhang
Main category: cs.CV
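TL;DR: This paper proposes TEAR, a temporal-aware automated red-teaming framework for assessing the safety risks that text-to-video (T2V) models exhibit in dynamic temporal generation; its two-stage optimization substantially raises the attack success rate.
Details
Motivation: Existing safety evaluation methods mainly target static image and text generation and struggle to capture the complex temporal dynamics of video generation, so an automated safety testing framework specific to T2V models is needed.
Contribution: TEAR is an automated red-teaming framework focused on safety risks tied to the temporal dynamics of T2V models; its two-stage optimization (generator training followed by temporal-aware online preference learning) substantially raises the attack success rate.
Method: TEAR uses a two-stage approach: initial generator training, followed by temporal-aware online preference learning to optimize test prompts; a refine model then cyclically improves prompt stealthiness and adversarial effectiveness.
Result: TEAR achieves an attack success rate above 80% across open-source and commercial T2V systems, well above the previous best result of 57%.
Insight: Dynamic temporal behavior is a key source of potential safety risk in T2V models, and TEAR offers an efficient, systematic way to identify and assess these risks.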
Abstract: Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods, which focus on static image and text generation, are insufficient to capture the complex temporal dynamics in video generation. To address this, we propose a TEmporal-aware Automated Red-teaming framework, named TEAR, an automated framework designed to uncover safety risks specifically linked to the dynamic temporal sequencing of T2V models. TEAR employs a temporal-aware test generator optimized via a two-stage approach: initial generator training and temporal-aware online preference learning, to craft textually innocuous prompts that exploit temporal dynamics to elicit policy-violating video output. In addition, a refine model is adopted to cyclically improve prompt stealthiness and adversarial effectiveness. Extensive experimental evaluation demonstrates the effectiveness of TEAR across open-source and commercial T2V systems with over 80% attack success rate, a significant boost over the prior best result of 57%.
[67] LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
Shichu Sun,Yichen Zhang,Haolin Song,Zonghao Guo,Chi Chen,Yidan Zhang,Yuan Yao,Zhiyuan Liu,Maosong Sun
Main category: cs.CV
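TL;DR: LLaVA-UHD v3 uses the proposed Progressive Visual Compression (PVC) method to achieve efficient native-resolution visual encoding in multi-modal large language models (MLLMs), substantially reducing computational overhead.
Details
Motivation: Global native-resolution visual encoding gives MLLMs stronger overall capability but incurs greater computational overhead. The paper aims to address this with progressive visual compression.
Contribution: Proposes PVC, which comprises two modules, refined patch embedding and windowed token compression, and markedly improves visual encoding efficiency.
Method: 1) Refined patch embedding supports flexible patch-size scaling; 2) windowed token compression is deployed hierarchically across ViT layers to progressively aggregate local token representations.
Result: ViT-UHD is competitive with MoonViT while reducing TTFT (time-to-first-token) by 2.4x within an identical MLLM architecture; LLaVA-UHD v3 is competitive with Qwen2-VL while further reducing TTFT by 1.9x.
Insight: PVC shows how to improve efficiency substantially while largely preserving a pretrained model's generality, offering a new direction for efficient MLLM design.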
Abstract: Visual encoding followed by token condensing has become the standard architectural paradigm in multi-modal large language models (MLLMs). Many recent MLLMs increasingly favor global native-resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual modeling, and (ii) windowed token compression, hierarchically deployed across ViT layers to progressively aggregate local token representations. Jointly modulated by these two modules, a widely pretrained ViT can be reconfigured into an efficient architecture while largely preserving generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, demonstrates competitive performance with MoonViT while reducing TTFT (time-to-first-token) by 2.4x, when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves competitive performance to Qwen2-VL, while further reducing TTFT by 1.9x. We will release all code and checkpoints to support future research on efficient MLLMs.
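To make "windowed token compression" concrete, here is a minimal NumPy sketch that merges each 2x2 window of a token grid into one token, applied twice to mimic hierarchical deployment. The abstract does not specify the aggregation operator, so average pooling is an assumption, and `window_compress` is a hypothetical name rather than anything from the LLaVA-UHD v3 codebase.

```python
import numpy as np

def window_compress(tokens, grid_h, grid_w, window=2):
    """Merge each window x window block of a token grid into one token.

    tokens: array of shape (grid_h * grid_w, channels), row-major over the grid.
    Average pooling is an assumed stand-in for the paper's compression operator.
    """
    n, c = tokens.shape
    if n != grid_h * grid_w:
        raise ValueError("token count does not match the grid size")
    if grid_h % window or grid_w % window:
        raise ValueError("grid size must be divisible by the window size")
    # Axes: (block row, row in block, block col, col in block, channels).
    grid = tokens.reshape(grid_h // window, window, grid_w // window, window, c)
    pooled = grid.mean(axis=(1, 3))
    return pooled.reshape(-1, c)

# Toy 4x4 grid of one-channel tokens with values 0..15.
tokens = np.arange(16, dtype=float).reshape(16, 1)
stage1 = window_compress(tokens, 4, 4)   # 16 tokens -> 4 tokens
stage2 = window_compress(stage1, 2, 2)   # 4 tokens -> 1 token
```

Each stage cuts the token count by 4x, so placing such stages at increasing depth shrinks the sequence that later layers (and the LLM) must process.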
[68] Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis
Jiyun Bae,Hyunjong Ok,Sangwoo Mo,Jaeho Lee
Main category: cs.CV
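TL;DR: This paper studies how irrelevant information (distractors) affects reasoning in vision-language models (VLMs) and finds that visual distractors, unlike textual ones, reduce accuracy without increasing reasoning length.
Details
Motivation: Prior work showed that textual distractors make language models reason longer but less effectively. The authors ask whether a similar phenomenon occurs in multimodal (vision-language) settings and analyze how distractors affect performance and reasoning.
Contribution: 1) Introduces the Idis dataset, which systematically varies visual distractors along semantic, numerical, and spatial dimensions; 2) shows that visual and textual distractors have different effects; 3) proposes tracking attribute counts within reasoning traces; 4) shows that the trends extend to visual bias benchmarks and proposes a simple prompting strategy to mitigate bias.
Method: 1) Build the Idis dataset by controlling distractor attributes (semantic, numerical, spatial); 2) analyze reasoning length and accuracy under distractors; 3) use reasoning traces to analyze how distractors relate to reasoning behavior; 4) propose a prompting strategy to reduce bias-driven predictions.
Result: Visual distractors reduce accuracy without increasing reasoning length. The trends also extend to established visual bias benchmarks such as Waterbirds.
Insight: Visual distractors behave differently from textual ones, possibly because of differences between the modalities. Distractors not only hurt performance but are also closely tied to bias-driven behavior, which a simple prompting strategy can partly mitigate.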
Abstract: How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
[69] Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation
Joonhyung Park,Hyeongwon Jang,Joowon Kim,Eunho Yang
Main category: cs.CV
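TL;DR: GridAR is a test-time scaling framework for visual autoregressive models that combines grid-partitioned progressive generation with layout-specified prompt reformulation to improve image generation quality and efficiency.
Details
Motivation: Test-time compute scaling remains unexplored for visual autoregressive models. Existing strategies such as Best-of-N waste computation on erroneous generation trajectories, and raster-scan decoding lacks a blueprint of the entire canvas.
Contribution: 1) The GridAR framework, which improves efficiency through grid partitioning and progressive generation; 2) a layout-specified prompt reformulation strategy that improves generation quality; 3) demonstrated gains on both generation and editing tasks.
Method: GridAR uses a grid-partitioned progressive generation scheme that generates multiple partial candidates, prunes infeasible ones early, and fixes viable ones as anchors, while layout-specified prompt reformulation guides subsequent decoding.
Result: With N=4, GridAR outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. On image editing, it gains 13.9% in semantic preservation on PIE-Bench.
Insight: Early pruning and layout reformulation let test-time scaling improve both the efficiency and the quality of visual autoregressive models, for generation and editing alike.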
Abstract: Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt then guides subsequent image generation to mitigate the blueprint deficiency. Together, GridAR achieves higher-quality results under limited test-time scaling: with N=4, it even outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. It also generalizes to autoregressive image editing, showing comparable edit quality and a 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines.
[70] Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding
Yutao Tang,Cheng Zhao,Gaurav Mittal,Rohith Kukkala,Rama Chellappa,Cheng Peng,Mei Chen
Main category: cs.CV
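TL;DR: The paper proposes NDTokenizer3D, a generalist 3D vision-language model that uses a multi-scale NDT representation and the MSDec decoder to convert 3D scenes into holistic scene tokens, supporting diverse 3D understanding tasks and human interaction.
Details
Motivation: Although 3D vision-language models show strong potential for scene understanding and reasoning, effectively tokenizing 3D scenes and using the tokens across diverse tasks remains challenging.
Contribution: Proposes NDTokenizer3D, which uses a multi-scale NDT representation and the MSDec decoder to tokenize 3D scenes and support diverse tasks within a single architecture.
Method: A three-stage tokenization pipeline: build a multi-scale NDT representation, fuse cross-scale features with MSDec to produce scene tokens, and repurpose MSDec as a unified interface for interactive prompting and segmentation-mask decoding.
Result: Achieves notable improvements on 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.
Insight: A multi-scale NDT representation preserves both global context and fine-grained geometric detail, and a unified tokenization design supports general-purpose 3D vision-language understanding.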
Abstract: Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.
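For readers unfamiliar with the Normal Distributions Transform, the classical construction partitions a point cloud into voxels and summarizes each voxel by the mean and covariance of its points; repeating this at several voxel sizes gives a multi-scale representation. The sketch below shows only that classical step, not the NDTokenizer3D pipeline, and the function names, cell sizes, and `min_points` threshold are assumptions for illustration.

```python
import numpy as np

def ndt_cells(points, cell_size, min_points=3):
    """Summarize each occupied voxel by the mean and covariance of its points."""
    keys = np.floor(points / cell_size).astype(int)
    groups = {}
    for key, point in zip(keys, points):
        groups.setdefault(tuple(int(v) for v in key), []).append(point)
    cells = {}
    for key, members in groups.items():
        members = np.asarray(members)
        if len(members) < min_points:
            continue  # too few points for a meaningful covariance
        cells[key] = (members.mean(axis=0), np.cov(members, rowvar=False))
    return cells

def multi_scale_ndt(points, cell_sizes):
    """Build one NDT grid per cell size, from coarse to fine."""
    return {size: ndt_cells(points, size) for size in cell_sizes}

rng = np.random.default_rng(0)
cloud = rng.random((200, 3))                  # toy point cloud in the unit cube
pyramid = multi_scale_ndt(cloud, [1.0, 0.5])  # hypothetical cell sizes
coarse_mean, coarse_cov = pyramid[1.0][(0, 0, 0)]
```

The coarse level keeps one Gaussian for the whole cube (global context), while the finer level keeps up to eight Gaussians that describe local geometry.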
[71] When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models
Hui Lu,Yi Yu,Yiming Yang,Chenyu Yi,Qixin Zhang,Bingquan Shen,Alex C. Kot,Xudong Jiang
Main category: cs.CV
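TL;DR: This paper studies universal, transferable adversarial patch attacks on vision-language-action (VLA) models and proposes UPA-RFAS, a unified framework that learns a single physical patch in a shared feature space and uses several loss functions and optimization strategies to improve cross-model transfer.
Details
Motivation: Existing adversarial patches usually overfit to a single model and fail to transfer in black-box settings, while VLA models are widely used in robotics. A universal, transferable attack is needed to support robustness research.
Contribution: 1. Proposes the UPA-RFAS framework and presents a systematic study of universal, transferable adversarial patch attacks on VLA models; 2. Introduces a feature-space objective, a robustness-augmented two-phase optimization procedure, and two VLA-specific losses; 3. Shows experimentally that the attack transfers across models, tasks, and viewpoints.
Method: 1. A feature-space objective (with an $\ell_1$ deviation prior and a repulsive InfoNCE loss) induces transferable representation shifts; 2. A two-phase min-max procedure learns invisible sample-wise perturbations in the inner loop and optimizes the universal patch in the outer loop; 3. Two VLA-specific losses are designed: Patch Attention Dominance and Patch Semantic Misalignment.
Result: Experiments show that UPA-RFAS transfers across diverse VLA models, manipulation suites, and physical executions, exposing a patch-based attack surface in VLA models.
Insight: The study reveals safety risks in real-world VLA deployment, underlines the importance of universal adversarial patch attacks, and provides a baseline for future defense research.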
Abstract: Vision-Language-Action (VLA) models are vulnerable to adversarial attacks, yet universal and transferable attacks remain underexplored, as most existing patches overfit to a single model and fail in black-box settings. To address this gap, we present a systematic study of universal, transferable adversarial patches against VLA-driven robots under unknown architectures, finetuned variants, and sim-to-real shifts. We introduce UPA-RFAS (Universal Patch Attack via Robust Feature, Attention, and Semantics), a unified framework that learns a single physical patch in a shared feature space while promoting cross-model transfer. UPA-RFAS combines (i) a feature-space objective with an $\ell_1$ deviation prior and repulsive InfoNCE loss to induce transferable representation shifts, (ii) a robustness-augmented two-phase min-max procedure where an inner loop learns invisible sample-wise perturbations and an outer loop optimizes the universal patch against this hardened neighborhood, and (iii) two VLA-specific losses: Patch Attention Dominance to hijack text$\to$vision attention and Patch Semantic Misalignment to induce image-text mismatch without labels. Experiments across diverse VLA models, manipulation suites, and physical executions show that UPA-RFAS consistently transfers across models, tasks, and viewpoints, exposing a practical patch-based attack surface and establishing a strong baseline for future defenses.
[72] G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
Wenbo Hu,Jingli Lin,Yilin Long,Yunlong Ran,Lihan Jiang,Yifan Wang,Chenming Zhu,Runsen Xu,Tai Wang,Jiangmiao Pang
Main category: cs.CV
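TL;DR: G$^2$VLM is a geometry-grounded vision-language model that unifies 3D reconstruction and spatial reasoning to improve the robustness of spatial intelligence.
Details
Motivation: Existing vision-language models perform poorly on spatial understanding and reasoning tasks, mainly because they lack a visual geometry learning process that reconstructs 3D space from 2D images.
Contribution: Proposes G$^2$VLM, which unifies 3D reconstruction with spatial reasoning and uses learned geometry features to strengthen spatial reasoning.
Method: The model trains on multi-view image and video data, leverages 3D visual priors, and improves spatial understanding through in-context learning and interleaved reasoning.
Result: Experiments show that G$^2$VLM achieves results comparable to state-of-the-art feed-forward 3D reconstruction models and better or competitive results on spatial understanding and reasoning tasks.
Insight: By unifying a semantically strong VLM with low-level 3D vision tasks, G$^2$VLM may serve as a strong baseline for spatial intelligence research.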
Abstract: Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.
[73] You Can Trust Your Clustering Model: A Parameter-free Self-Boosting Plug-in for Deep Clustering
Hanyang Li,Yuheng Jia,Hui Liu,Junhui Hou
Main category: cs.CV
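TL;DR: DCBoost is a parameter-free plug-in that uses reliable local structural cues to enhance global feature structure and improve the performance of deep clustering models.
Details
Motivation: Existing deep clustering models suffer from a disparity between global and local feature structures, which leads to blurred cluster boundaries and poorly separated clusters.
Contribution: Proposes DCBoost, a parameter-free plug-in that improves clustering through self-supervision from high-confidence samples.
Method: 1) Select high-confidence samples via adaptive k-nearest-neighbor consistency filtering; 2) compute a discriminative loss to optimize the network, promoting intra-class compactness and inter-class separability.
Result: Significantly improves clustering performance on multiple benchmark datasets, for example improving state-of-the-art baselines such as ProPos by more than 3% and amplifying the silhouette coefficient by over 7x.
Insight: Reliable local structural cues can effectively improve the global feature structure and thereby the performance of deep clustering models.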
Abstract: Recent deep clustering models have produced impressive clustering performance. However, a common issue with existing methods is the disparity between global and local feature structures. While local structures typically show strong consistency and compactness within class samples, global features often present intertwined boundaries and poorly separated clusters. Motivated by this observation, we propose DCBoost, a parameter-free plug-in designed to enhance the global feature structures of current deep clustering models. By harnessing reliable local structural cues, our method aims to elevate clustering performance effectively. Specifically, we first identify high-confidence samples through adaptive $k$-nearest neighbors-based consistency filtering, aiming to select a sufficient number of samples with high label reliability to serve as trustworthy anchors for self-supervision. Subsequently, these samples are utilized to compute a discriminative loss, which promotes both intra-class compactness and inter-class separability, to guide network optimization. Extensive experiments across various benchmark datasets showcase that our DCBoost significantly improves the clustering performance of diverse existing deep clustering models. Notably, our method improves the performance of current state-of-the-art baselines (e.g., ProPos) by more than 3% and amplifies the silhouette coefficient by over $7\times$. Code is available at https://github.com/l-h-y168/DCBoost.
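The filtering step can be pictured as keeping only samples whose nearest neighbors in feature space mostly share their pseudo-label. The sketch below uses a fixed `k` and a fixed agreement threshold for simplicity, whereas the paper describes an adaptive scheme whose details are not given in the abstract; both parameters and the function name are assumptions.

```python
import numpy as np

def knn_consistent_samples(features, labels, k=3, min_agreement=0.6):
    """Return indices of samples whose k nearest neighbors mostly share their label.

    Fixed k and threshold are simplifying assumptions; DCBoost's actual
    filtering is described as adaptive.
    """
    diffs = features[:, None, :] - features[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dist, np.inf)            # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    agreement = (labels[neighbors] == labels[:, None]).mean(axis=1)
    return np.flatnonzero(agreement >= min_agreement), agreement

# Two well-separated 1-D clusters; the last point of the first cluster
# carries a wrong pseudo-label.
features = np.concatenate([np.arange(6) * 0.1, 10 + np.arange(6) * 0.1])[:, None]
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
anchors, agreement = knn_consistent_samples(features, labels)
```

The mislabeled point (index 5) is rejected because none of its neighbors agree with it, while the remaining samples are kept as anchors for the self-supervised loss.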
[74] BotaCLIP: Contrastive Learning for Botany-Aware Representation of Earth Observation Data
Selene Cerna,Sara Si-Moussi,Wilfried Thuiller,Hadrien Hendrikx,Vincent Miele
Main category: cs.CV
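TL;DR: BotaCLIP is a lightweight multimodal contrastive learning framework that aligns high-resolution aerial imagery with botanical survey data (relevés) to inject botanical knowledge into a pre-trained Earth Observation foundation model, improving performance on ecological tasks.
Details
Motivation: Existing foundation models learn general representations but lack domain expertise in fields such as ecology. Training from scratch or fine-tuning is costly, so a lightweight way to inject domain knowledge is needed.
Contribution: 1. Proposes BotaCLIP, which aligns aerial imagery with botanical survey data through contrastive learning. 2. Designs a regularization strategy that mitigates catastrophic forgetting. 3. Validates the approach on several ecological tasks.
Method: 1. Build a multimodal contrastive learning framework that jointly optimizes over imagery and botanical data. 2. Use regularization to preserve the pre-trained model's original representations. 3. Produce domain-aware representations for downstream tasks.
Result: On plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation, BotaCLIP outperforms the original foundation model (DOFA) and supervised baselines.
Insight: Lightweight domain-knowledge injection can markedly improve a foundation model on specific tasks without adding much computational cost.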
Abstract: Foundation models have demonstrated a remarkable ability to learn rich, transferable representations across diverse modalities such as images, text, and audio. In modern machine learning pipelines, these representations often replace raw data as the primary input for downstream tasks. In this paper, we address the challenge of adapting a pre-trained foundation model to inject domain-specific knowledge, without retraining from scratch or incurring significant computational costs. To this end, we introduce BotaCLIP, a lightweight multimodal contrastive framework that adapts a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical relevés. Unlike generic embeddings, BotaCLIP internalizes ecological structure through contrastive learning with a regularization strategy that mitigates catastrophic forgetting. Once trained, the resulting embeddings serve as transferable representations for downstream predictors. Motivated by real-world applications in biodiversity modeling, we evaluated BotaCLIP representations in three ecological tasks: plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation. The results showed consistent improvements over those derived from DOFA and supervised baselines. More broadly, this work illustrates how domain-aware adaptation of foundation models can inject expert knowledge into data-scarce settings, enabling frugal representation learning.
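Contrastive alignment of two modalities is commonly implemented with a symmetric InfoNCE loss in the style of CLIP. The sketch below shows that standard loss in NumPy. The abstract does not give BotaCLIP's exact objective or its regularizer, so the loss form, the temperature value, and all names here are assumptions.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along one axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_emb, survey_emb, temperature=0.07):
    """CLIP-style loss: row i of each input is assumed to be a matching pair."""
    a = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    b = survey_emb / np.linalg.norm(survey_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature           # pairwise cosine similarities
    idx = np.arange(len(a))
    loss_image_to_survey = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_survey_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_survey + loss_survey_to_image)

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 16))              # toy aerial-image embeddings
aligned = image_emb + 0.01 * rng.normal(size=(8, 16))
misaligned = np.roll(aligned, shift=1, axis=0)    # pairs deliberately mismatched

loss_aligned = symmetric_contrastive_loss(image_emb, aligned)
loss_misaligned = symmetric_contrastive_loss(image_emb, misaligned)
```

The loss is low when matching pairs are the most similar entries in their row and column, and high when the pairing is broken, which is the signal that pulls the two modalities into a shared space.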
[75] Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition
Baoli Sun,Yihan Wang,Xinzhu Ma,Zhihui Wang,Kun Lu,Zhiyong Wang
Main category: cs.CV
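TL;DR: This paper proposes Action-Region Tracking (ART), a framework that uses a query-response mechanism to discover and track the dynamics of local details for fine-grained video action recognition. It combines text-constrained semantics with a multi-level tracklet contrastive constraint and outperforms previous state-of-the-art baselines.
Details
Motivation: Existing fine-grained action recognition methods struggle to capture subtle dynamics in local regions. This work addresses the problem by tracking the dynamics of action-related regions to improve recognition accuracy.
Contribution: 1. Proposes the ART framework, which uses a query-response mechanism to capture dynamic details in local regions; 2. Designs a region-specific semantic activation module and a multi-level tracklet contrastive constraint; 3. Uses text-constrained semantics to refine the semantic representations of vision-language models.
Method: 1. Use text-constrained semantics as queries to locate action-related regions; 2. Link responses across frames into action tracklets to capture dynamics; 3. Apply a multi-level tracklet contrastive constraint at the spatial and temporal levels; 4. Refine textual semantics with a task-specific fine-tuning mechanism.
Result: On widely used action recognition benchmarks, ART outperforms previous state-of-the-art baselines.
Insight: Jointly optimizing textual semantics and visual dynamics is an effective way to capture the local details that fine-grained action recognition depends on.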
Abstract: Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions evolving over time. In this work, we introduce the Action-Region Tracking (ART) framework, a novel solution leveraging a query-response mechanism to discover and track the dynamics of distinctive local details, enabling effective distinction of similar actions. Specifically, we propose a region-specific semantic activation module that employs discriminative and text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction among spatial and temporal dimensions with corresponding video features. The captured region responses are organized into action tracklets, which characterize region-based action dynamics by linking related responses across video frames in a coherent sequence. The text-constrained queries encode nuanced semantic representations derived from textual descriptions of action labels extracted by language branches within Visual Language Models (VLMs). To optimize the action tracklets, we design a multi-level tracklet contrastive constraint among region responses at spatial and temporal levels, enabling effective discrimination within each frame and correlation between adjacent frames. Additionally, a task-specific fine-tuning mechanism refines textual semantics such that semantic representations encoded by VLMs are preserved while optimized for task preferences. Comprehensive experiments on widely used action recognition benchmarks demonstrate its superiority over previous state-of-the-art baselines.
[76] T3-Tracer: A Tri-level Temporal-Aware Framework for Audio Forgery Detection and Localization
Shuhan Xia,Xuannan Liu,Xing Cui,Peipei Li
Main category: cs.CV
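TL;DR: The paper proposes T3-Tracer, the first framework to jointly analyze audio at the frame, segment, and audio levels for comprehensive detection and localization of partial audio forgery. It has two core modules, FA-FAM and SMDAM, which detect local forgery cues at the frame level and forgery boundaries at the segment level, respectively.
Details
Motivation: Partial audio forgery selectively modifies critical frames while preserving perceptual authenticity, and existing methods cannot detect anomalies across temporal levels. A hierarchical structure is needed to capture both transient and sustained anomalies.
Contribution: 1. Proposes T3-Tracer, the first framework to jointly analyze frame-level, segment-level, and audio-level features; 2. Designs the FA-FAM and SMDAM modules for frame-level detection and segment-level boundary identification, respectively; 3. Validates the method on multiple datasets.
Method: T3-Tracer comprises FA-FAM, which combines frame-level and audio-level temporal information to detect local and global anomalies, and SMDAM, a dual-branch module that models frame features and inter-frame differences across multi-scale temporal windows to identify forgery boundaries.
Result: Experiments on three challenging datasets show that T3-Tracer achieves state-of-the-art performance.
Insight: Jointly analyzing temporal information at multiple levels is key to detecting partial audio forgery, and modeling features at different time scales improves detection accuracy.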
Abstract: Recently, partial audio forgery has emerged as a new form of audio manipulation. Attackers selectively modify partial but semantically critical frames while preserving the overall perceptual authenticity, making such forgeries particularly difficult to detect. Existing methods focus on independently detecting whether a single frame is forged, lacking the hierarchical structure to capture both transient and sustained anomalies across different temporal levels. To address these limitations, we identify three key levels relevant to partial audio forgery detection and present T3-Tracer, the first framework that jointly analyzes audio at the frame, segment, and audio levels to comprehensively detect forgery traces. T3-Tracer consists of two complementary core modules: the Frame-Audio Feature Aggregation Module (FA-FAM) and the Segment-level Multi-Scale Discrepancy-Aware Module (SMDAM). FA-FAM is designed to detect the authenticity of each audio frame. It combines both frame-level and audio-level temporal information to detect intra-frame forgery cues and global semantic inconsistencies. To further refine and correct frame detection, we introduce SMDAM to detect forgery boundaries at the segment level. It adopts a dual-branch architecture that jointly models frame features and inter-frame differences across multi-scale temporal windows, effectively identifying abrupt anomalies that appear at the forged boundaries. Extensive experiments conducted on three challenging datasets demonstrate that our approach achieves state-of-the-art performance.
[77] FIELDS: Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision
Chen Ling,Henglin Shi,Hedvig Kjellström
Main category: cs.CV
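TL;DR: FIELDS is a face reconstruction method that combines 2D self-supervision with direct 3D supervision, using real 4D facial scans and an emotion recognition branch to address the loss of subtle emotional detail in 3D face reconstruction.
Details
Motivation: Existing 3D face reconstruction methods often rely on 2D supervision and lack 3D ground truth, so they miss subtle affective details. FIELDS aims to close this gap.
Contribution: 1. A dual-supervision strategy that combines 2D self-supervision with direct 3D supervision. 2. An emotion recognition branch and an intensity-aware loss that make emotional detail more faithful.
Method: Spontaneous 4D facial scans supervise the 3D expression parameters directly, an emotion recognition branch improves expression accuracy, and an intensity-aware loss prevents exaggerated expressions.
Result: FIELDS produces emotion-rich 3D face models that significantly improve in-the-wild facial expression recognition without sacrificing naturalness.
Insight: Combining direct 3D supervision with 2D self-supervision is key to recovering emotional detail in 3D face reconstruction, and the emotion recognition branch helps correct expression-intensity bias.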
Abstract: Facial expressions convey the bulk of emotional information in human communication, yet existing 3D face reconstruction methods often miss subtle affective details due to reliance on 2D supervision and lack of 3D ground truth. We propose FIELDS (Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision) to address these limitations by extending self-supervised 2D image consistency cues with direct 3D expression parameter supervision and an auxiliary emotion recognition branch. Our encoder is guided by authentic expression parameters from spontaneous 4D facial scans, while an intensity-aware emotion loss encourages the 3D expression parameters to capture genuine emotion content without exaggeration. This dual-supervision strategy bridges the 2D/3D domain gap and mitigates expression-intensity bias, yielding high-fidelity 3D reconstructions that preserve subtle emotional cues. From a single image, FIELDS produces emotion-rich face models with highly realistic expressions, significantly improving in-the-wild facial expression recognition performance without sacrificing naturalness.
[78] Shift-Equivariant Complex-Valued Convolutional Neural Networks
Quentin Gabot,Teck-Yian Lim,Jérémy Fix,Joana Frontera-Pons,Chengfang Ren,Jean-Philippe Ovarlez
Main category: cs.CV
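TL;DR: This paper extends shift-equivariant network design to complex-valued convolutional neural networks, using theoretical analysis and a new projection-layer building block to address the lack of shift equivariance and invariance in conventional CNNs, and validates the approach on computer vision tasks.
Details
Motivation: Conventional convolutional neural networks lack shift equivariance and invariance because downsampling and upsampling operations break these properties. Data augmentation can partly compensate, but the authors want to solve the problem at its root with layers that guarantee the properties by construction.
Contribution: The main contribution is extending Learnable Polyphase up/downsampling (LPS) to complex-valued neural networks and proposing a new projection-layer building block, giving theoretical guarantees of shift equivariance and invariance.
Method: The authors extend LPS to complex-valued neural networks and design a projection layer from $\mathbb{C}$ to $\mathbb{R}$ placed before the Gumbel Softmax. The methods are validated on classification, reconstruction, and semantic segmentation tasks.
Result: The method is evaluated on several computer vision problems using polarimetric Synthetic Aperture Radar images, testing invariance in classification and equivariance in reconstruction and semantic segmentation.
Insight: Theoretically grounded building blocks can achieve shift equivariance and invariance in convolutional networks systematically, rather than relying on data augmentation alone. This matters for tasks that need precise spatial information, such as medical or radar imaging.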
Abstract: Convolutional neural networks have shown remarkable performance in recent years on various computer vision problems. However, the traditional convolutional neural network architecture lacks a critical property: shift equivariance and invariance, broken by downsampling and upsampling operations. Although data augmentation techniques can help the model learn the latter property empirically, a consistent and systematic way to achieve this goal is by designing downsampling and upsampling layers that theoretically guarantee these properties by construction. Adaptive Polyphase Sampling (APS) introduced the cornerstone for shift invariance, later extended to shift equivariance with Learnable Polyphase up/downsampling (LPS) applied to real-valued neural networks. In this paper, we extend the work on LPS to complex-valued neural networks both from a theoretical perspective and with a novel building block of a projection layer from $\mathbb{C}$ to $\mathbb{R}$ before the Gumbel Softmax. We finally evaluate this extension on several computer vision problems, specifically for either the invariance property in classification tasks or the equivariance property in both reconstruction and semantic segmentation problems, using polarimetric Synthetic Aperture Radar images.
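The abstract builds on Adaptive Polyphase Sampling (APS), which restores shift consistency by choosing the polyphase component with the largest norm instead of always keeping the even-indexed samples. The 1-D sketch below shows that selection rule on a complex-valued signal. It illustrates APS only, not the learnable LPS variant or the paper's complex-valued projection layer, and the helper names are assumptions.

```python
import numpy as np

def naive_downsample(x, stride=2):
    """Standard strided downsampling: always keep phase 0."""
    return x[::stride]

def aps_downsample(x, stride=2):
    """APS-style downsampling: keep the polyphase component with the largest norm."""
    components = [x[phase::stride] for phase in range(stride)]
    norms = [np.linalg.norm(c) for c in components]  # works for complex input
    return components[int(np.argmax(norms))]

def equal_up_to_circular_shift(a, b):
    """True if b is some circular shift of a."""
    return any(np.allclose(np.roll(a, s), b) for s in range(len(a)))

rng = np.random.default_rng(0)
signal = rng.normal(size=8) + 1j * rng.normal(size=8)  # toy complex signal
shifted = np.roll(signal, 1)                           # circular shift by one sample

aps_consistent = equal_up_to_circular_shift(aps_downsample(signal),
                                            aps_downsample(shifted))
naive_consistent = equal_up_to_circular_shift(naive_downsample(signal),
                                              naive_downsample(shifted))
```

A one-sample shift swaps the two polyphase components, so the naive scheme returns different content, while the norm-based choice picks the same component either way and the outputs differ by at most a circular shift.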
[79] AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs
Shuhan Xia,Peipei Li,Xuannan Liu,Dongsen Zhang,Xinyu Guo,Zekun Li
Main category: cs.CV
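TL;DR: AVFakeBench is the first comprehensive audio-video forgery detection benchmark, covering multiple forgery semantics and multi-granularity annotations. It evaluates 11 AV-LMMs and 2 detection methods and reveals their weaknesses in fine-grained perception and reasoning.
Details
Motivation: Existing benchmarks are confined to DeepFake-style forgeries and single-granularity annotations, so they fail to reflect the diversity and complexity of real-world forgery scenarios.
Contribution: Introduces the AVFakeBench benchmark, which supports multi-task evaluation (binary judgment, forgery type classification, forgery detail selection, and explanatory reasoning), and designs a multi-stage hybrid forgery framework to generate high-quality data.
Method: A multi-stage hybrid forgery framework combines proprietary models for task planning with expert generative models for manipulation; a multi-task evaluation framework covers binary judgment, forgery type classification, and further tasks.
Result: The evaluation shows that AV-LMMs have potential as emerging forgery detectors but are notably weak in fine-grained perception and reasoning.
Insight: AVFakeBench offers a more comprehensive test bed for audio-video forgery detection and encourages research on joint multi-task evaluation.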
Abstract: The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, thus failing to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human subject and general subject. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery types classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.
[80] Unlocking Zero-shot Potential of Semi-dense Image Matching via Gaussian Splatting
Juncheng Chen,Chao Xu,Yanjun Cao
Main category: cs.CV
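TL;DR: The MatchGS framework uses geometric correction and data generation to produce high-quality correspondence labels from 3DGS, substantially improving the zero-shot performance of image matching.
Details
Motivation: Learning-based image matching depends on large amounts of high-quality training data, but the geometry and depth rendered by existing 3DGS pipelines are biased, which limits their use for correspondence labeling.
Contribution: 1) A geometrically faithful data generation pipeline that corrects the geometric inaccuracies of 3DGS; 2) a 2D-3D representation alignment strategy that injects the explicit 3D knowledge of 3DGS into 2D matchers.
Method: 1) Refine 3DGS geometry so that the generated data yields precise labels; 2) align the 3D representation of 3DGS with the features of the 2D matcher.
Result: The generated ground-truth correspondences reduce epipolar error by up to 40 times, and zero-shot performance improves by up to 17.7%.
Insight: With geometric refinement and representation alignment, 3DGS can serve as a high-quality, scalable source of training data for zero-shot image matching.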
Abstract: Learning-based image matching critically depends on large-scale, diverse, and geometrically accurate training data. 3D Gaussian Splatting (3DGS) enables photorealistic novel-view synthesis and thus is attractive for data generation. However, its geometric inaccuracies and biased depth rendering currently prevent robust correspondence labeling. To address this, we introduce MatchGS, the first framework designed to systematically correct and leverage 3DGS for robust, zero-shot image matching. Our approach is twofold: (1) a geometrically-faithful data generation pipeline that refines 3DGS geometry to produce highly precise correspondence labels, enabling the synthesis of a vast and diverse range of viewpoints without compromising rendering fidelity; and (2) a 2D-3D representation alignment strategy that infuses 3DGS’ explicit 3D knowledge into the 2D matcher, guiding 2D semi-dense matchers to learn viewpoint-invariant 3D representations. Our generated ground-truth correspondences reduce the epipolar error by up to 40 times compared to existing datasets, enable supervision under extreme viewpoint changes, and provide self-supervisory signals through Gaussian attributes. Consequently, state-of-the-art matchers trained solely on our data achieve significant zero-shot performance gains on public benchmarks, with improvements of up to 17.7%. Our work demonstrates that with proper geometric refinement, 3DGS can serve as a scalable, high-fidelity, and structurally-rich data source, paving the way for a new generation of robust zero-shot image matchers.
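The abstract reports label quality in terms of epipolar error without defining it. One common definition, assumed here, is the pixel distance from a matched point in the second image to the epipolar line induced by its partner in the first image. The sketch below uses a hypothetical fundamental matrix for a rectified stereo pair, chosen only because the expected distances are easy to check by hand.

```python
import math


def mat_vec(F, x):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(F[i][j] * x[j] for j in range(3)) for i in range(3)]


def epipolar_distance(F, p1, p2):
    """Distance in pixels from p2 to the epipolar line F @ p1 in image 2."""
    a, b, c = mat_vec(F, [p1[0], p1[1], 1.0])
    return abs(a * p2[0] + b * p2[1] + c) / math.hypot(a, b)


# Fundamental matrix of a purely horizontal camera translation
# (rectified stereo): corresponding points share the same image row.
F_RECTIFIED = [[0.0, 0.0, 0.0],
               [0.0, 0.0, -1.0],
               [0.0, 1.0, 0.0]]

on_line = epipolar_distance(F_RECTIFIED, (10.0, 20.0), (4.0, 20.0))
off_line = epipolar_distance(F_RECTIFIED, (10.0, 20.0), (4.0, 23.5))
print(on_line, off_line)
```

A correspondence label with a large distance under the true camera geometry is geometrically inconsistent, which is the kind of error the refined labels are reported to reduce.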
[81] Co-Training Vision Language Models for Remote Sensing Multi-task Learning
Qingyun Li,Shuran Ma,Junwei Luo,Yi Yu,Yue Zhou,Fengxiang Wang,Xudong Lu,Xiaoxing Wang,Xin He,Yushi Chen,Xue Yang,Junchi Yan
Main category: cs.CV
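TL;DR: The paper presents RSCoVLM, a vision language model baseline for remote sensing multi-task learning. A data curation engine and a dynamic-resolution strategy improve performance, and the model reaches state-of-the-art results across diverse tasks.
Details
Motivation: In remote sensing, multi-task learning (MTL) offers better generalization and practical applicability, and vision language models (VLMs) have shown promise in image understanding and reasoning. The paper aims to address multi-task challenges with a single unified model.
Contribution: 1. A data engine for handling complex remote sensing data environments; 2. a dynamic-resolution strategy and the Zoom-in Chain mechanism for ultra-high-resolution images; 3. improved object detection capability and a fair evaluation protocol.
Method: 1. Build the data engine (data acquisition, processing, and weighting); 2. apply the dynamic-resolution strategy and the Zoom-in Chain mechanism; 3. improve the object detection capability and design a new evaluation protocol.
Result: RSCoVLM performs strongly across diverse tasks, outperforming existing remote sensing VLMs and rivaling specialized expert models. The datasets and model weights are open-sourced.
Insight: With a unified text interface and a multi-task learning strategy, VLMs can handle complex remote sensing tasks efficiently, and a dynamic-resolution strategy is an effective way to reduce the computational burden of ultra-high-resolution images.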
Abstract: With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. Firstly, we create the data curation engine, including data acquisition, offline processing and integrating, as well as online loading and weighting. This data engine effectively addresses complex RS data environments and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. The strategies are flexible and effectively mitigate the computational burdens. Additionally, we significantly enhance the model’s object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All the training and evaluation tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.
[82] PathMamba: A Hybrid Mamba-Transformer for Topologically Coherent Road Segmentation in Satellite Imagery
Jules Decaestecker,Nicolas Vigne
Main category: cs.CV
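TL;DR: PathMamba is a hybrid architecture combining Mamba and Transformer blocks for road segmentation in satellite imagery. It balances global context with topological continuity while remaining computationally efficient.
Details
Motivation: Road segmentation from satellite imagery is in high demand. Existing methods such as Vision Transformers capture global context but have high computational complexity, whereas Mamba is well suited to modeling continuous structures. The two are complementary.
Contribution: Proposes the PathMamba hybrid architecture, which combines Mamba's sequential modeling with the Transformer's global reasoning, markedly improving topological continuity at an efficient computational cost.
Method: PathMamba uses Mamba blocks to capture the continuity of roads and Transformer blocks to provide global context, combining the strengths of both.
Result: On the DeepGlobe and Massachusetts Roads datasets, PathMamba achieves state-of-the-art performance and significantly improves the APLS metric while remaining computationally competitive.
Insight: A hybrid architecture can effectively combine sequence modeling and global reasoning, offering an efficient, high-quality solution for tasks such as road segmentation.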
Abstract: Achieving both high accuracy and topological continuity in road segmentation from satellite imagery is a critical goal for applications ranging from urban planning to disaster response. State-of-the-art methods often rely on Vision Transformers, which excel at capturing global context, yet their quadratic complexity is a significant barrier to efficient deployment, particularly for on-board processing in resource-constrained platforms. In contrast, emerging State Space Models like Mamba offer linear-time efficiency and are inherently suited to modeling long, continuous structures. We posit that these architectures have complementary strengths. To this end, we introduce PathMamba, a novel hybrid architecture that integrates Mamba’s sequential modeling with the Transformer’s global reasoning. Our design strategically uses Mamba blocks to trace the continuous nature of road networks, preserving topological structure, while integrating Transformer blocks to refine features with global context. This approach yields topologically superior segmentation maps without the prohibitive scaling costs of pure attention-based models. Our experiments on the DeepGlobe Road Extraction and Massachusetts Roads datasets demonstrate that PathMamba sets a new state-of-the-art. Notably, it significantly improves topological continuity, as measured by the APLS metric, setting a new benchmark while remaining computationally competitive.
[83] HTTM: Head-wise Temporal Token Merging for Faster VGGT
Weitian Wang,Lukas Meiner,Rai Shubham,Cecilia De La Parra,Akash Kumar
Main category: cs.CV
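TL;DR: HTTM is an acceleration method for VGGT. It merges tokens at multi-head granularity, which avoids the loss of representational ability caused by existing methods that merge tokens uniformly across attention heads.
Details
Motivation: VGGT performs well in 3D scene reconstruction, but its global attention causes a significant latency bottleneck on long-sequence inputs, so an efficient token merging method is needed.
Contribution: Proposes HTTM, a training-free 3D token merging method that merges at multi-head granularity to preserve token uniqueness and exploits spatial locality and temporal correspondence to reach higher merging ratios.
Method: HTTM merges tokens per head within multi-head attention layers, avoiding the feature degradation caused by uniform merging while lowering the merging cost.
Result: Up to 7x acceleration in GPU-based inference with negligible performance drops.
Insight: Token merging in multi-head attention should account for differences between heads, and locality and correspondence can be used to make merging more efficient.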
Abstract: The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers’ output, which hinders the model’s representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the uniqueness of feature tokens after head concatenation. Additionally, this enables HTTM to leverage the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios with lower merging costs compared to existing methods. Thus, HTTM achieves up to 7x acceleration with negligible performance drops in a GPU-based inference.
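The claim that uniform merging produces identical tokens after head concatenation can be seen in a toy sketch. Merging is simplified here to averaging a fixed pair of per-head features, and the choice of pairs is hard-coded; how HTTM actually selects pairs from spatial locality and temporal correspondence is not modeled, and all names are illustrative.

```python
def merge_pair(head_tokens, i, j):
    """Replace tokens i and j of one head with their average."""
    avg = [(a + b) / 2 for a, b in zip(head_tokens[i], head_tokens[j])]
    merged = list(head_tokens)
    merged[i] = merged[j] = avg
    return merged


def concat_heads(heads):
    """Concatenate per-head features token by token."""
    return [sum((h[t] for h in heads), []) for t in range(len(heads[0]))]


# Three tokens, two heads, one feature per head (toy values).
head_a = [[1.0], [2.0], [3.0]]
head_b = [[10.0], [20.0], [30.0]]

# Uniform merging: the same pair (0, 1) is merged in every head.
uniform = concat_heads([merge_pair(head_a, 0, 1), merge_pair(head_b, 0, 1)])
# Head-wise merging: each head merges a different pair.
head_wise = concat_heads([merge_pair(head_a, 0, 1), merge_pair(head_b, 1, 2)])

print(uniform)    # tokens 0 and 1 are identical after concatenation
print(head_wise)  # all three tokens remain distinct
```

In this toy case both variants merge one pair per head, yet only the head-wise variant keeps every concatenated token unique.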
[84] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
Stefanos Koutoupis,Michaela Areti Zervou,Konstantinos Kontras,Maarten De Vos,Panagiotis Tsakalides,Grigorios Tsagatakis
Main category: cs.CV
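TL;DR: The paper proposes Contrastive Fusion (ConFu), a contrastive learning framework that jointly embeds individual modalities and their fused combinations, achieving multimodal alignment while capturing higher-order dependencies.
Details
Motivation: Most existing methods align modalities pairwise and ignore higher-order interactions among multiple modalities, while methods that target higher-order interactions can hurt performance on single-modality tasks. ConFu aims to address both issues.
Contribution: Proposes the ConFu framework, which extends the contrastive objective with a fused-modality contrastive term to achieve higher-order alignment while preserving pairwise relationships.
Method: ConFu embeds individual modalities and their fused combinations through contrastive learning, and the fused-modality contrastive term captures higher-order dependencies.
Result: On synthetic and real-world multimodal benchmarks, ConFu performs competitively on retrieval and classification and supports unified one-to-one and two-to-one retrieval.
Insight: Optimizing higher-order and pairwise relationships together can improve multimodal models, and the framework scales with increasing multimodal complexity.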
Abstract: Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
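The XOR-like dependency mentioned in the abstract can be made concrete with a small enumeration. It is only an illustration of why pairwise alignment is insufficient, with three binary stand-in "modalities" `a`, `b`, and `c = a XOR b` chosen for this example; it does not implement the ConFu loss.

```python
from itertools import product

# All four equally likely (a, b) pairs; the third modality is c = a XOR b.
samples = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]


def prob_c_is_one(condition):
    """P(c = 1 | condition) over the uniform sample set."""
    matching = [c for a, b, c in samples if condition(a, b)]
    return sum(matching) / len(matching)


# Pairwise view: knowing a alone (or b alone) says nothing about c.
pairwise = [prob_c_is_one(lambda a, b, v=v: a == v) for v in (0, 1)]
pairwise += [prob_c_is_one(lambda a, b, v=v: b == v) for v in (0, 1)]

# Fused view: knowing (a, b) together determines c exactly.
fused = {(a, b): prob_c_is_one(lambda x, y, a=a, b=b: (x, y) == (a, b))
         for a, b in product([0, 1], repeat=2)}

print(pairwise)
print(fused)
```

Every pairwise conditional probability stays at chance, while the fused pair predicts `c` with certainty, which is the kind of signal a fused-modality term can exploit.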
[85] SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding
Tae-Min Choi,Tae Kyeong Jeong,Garam Kim,Jaemin Lee,Yeongyoon Koh,In Cheul Choi,Jae-Ho Chung,Jong Woong Park,Juyoun Park
Main category: cs.CV
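TL;DR: The paper introduces SurgMLLMBench, a benchmark dataset designed for multimodal large language models (LLMs) in surgical scene understanding, addressing gaps in existing datasets.
Details
Motivation: Existing surgical datasets mostly use a visual question answering (VQA) format, lack a unified taxonomy, and do not support pixel-level segmentation, which limits consistent evaluation and application of multimodal LLMs.
Contribution: Presents the unified SurgMLLMBench dataset, including the MAVIS subset, which integrates pixel-level instrument segmentation masks and structured VQA annotations across multiple surgical domains.
Method: The dataset consolidates annotations from laparoscopic, robot-assisted, and micro-surgical domains, supporting visual-conversational interaction and evaluation beyond traditional VQA tasks.
Result: Experiments show that a single model trained on SurgMLLMBench performs consistently across domains and generalizes effectively to unseen datasets.
Insight: SurgMLLMBench is expected to advance the development and reproducible evaluation of interactive surgical reasoning models and to provide a solid foundation for multimodal surgical AI research.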
Abstract: Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.
[86] Endo-G$^{2}$T: Geometry-Guided & Temporally Aware Time-Embedded 4DGS For Endoscopic Scenes
Yangle Liu,Fengze Li,Kan Liu,Jieming Ma
Main category: cs.CV
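TL;DR: Endo-G$^{2}$T combines geometry guidance with a temporally aware training scheme. Depth-prior distillation and a time-embedded Gaussian field improve the geometric and temporal consistency of dynamic endoscopic scene reconstruction, with good efficiency and long-horizon stability.
Details
Motivation: Endoscopic video exhibits strong view-dependent effects (specularities, wet reflections, and occlusions), and pure photometric supervision easily causes geometric drift. A method is needed that anchors geometry in 4D Gaussian splatting while maintaining temporal consistency and efficiency.
Contribution: 1. Geometry-guided prior distillation using scale-invariant depth and depth-gradient losses; 2. a time-embedded Gaussian field that represents dynamic geometry with lightweight regularization; 3. keyframe-constrained streaming optimization that improves efficiency and long-horizon stability.
Method: 1. Geometry-guided prior distillation; 2. time-embedded Gaussian field; 3. keyframe-constrained streaming optimization.
Result: Achieves state-of-the-art results among monocular reconstruction baselines on the EndoNeRF and StereoMIS-P1 datasets.
Insight: Combining geometric priors with dynamic modeling is key to better endoscopic scene reconstruction, and lightweight regularization together with a keyframe strategy markedly improves efficiency and stability.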
Abstract: Endoscopic (endo) video exhibits strong view-dependent effects such as specularities, wet reflections, and occlusions. Pure photometric supervision misaligns with geometry and triggers early geometric drift, where erroneous shapes are reinforced during densification and become hard to correct. We ask how to anchor geometry early for 4D Gaussian splatting (4DGS) while maintaining temporal consistency and efficiency in dynamic endoscopic scenes. Thus, we present Endo-G$^{2}$T, a geometry-guided and temporally aware training scheme for time-embedded 4DGS. First, geo-guided prior distillation converts confidence-gated monocular depth into supervision with scale-invariant depth and depth-gradient losses, using a warm-up-to-cap schedule to inject priors softly and avoid early overfitting. Second, a time-embedded Gaussian field represents dynamics in XYZT with a rotor-like rotation parameterization, yielding temporally coherent geometry with lightweight regularization that favors smooth motion and crisp opacity boundaries. Third, keyframe-constrained streaming improves efficiency and long-horizon stability through keyframe-focused optimization under a max-points budget, while non-keyframes advance with lightweight updates. Across EndoNeRF and StereoMIS-P1 datasets, Endo-G$^{2}$T achieves state-of-the-art results among monocular reconstruction baselines.
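The abstract names a scale-invariant depth loss without giving its form. A widely used formulation, assumed here and not confirmed as the paper's choice, is the scale-invariant log-depth error: with $d_i = \log \hat{y}_i - \log y_i$, the loss is $\frac{1}{n}\sum_i d_i^2 - \frac{\lambda}{n^2}\left(\sum_i d_i\right)^2$. The sketch below shows that with $\lambda = 1$ a global rescaling of the prediction costs nothing, which is the property that makes monocular depth priors usable despite their unknown scale. All depths must be positive.

```python
import math


def scale_invariant_log_loss(pred, target, lam=1.0):
    """Scale-invariant log-depth error; lam=1 ignores any global scale factor."""
    diffs = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(diffs)
    return sum(d * d for d in diffs) / n - lam * (sum(diffs) / n) ** 2


target = [1.0, 2.0, 4.0, 8.0]
rescaled = [3.0 * t for t in target]   # right shape, wrong global scale
distorted = [1.0, 2.5, 3.0, 9.0]       # wrong relative structure

loss_rescaled = scale_invariant_log_loss(rescaled, target)
loss_distorted = scale_invariant_log_loss(distorted, target)
print(loss_rescaled, loss_distorted)
```

The rescaled prediction gives a loss of zero up to floating-point error, while the distorted one is penalized.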
[87] Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning
Xin Gu,Haoji Zhang,Qihang Fan,Jingxuan Niu,Zhipeng Zhang,Libo Zhang,Guang Chen,Fan Chen,Longyin Wen,Sijie Zhu
Main category: cs.CV
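TL;DR: The paper proposes STVG-o1, the first framework to apply off-the-shelf multimodal large language models (MLLMs) to spatio-temporal video grounding (STVG). A bounding-box chain-of-thought mechanism and reinforcement fine-tuning yield state-of-the-art performance.
Details
Motivation: MLLMs underperform on STVG, mainly because of misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders.
Contribution: 1. STVG-o1, the first MLLM framework for STVG that requires no architectural modifications; 2. a bounding-box chain-of-thought mechanism; 3. a multi-dimensional reinforcement reward function.
Method: 1. The bounding-box chain-of-thought mechanism reasons about spatio-temporal locations in an intermediate step; 2. a multi-dimensional reward function (format, consistency, temporal, spatial, and think rewards) supervises reinforcement fine-tuning.
Result: Sets new state-of-the-art results on HCSTVG-v1/v2, improving m_tIoU by 7.3% on HCSTVG-v1 over the best task-specific method, matches specialized models on VidSTG, and surpasses all existing MLLM-based approaches.
Insight: The bounding-box chain of thought and the multi-dimensional reward markedly improve MLLM performance on STVG, supporting MLLMs as viable backbones for precise spatio-temporal grounding.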
Abstract: Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions. Despite their strong language understanding, multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders. To address this, we propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without any architectural modifications. Our method introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations in an intermediate step before producing the final prediction. We further design a multi-dimensional reinforcement reward function consisting of format, consistency, temporal, spatial, and think rewards, which provides geometry-aware supervision through reinforcement fine-tuning. Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 sets new state-of-the-art results on HCSTVG, outperforming the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matching specialized models on VidSTG, and surpassing all existing MLLM-based approaches by large margins. It also demonstrates strong open-vocabulary generalization across datasets, establishing MLLMs as viable and powerful backbones for precise spatio-temporal grounding. Our code and models will be released.
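The abstract lists temporal and spatial rewards and reports m_tIoU, but does not define the reward functions. The sketch below shows only the standard overlap measures such rewards are commonly built on, namely temporal IoU over intervals and spatial IoU over boxes; using them as the reward terms is an assumption, and the function names are illustrative.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals, in frames or seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


t_iou = temporal_iou((2.0, 6.0), (4.0, 8.0))
s_iou = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
print(t_iou, s_iou)
```

Averaging `temporal_iou` over a test set gives a mean temporal IoU, and both quantities lie in [0, 1], which makes them convenient as bounded reward signals.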
[88] Monet: Reasoning in Latent Visual Space Beyond Images and Language
Qixun Wang,Yang Shi,Yifei Wang,Yuanxing Zhang,Pengfei Wan,Kun Gai,Xianghua Ying,Yisen Wang
Main category: cs.CV
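TL;DR: The paper proposes Monet, a framework that generates continuous latent embeddings as intermediate visual thoughts, improving the ability of multimodal large language models to reason in latent visual space and removing the reliance on external tools.
Details
Motivation: Existing visual reasoning methods depend on external tools and lack the flexibility of human abstract visual thinking. Monet aims to overcome this by reasoning directly in latent visual space.
Contribution: 1. The Monet framework for latent visual reasoning; 2. a three-stage distillation-based supervised fine-tuning pipeline that addresses the training challenges; 3. VLPO, a method for optimizing latent reasoning; 4. the high-quality Monet-SFT-125K dataset.
Method: 1. A three-stage distillation-based SFT pipeline aligns the latent visual space; 2. the VLPO reinforcement learning method explicitly optimizes latent embeddings; 3. training uses the high-quality CoT dataset Monet-SFT-125K.
Result: Monet-7B shows consistent gains on perception and reasoning benchmarks and strong generalization on abstract visual reasoning tasks.
Insight: The paper reveals a limitation of GRPO for latent reasoning and proposes VLPO as an improvement. It also reports early unsuccessful attempts, offering lessons for future work on visual latent reasoning.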
Abstract: “Thinking with images” has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.
[89] SAM Guided Semantic and Motion Changed Region Mining for Remote Sensing Change Captioning
Futian Wang,Mengqi Wang,Xiao Wang,Haowen Wang,Jin Tang
Main category: cs.CV
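TL;DR: The paper proposes a remote sensing change captioning method based on SAM (Segment Anything Model). It extracts semantic and motion change regions and adds a knowledge graph to improve caption accuracy, reaching state-of-the-art performance on multiple datasets.
Details
Motivation: Existing remote sensing change captioning methods have weak region awareness and limited temporal alignment, so the authors explore using the SAM foundation model to extract region-level representations and a knowledge graph to improve captions.
Contribution: 1. First use of the SAM foundation model for remote sensing change captioning; 2. a method that fuses semantic and motion change regions; 3. a knowledge graph that enriches object information.
Method: 1. Extract global visual features with a CNN/Transformer; 2. use SAM to extract semantic and motion change regions; 3. build a knowledge graph to provide object information; 4. fuse the information through cross-attention and generate the caption with a Transformer decoder.
Result: Achieves state-of-the-art performance on multiple benchmark datasets.
Insight: Introducing SAM markedly improves region awareness, and fusing a knowledge graph further strengthens the model's semantic understanding.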
Abstract: Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed between two remote sensing images captured at different times. Existing methods typically employ CNNs/Transformers to extract visual representations from the given images or incorporate auxiliary tasks to enhance the final results, with weak region awareness and limited temporal alignment. To address these issues, this paper explores the use of the SAM (Segment Anything Model) foundation model to extract region-level representations and inject region-of-interest knowledge into the captioning framework. Specifically, we employ a CNN/Transformer model to extract global-level vision features, leverage the SAM foundation model to delineate semantic- and motion-level change regions, and utilize a specially constructed knowledge graph to provide information about objects of interest. These heterogeneous sources of information are then fused via cross-attention, and a Transformer decoder is used to generate the final natural language description of the observed changes. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple widely used benchmark datasets. The source code of this paper will be released on https://github.com/Event-AHU/SAM_ChangeCaptioning
[90] E-M3RF: An Equivariant Multimodal 3D Re-assembly Framework
Adeela Islam,Stefano Fiorini,Manuel Lecha,Theodore Tsesmelis,Stuart James,Pietro Morerio,Alessio Del Bue
Main category: cs.CV
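TL;DR: E-M3RF is an equivariant 3D reassembly framework built on multimodal features. It combines geometric and color features to solve the reassembly problem and markedly reduces rotation and translation errors.
Details
Motivation: Existing 3D reassembly methods rely mainly on geometric features and perform poorly when geometry is insufficient or ambiguous (e.g., small, eroded, or symmetric fragments). They also lack explicit physical constraints that prevent overlap.
Contribution: Proposes a multimodal 3D reassembly framework that combines geometric and color features and predicts transformations with SE(3) flow matching, improving the accuracy and robustness of reassembly.
Method: A rotation-equivariant encoder extracts geometric features, a Transformer encodes color features, the two are combined into a multimodal representation, and SE(3) flow matching predicts the transformations.
Result: On the RePAIR dataset, E-M3RF reduces rotation error by 23.1%, translation error by 13.2%, and Chamfer Distance by 18.4%.
Insight: Adding multimodal features markedly improves 3D reassembly in difficult cases, especially when geometric information is insufficient.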
Abstract: 3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been addressed with deep learning methods rather than classical optimization. While learning approaches have shown promising results, most still rely primarily on geometric features to assemble a whole from its parts. As a result, methods struggle when geometry alone is insufficient or ambiguous, for example, for small, eroded, or symmetric fragments. Additionally, solutions do not impose physical constraints that explicitly prevent overlapping assemblies. To address these limitations, we introduce E-M3RF, an equivariant multimodal 3D reassembly framework that takes as input the point clouds of fractured fragments, containing both point positions and colors, and predicts the transformations required to reassemble them using SE(3) flow matching. Each fragment is represented by both geometric and color features: i) 3D point positions are encoded as rotation-consistent geometric features using a rotation-equivariant encoder, ii) the colors at each 3D point are encoded with a transformer. The two feature sets are then combined to form a multimodal representation. We experimented on four datasets: two synthetic datasets, Breaking Bad and Fantastic Breaks, and two real-world cultural heritage datasets, RePAIR and Presious, demonstrating that E-M3RF on the RePAIR dataset reduces rotation error by 23.1% and translation error by 13.2%, while Chamfer Distance decreases by 18.4% compared to competing methods.
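Chamfer Distance, one of the reported metrics, compares two point sets through nearest-neighbour distances. Conventions differ between papers (squared or unsquared distances, sums or means), so the sketch below picks one, the sum of the two mean unsquared nearest-neighbour distances, purely as an assumption for illustration.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both ways."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)


assembled = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
offset = [(x, y, z + 0.5) for x, y, z in assembled]   # same shape, shifted in z

cd_same = chamfer_distance(assembled, assembled)
cd_offset = chamfer_distance(assembled, offset)
print(cd_same, cd_offset)
```

This brute-force version is quadratic in the number of points; it is meant to show the definition, not to scale to full fragment point clouds.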
[91] From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
Jiajie Zhang,Sören Schwertfeger,Alexander Kleiner
Main category: cs.CV
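TL;DR: The paper proposes an unsupervised framework that turns unlabeled human demonstrations in industrial video streams into structured data for pre-training Vision-Language-Action (VLA) models. A lightweight motion tokenizer and an unsupervised segmenter based on "Latent Action Energy" discover and segment semantically coherent action primitives.
Details
Motivation: Industrial environments contain large amounts of unlabeled human demonstration video, but the data is unstructured and hard to use directly for VLA pre-training. The paper aims to provide an automated way to extract action primitives from continuous video.
Contribution: 1. An unsupervised framework that discovers and segments semantically coherent action primitives in industrial video. 2. The "Latent Action Energy" metric for unsupervised action segmentation. 3. The first end-to-end automated system that generates structured data for VLA pre-training.
Method: The method has two steps: first train a lightweight motion tokenizer to encode motion dynamics, then use the Latent Action Energy metric for unsupervised action segmentation, outputting segmented video clips and their corresponding latent action sequences.
Result: Experiments on public benchmarks and a proprietary dataset show that the method effectively segments key tasks performed by humans at workstations. Further clustering and evaluation with a vision-language model confirm the semantic coherence of the action primitives.
Insight: The method offers a scalable solution for VLA pre-training in industrial settings and fills the gap in automated extraction of action data from unstructured video.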
Abstract: We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel “Latent Action Energy” metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.
[92] EvRainDrop: HyperGraph-guided Completion for Effective Frame and Event Stream Aggregation
Futian Wang,Fan Zhang,Xiao Wang,Mengqi Wang,Dexing Huang,Jin Tang
Main category: cs.CV
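TL;DR: EvRainDrop uses a hypergraph-guided event stream completion mechanism to address the spatial sparsity of event camera data and incorporates RGB information for multi-modal feature fusion.
Details
Motivation: Event streams from event cameras are spatially sparse but temporally dense, and existing methods handle spatial sparsity poorly. A new mechanism is needed to complete sparse events and fuse multi-modal information.
Contribution: Proposes a hypergraph-based spatio-temporal event stream completion mechanism that supports fusing multi-modal information (events and RGB) and aggregates temporal information through self-attention.
Method: Hypergraphs connect event tokens across different times and locations, and contextual message passing completes the sparse events. RGB tokens are added as hypergraph nodes, and self-attention fuses the multi-modal features.
Result: The method is validated on single-label and multi-label event classification tasks.
Insight: Hypergraph structures can capture the complex spatio-temporal relationships in event streams, and adding multi-modal information further improves performance.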
Abstract: Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning algorithms typically use event frames, voxels, or tensors as input. Although these approaches have achieved notable progress, they struggle to address the undersampling problem caused by spatial sparsity. In this paper, we propose a novel hypergraph-guided spatio-temporal event stream completion mechanism, which connects event tokens across different times and spatial locations via hypergraphs and leverages contextual information message passing to complete these sparse events. The proposed method can flexibly incorporate RGB tokens as nodes in the hypergraph within this completion framework, enabling multi-modal hypergraph-based information completion. Subsequently, we aggregate hypergraph node information across different time steps through self-attention, enabling effective learning and fusion of multi-modal features. Extensive experiments on both single- and multi-label event classification tasks fully validated the effectiveness of our proposed framework. The source code of this paper will be released on https://github.com/Event-AHU/EvRainDrop.
[93] MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices
Shuai Zhang,Bao Tang,Siyuan Yu,Yueting Zhu,Jingfeng Yao,Ya Zou,Shanglin Yuan,Li Yu,Wenyu Liu,Xinggang Wang
Main category: cs.CV
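TL;DR: MobileI2V is a lightweight diffusion model (270M parameters) for real-time, high-resolution image-to-video (I2V) generation on mobile devices. A linear hybrid architecture, time-step distillation, and mobile-specific optimizations substantially improve generation speed while preserving quality.
Details
Motivation: Current diffusion models are computationally heavy and slow for high-resolution video generation, which makes real-time processing on resource-constrained mobile devices difficult. MobileI2V aims to solve this problem.
Contribution: 1. A linear hybrid architecture that balances efficiency and quality; 2. time-step distillation that compresses sampling from more than 20 steps to 2; 3. mobile-specific attention optimizations that give a 2x speed-up for attention operations.
Method: 1. Compare the performance of linear and softmax attention modules on mobile devices; 2. design a time-step distillation strategy to compress the sampling steps; 3. apply mobile-specific optimizations (such as faster attention operations).
Result: Enables fast 720p I2V generation on mobile devices for the first time, with less than 100 ms per frame under one-step conditions and quality comparable to existing models.
Insight: Lightweight design and an efficient sampling strategy are key to high-quality real-time video generation on mobile devices.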
Abstract: Recently, video generation has witnessed rapid advancements, drawing increasing attention to image-to-video (I2V) synthesis on mobile devices. However, the substantial computational complexity and slow generation speed of diffusion models pose significant challenges for real-time, high-resolution video generation on resource-constrained mobile devices. In this work, we propose MobileI2V, a 270M lightweight diffusion model for real-time image-to-video generation on mobile devices. The core lies in: (1) We analyzed the performance of linear attention modules and softmax attention modules on mobile devices, and proposed a linear hybrid architecture denoiser that balances generation efficiency and quality. (2) We design a time-step distillation strategy that compresses the I2V sampling steps from more than 20 to only two without significant quality loss, resulting in a 10-fold increase in generation speed. (3) We apply mobile-specific attention optimizations that yield a 2-fold speed-up for attention operations during on-device inference. MobileI2V enables, for the first time, fast 720p image-to-video generation on mobile devices, with quality comparable to existing models. Under one-step conditions, the generation speed of each frame of 720p video is less than 100 ms. Our code is available at: https://github.com/hustvl/MobileI2V.
[94] Frequency-Aware Token Reduction for Efficient Vision Transformer
Dong-Jae Lee,Jiwan Hur,Jaehyun Choi,Jaemyung Yu,Junmo Kim
Main category: cs.CV
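TL;DR: The paper proposes a frequency-aware token reduction strategy that distinguishes high-frequency from low-frequency tokens to improve the computational efficiency of Vision Transformers while preserving performance.
Details
Motivation: Vision Transformers face a computational challenge because cost grows quadratically with token length. Existing token reduction methods overlook the frequency characteristics of self-attention (such as rank collapse and over-smoothing), so a more effective approach is needed.
Contribution: Proposes a frequency-aware token reduction strategy that partitions tokens into high-frequency and low-frequency groups. High-frequency tokens are preserved, and low-frequency tokens are aggregated into a compact direct current (DC) token, improving efficiency and mitigating rank collapse.
Method: Partition tokens by frequency, selectively preserve high-frequency tokens, and aggregate low-frequency tokens into a compact token that retains the essential low-frequency components.
Result: Experiments show that the method significantly improves accuracy while reducing computational overhead and mitigating rank collapse and over-smoothing.
Insight: Frequency is an important dimension in token reduction, and explicitly separating high-frequency and low-frequency tokens helps both efficiency and performance.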
Abstract: Vision Transformers have demonstrated exceptional performance across various computer vision tasks, yet their quadratic computational complexity with respect to token length remains a significant challenge. To address this, token reduction methods have been widely explored. However, existing approaches often overlook the frequency characteristics of self-attention, such as the rank collapsing and over-smoothing phenomena. In this paper, we propose a frequency-aware token reduction strategy that improves computational efficiency while preserving performance by mitigating rank collapsing. Our method partitions tokens into high-frequency tokens and low-frequency tokens. High-frequency tokens are selectively preserved, while low-frequency tokens are aggregated into a compact direct current token to retain essential low-frequency components. Through extensive experiments and analysis, we demonstrate that our approach significantly improves accuracy while reducing computational overhead and mitigating rank collapsing and over-smoothing. Furthermore, we analyze the previous methods, shedding light on their implicit frequency characteristics and limitations.
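The partition-and-aggregate step can be sketched in a few lines. The abstract does not say how high-frequency tokens are identified, so this sketch assumes a simple proxy: the mean of all tokens is treated as the DC component, and a token's high-frequency energy is its squared distance from that mean. The function names and the `keep` parameter are illustrative, and `keep` must be smaller than the number of tokens.

```python
def mean_token(tokens):
    """Element-wise mean of a list of equal-length token vectors."""
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(len(tokens[0]))]


def reduce_tokens(tokens, keep):
    """Keep the `keep` highest-frequency tokens and add one DC token for the rest."""
    dc = mean_token(tokens)
    # High-frequency energy: squared distance of each token from the mean (DC) token.
    energy = [sum((v - m) ** 2 for v, m in zip(t, dc)) for t in tokens]
    order = sorted(range(len(tokens)), key=lambda i: energy[i], reverse=True)
    kept = sorted(order[:keep])                 # preserve original token order
    merged = [tokens[i] for i in order[keep:]]  # low-frequency tokens to aggregate
    return [tokens[i] for i in kept] + [mean_token(merged)]


tokens = [[1.0, 1.0], [1.1, 0.9], [5.0, -3.0], [0.9, 1.1], [-4.0, 6.0]]
reduced = reduce_tokens(tokens, keep=2)
print(reduced)
```

Here five tokens become three: the two outlying tokens survive unchanged, and the three near-identical ones collapse into a single averaged token.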
[95] CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation
Shizhe Sun,Wataru Ohyama
Main category: cs.CV
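TL;DR: CanKD is a cross-attention-based non-local knowledge distillation method. Each pixel of the student feature map dynamically attends to all pixels of the teacher feature map, which improves feature representation learning.
Details
Motivation: Traditional self-attention-based distillation methods align teacher and student feature maps independently and do not fully exploit relationships between pixels. CanKD uses cross-attention to capture pixel-wise relationships more thoroughly and improve knowledge transfer.
Contribution: Proposes CanKD, a cross-attention-based non-local knowledge distillation framework that attends to all pixels of the teacher feature map and markedly improves distillation.
Method: Introduces a cross-attention mechanism so that every pixel in the student feature map dynamically considers all pixels in the teacher feature map, with the gains obtained from an additional loss function only.
Result: On object detection and image segmentation, CanKD outperforms existing feature-based and hybrid distillation methods.
Insight: Cross-attention captures pixel-wise relationships more thoroughly and offers an effective new paradigm for non-local knowledge distillation.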
Abstract: We propose Cross-Attention-based Non-local Knowledge Distillation (CanKD), a novel feature-based knowledge distillation framework that leverages cross-attention mechanisms to enhance the knowledge transfer process. Unlike traditional self-attention-based distillation methods that align teacher and student feature maps independently, CanKD enables each pixel in the student feature map to dynamically consider all pixels in the teacher feature map. This non-local knowledge transfer more thoroughly captures pixel-wise relationships, improving feature representation learning. Our method introduces only an additional loss function to achieve superior performance compared with existing attention-guided distillation methods. Extensive experiments on object detection and image segmentation tasks demonstrate that CanKD outperforms state-of-the-art feature and hybrid distillation methods. These experimental results highlight CanKD’s potential as a new paradigm for attention-guided distillation in computer vision tasks. Code is available at https://github.com/tori-hotaru/CanKD
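The core operation, each student pixel attending over all teacher pixels, can be sketched with plain scaled dot-product attention. This is a simplified illustration under stated assumptions: feature maps are flattened to lists of per-pixel channel vectors, no learned projections are used, and the mean-squared-error loss is a placeholder, since the abstract does not specify the exact loss.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(student, teacher):
    """Each student pixel (query) attends over all teacher pixels (keys/values)."""
    scale = math.sqrt(len(student[0]))
    out = []
    for q in student:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in teacher])
        out.append([sum(w * k[d] for w, k in zip(weights, teacher))
                    for d in range(len(teacher[0]))])
    return out


def distillation_loss(student, teacher):
    """Assumed loss: MSE between attended student features and teacher features."""
    attended = cross_attend(student, teacher)
    sq = [(a - t) ** 2 for ar, tr in zip(attended, teacher) for a, t in zip(ar, tr)]
    return sum(sq) / len(sq)


student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 pixels, 2 channels
teacher = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
attended = cross_attend(student, teacher)
print(attended)
print(distillation_loss(student, teacher))
```

Every attended vector is a convex combination of teacher pixels, so each student location receives information from the whole teacher map rather than only from its spatially aligned counterpart.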
[96] Self-Paced Learning for Images of Antinuclear Antibodies
Yiyang Jiang,Guangwu Qian,Jiaxin Wu,Qi Huang,Qing Li,Yongkang Wu,Xiao-Yong Wei
Main category: cs.CV
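TL;DR: Proposes a new framework for antinuclear antibody (ANA) detection that uses self-paced learning to handle the complexity of multi-instance, multi-label (MIML) tasks, with clear gains in detection performance.
Details
Motivation: ANA testing is a key method for diagnosing autoimmune disorders, but manual detection is slow and labor-intensive. Machine learning and deep learning enable automation, yet real-world clinical settings pose unique challenges because the task involves MIML learning.
Contribution: 1. An ANA detection framework that needs no manual preprocessing; 2. Self-paced learning adapted to MIML tasks; 3. Experiments that confirm the framework's advantage, with gains on metrics such as F1-Macro and mAP.
Method: The framework has three task-specific components: an instance sampler, a probabilistic pseudo-label dispatcher, and self-paced weight learning rate coefficients. The sampler suppresses low-confidence instances, the dispatcher assigns labels adaptively, and self-paced learning adjusts training according to empirical label observations.
Result: On the ANA dataset, the model gains up to +7.0% F1-Macro and +12.6% mAP over the best prior method. On public datasets, it reduces Hamming loss and one-error by up to 18.2% and 26.9%, respectively.
Insight: 1. Self-paced learning can handle the complexity of MIML tasks; 2. A design without manual preprocessing fits real clinical settings better; 3. The framework outperforms traditional methods, which shows the potential of end-to-end optimization.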
Abstract: Antinuclear antibody (ANA) testing is a crucial method for diagnosing autoimmune disorders, including lupus, Sjögren’s syndrome, and scleroderma. Despite its importance, manual ANA detection is slow, labor-intensive, and demands years of training. ANA detection is complicated by over 100 coexisting antibody types, resulting in vast fluorescent pattern combinations. Although machine learning and deep learning have enabled automation, ANA detection in real-world clinical settings presents unique challenges as it involves multi-instance, multi-label (MIML) learning. In this paper, a novel framework for ANA detection is proposed that handles the complexities of MIML tasks using unaltered microscope images without manual preprocessing. Inspired by human labeling logic, it identifies consistent ANA sub-regions and assigns aggregated labels accordingly. These steps are implemented using three task-specific components: an instance sampler, a probabilistic pseudo-label dispatcher, and self-paced weight learning rate coefficients. The instance sampler suppresses low-confidence instances by modeling pattern confidence, while the dispatcher adaptively assigns labels based on instance distinguishability. Self-paced learning adjusts training according to empirical label observations. Our framework overcomes limitations of traditional MIML methods and supports end-to-end optimization. Extensive experiments on one ANA dataset and three public medical MIML benchmarks demonstrate the superiority of our framework. On the ANA dataset, our model achieves up to +7.0% F1-Macro and +12.6% mAP gains over the best prior method, setting new state-of-the-art results. It also ranks top-2 across all key metrics on public datasets, reducing Hamming loss and one-error by up to 18.2% and 26.9%, respectively. The source code can be accessed at https://github.com/fletcherjiang/ANA-SelfPacedLearning.
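For readers unfamiliar with self-paced learning, the classic hard-threshold form is easy to state in code: samples whose current loss is below a pace threshold are included, harder ones are deferred, and the threshold grows as training proceeds. The sketch below shows only that generic scheme. It is not the paper's "self-paced weight learning rate coefficients", whose definition the abstract does not give; the function names and the `threshold` parameter are assumptions.

```python
def self_paced_weights(losses, threshold):
    """Generic hard self-paced weighting: include a sample (weight 1.0)
    only if its current loss is below the pace threshold."""
    return [1.0 if loss < threshold else 0.0 for loss in losses]

def weighted_mean_loss(losses, weights):
    """Average the loss over the currently selected samples."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(l * w for l, w in zip(losses, weights)) / total
```

Raising `threshold` over epochs gradually admits harder samples, which is the "easy first" curriculum that gives the technique its name.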
[97] EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?
Pierre Adorni,Minh-Tan Pham,Stéphane May,Sébastien Lefèvre
Main category: cs.CV
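TL;DR: Proposes an efficient way to build Remote Sensing Foundation Models (RSFMs): an Ensemble-of-Specialists made of lightweight, task-specific ConvNeXtV2 models, which addresses the compute and sustainability limits of conventional large-scale models.
Details
Motivation: Existing foundation models (for example in natural language processing and computer vision) rely on scaling model size and dataset volume, which leads to unsustainable compute demands and carbon emissions and limits accessibility. The paper aims to give remote sensing an efficient, environmentally responsible, and extensible alternative.
Contribution: 1. Proposes the EoS-FM framework, which extracts features efficiently by ensembling task-specific specialists; 2. Shows the approach's advantages in efficiency, interpretability, and extensibility; 3. Supports federated training, pruning, and continuous specialist integration, which suits collaborative and resource-constrained settings.
Method: Training is decomposed into lightweight, task-specific ConvNeXtV2 specialists, which are then frozen and reused for efficient feature extraction and task generalization.
Result: EoS-FM offers an efficient and sustainable direction for building RSFMs, with support for modular and collaborative training.
Insight: Modular ensembling of specialists can reduce compute requirements while improving generalization and flexibility, which suits resource-constrained and environmentally sensitive settings.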
Abstract: Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now emerging in the Earth Observation community. These models aim to generalize across tasks with limited supervision, reducing the need for training separate models for each task. However, current strategies, which largely focus on scaling model size and dataset volume, require prohibitive computational and data resources, limiting accessibility to only a few large institutions. Moreover, this paradigm of ever-larger models stands in stark contrast with the principles of sustainable and environmentally responsible AI, as it leads to immense carbon footprints and resource inefficiency. In this work, we present a novel and efficient alternative: an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs). Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused. This modular approach offers strong advantages in efficiency, interpretability, and extensibility. Moreover, it naturally supports federated training, pruning, and continuous specialist integration, making it particularly well-suited for collaborative and resource-constrained settings. Our framework sets a new direction for building scalable and efficient RSFMs.
[98] Video Generation Models Are Good Latent Reward Models
Xiaoyue Mi,Wenqing Yu,Jiesong Lian,Shibo Jie,Ruizhe Zhong,Zijun Liu,Guozhen Zhang,Zixiang Zhou,Zhiyong Xu,Yuan Zhou,Qinglin Lu,Fan Tang
Main category: cs.CV
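TL;DR: Proposes PRFL, a framework that uses pre-trained video generation models as latent reward models and performs preference optimization directly in latent space. This avoids costly VAE decoding and makes alignment with human preferences more efficient.
Details
Motivation: Existing video reward models rely on vision-language models designed for pixel space, which brings high compute cost and memory overhead and confines optimization to late denoising steps. The paper explores more efficient reward learning in latent space.
Contribution: Proposes PRFL and shows that pre-trained video generation models are well suited to latent reward modeling, with preference optimization done directly in latent space and substantially lower compute and memory requirements.
Method: A pre-trained video generation model processes noisy latent representations, and gradients are backpropagated through the full denoising chain without VAE decoding.
Result: Experiments show that PRFL aligns better with human preferences while substantially reducing memory consumption and training time.
Insight: The sequential modeling capability of video generation models makes them naturally suited to latent reward modeling, which opens a new route to efficient reward learning.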
Abstract: Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.
[99] Multimodal Robust Prompt Distillation for 3D Point Cloud Models
Xiang Gu,Liming Lu,Xu Zheng,Anan Du,Yongbin Zhou,Shuchao Pang
Main category: cs.CV
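TL;DR: Proposes MRPD, a teacher-student framework that improves the robustness of 3D point cloud models through multimodal knowledge distillation, with no extra computational cost at inference.
Details
Motivation: Existing defenses for 3D point cloud models suffer from high computational overhead and poor generalization, especially against diverse attack types. The paper addresses both problems through multimodal knowledge distillation.
Contribution: Proposes the MRPD framework, which distills lightweight prompts from three teachers of different modalities (a vision model processing depth projections, a high-performance 3D model, and a text encoder) and uses a confidence-gated mechanism to balance the modalities dynamically.
Method: A teacher-student framework with multimodal alignment and confidence gating. Distillation happens entirely during training, so inference incurs no additional cost.
Result: Experiments show that MRPD outperforms existing defenses under both white-box and black-box attacks and also performs better on clean data.
Insight: By integrating multimodal knowledge efficiently, MRPD offers a practical new paradigm for building robust 3D vision systems.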
Abstract: Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applications. Existing defense methods often suffer from (1) high computational overhead and (2) poor generalization ability across diverse attack types. To bridge these gaps, we propose a novel yet efficient teacher-student framework, namely Multimodal Robust Prompt Distillation (MRPD), for distilling robust 3D point cloud models. It learns lightweight prompts by aligning the student point cloud model's features with robust embeddings from three distinct teachers: a vision model processing depth projections, a high-performance 3D model, and a text encoder. To ensure a reliable knowledge transfer, this distillation is guided by a confidence-gated mechanism which dynamically balances the contribution of all input modalities. Notably, since the distillation happens entirely during the training stage, there is no additional computational cost at inference. Extensive experiments demonstrate that MRPD substantially outperforms state-of-the-art defense methods against a wide range of white-box and black-box attacks, while even achieving better performance on clean data. Our work presents a new, practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge.
[100] Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
Teng Hu,Zhentao Yu,Guozhen Zhang,Zihan Su,Zhengguang Zhou,Youliang Zhang,Yuan Zhou,Qinglin Lu,Ran Yi
Main category: cs.CV
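TL;DR: Proposes Harmony, a framework that addresses alignment in joint audio-video generation through cross-task synergy training, a global-local decoupled interaction module, and synchronization-enhanced CFG, with marked gains in synchronization and generation quality.
Details
Motivation: Current open-source models struggle with alignment in synchronized audio-video generation. The paper traces this to three fundamental challenges of the joint diffusion process: correspondence drift, inefficient global attention mechanisms, and the intra-modal bias of conventional CFG.
Contribution: 1. A Cross-Task Synergy training paradigm that mitigates drift by using strong supervisory signals from audio-driven video and video-driven audio generation tasks; 2. A Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment; 3. Synchronization-Enhanced CFG (SyncCFG), which explicitly isolates and amplifies the alignment signal.
Method: 1. Cross-Task Synergy training paradigm; 2. Global-Local Decoupled Interaction Module; 3. SyncCFG inference strategy.
Result: Harmony significantly outperforms existing methods in generation fidelity and fine-grained audio-visual synchronization, establishing a new state of the art.
Insight: Cross-task synergy and decoupled interaction capture cross-modal alignment signals more efficiently, and SyncCFG further strengthens alignment at inference time.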
Abstract: The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
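As background for challenge (3) in this abstract, conventional Classifier-Free Guidance extrapolates from the unconditional noise prediction toward the conditional one. The sketch below shows only that standard combination rule; how SyncCFG isolates the alignment signal is not described in the abstract and is not reproduced here. The function name `cfg_combine` is illustrative.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance on flat noise predictions:
    eps = eps_uncond + scale * (eps_cond - eps_uncond).

    scale = 0 gives the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 amplifies the effect of the condition.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Because the guided direction is simply "conditional minus unconditional", it strengthens whatever the condition changes, which is the behavior the abstract characterizes as enhancing conditionality rather than cross-modal synchronization.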
[101] MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training
Haotian Xue,Qi Chen,Zhonghao Wang,Xun Huang,Eli Shechtman,Jinrong Xie,Yongxin Chen
Main category: cs.CV
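TL;DR: MoGAN is a motion-centric adversarial post-training method that substantially improves the motion quality of video diffusion models while preserving visual fidelity and efficiency.
Details
Motivation: Video diffusion models achieve strong frame-level fidelity but fall short on motion coherence, dynamics, and realism, producing jitter, ghosting, or implausible dynamics. The standard denoising MSE objective provides no direct supervision on temporal consistency.
Contribution: Proposes MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data, and designs a DiT-based optical-flow discriminator together with a distribution-matching regularizer.
Method: Starting from a 3-step distilled video diffusion model, a DiT-based optical-flow discriminator is trained to distinguish real from generated motion, combined with a distribution-matching regularizer that preserves visual fidelity.
Result: On VBench and VideoJAM-Bench, MoGAN improves the motion score by +7.3% and +7.4% over the 50-step teacher, with comparable or better aesthetic and image-quality scores. A human study also confirms that its motion quality is preferred.
Insight: By optimizing motion quality directly through adversarial training, MoGAN avoids the complex designs of conventional approaches and offers an efficient, practical path to high-quality video generation.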
Abstract: Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, allowing models to achieve low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Built atop a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity. With experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN boosts motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves motion score by +7.4% over the teacher and +8.8% over DMD, while maintaining comparable or even better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage is: https://xavihart.github.io/mogan.
[102] ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
M. Naseer Subhani
Main category: cs.CV
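TL;DR: Proposes ReSAM, a self-prompting, point-supervised segmentation framework that adapts the Segment Anything Model (SAM) to remote sensing images through a Refine-Requery-Reinforce loop, using only sparse point annotations.
Details
Motivation: SAM performs well on natural images but suboptimally on remote sensing imagery (RSI) because of domain shift and the scarcity of dense annotations.
Contribution: Proposes the self-prompting, point-supervised framework ReSAM, whose Refine-Requery-Reinforce loop improves SAM's segmentation quality and domain robustness on RSI.
Method: A loop of Refine (generate coarse pseudo-masks), Requery (improve them with self-constructed box prompts), and Reinforce (align embeddings to reduce confirmation bias).
Result: On the WHU, HRSID, and NWPU VHR-10 datasets, the method surpasses pretrained SAM and other point-supervised segmentation methods.
Insight: Self-prompting and semantic alignment provide an efficient path to scalable, point-level adaptation of foundation segmentation models.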
Abstract: Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting, point-supervised framework that adapts SAM to RSIs using only sparse point annotations. Our method employs a Refine-Requery-Reinforce loop, where coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Without relying on full-mask supervision, our approach progressively enhances SAM’s segmentation quality and domain robustness through self-guided prompt adaptation. We evaluate our proposed method on three RSI benchmark datasets, including WHU, HRSID, and NWPU VHR-10, showing that our method consistently surpasses pretrained SAM and recent point-supervised segmentation methods. Our results demonstrate that self-prompting and semantic alignment provide an efficient path towards scalable, point-level adaptation of foundation segmentation models for remote sensing applications.
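One plausible reading of the Requery step is that a coarse pseudo-mask is turned into a box prompt and fed back to the segmenter. The abstract does not say how ReSAM constructs its boxes, so the sketch below assumes the simplest option, the tight bounding box of the mask's foreground pixels. The function name `mask_to_box` and the `(x_min, y_min, x_max, y_max)` convention are assumptions for illustration.

```python
def mask_to_box(mask):
    """Return the tight bounding box (x_min, y_min, x_max, y_max) of the
    nonzero pixels in a 2D binary mask, or None if the mask is empty.

    Assumption: a box prompt is derived as the tight box around the
    coarse pseudo-mask; the paper may construct it differently.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs), max(ys))
```

In a full loop, the returned box would be passed to the promptable segmenter in place of, or alongside, the original point prompt.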
[103] Qwen3-VL Technical Report
Shuai Bai,Yuxuan Cai,Ruizhe Chen,Keqin Chen,Xionghui Chen,Zesen Cheng,Lianghao Deng,Wei Ding,Chang Gao,Chunjiang Ge,Wenbin Ge,Zhifang Guo,Qidong Huang,Jie Huang,Fei Huang,Binyuan Hui,Shutong Jiang,Zhaohai Li,Mingsheng Li,Mei Li,Kaixin Li,Zicheng Lin,Junyang Lin,Xuejing Liu,Jiawei Liu,Chenglong Liu,Yang Liu,Dayiheng Liu,Shixuan Liu,Dunjie Lu,Ruilin Luo,Chenxu Lv,Rui Men,Lingchen Meng,Xuancheng Ren,Xingzhang Ren,Sibo Song,Yuchong Sun,Jun Tang,Jianhong Tu,Jianqiang Wan,Peng Wang,Pengfei Wang,Qiuyue Wang,Yuxuan Wang,Tianbao Xie,Yiheng Xu,Haiyang Xu,Jin Xu,Zhibo Yang,Mingkun Yang,Jianxin Yang,An Yang,Bowen Yu,Fei Zhang,Hang Zhang,Xi Zhang,Bo Zheng,Humen Zhong,Jingren Zhou,Fan Zhou,Jing Zhou,Yuanzhi Zhu,Ke Zhu
Main category: cs.CV
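TL;DR: Qwen3-VL is the most capable vision-language model in the Qwen series to date, with strong results on multimodal tasks. It supports interleaved inputs of up to 256K tokens (text, images, and video) and comes in both dense and mixture-of-experts (MoE) architectures.
Details
Motivation: Demand for multimodal tasks is growing and calls for models with stronger long-context and complex reasoning support. Qwen3-VL aims to improve pure-text understanding, long-context modeling, and multimodal reasoning.
Contribution: 1) Stronger pure-text understanding and multimodal reasoning; 2) Long-context support of 256K tokens; 3) Architectural upgrades (interleaved-MRoPE, DeepStack, and text-based time alignment).
Method: 1) An enhanced interleaved-MRoPE for spatial-temporal modeling; 2) DeepStack, which integrates multi-level ViT features; 3) Text-based time alignment for video tasks.
Result: Leading performance on benchmarks such as MMMU and MathVista, with both dense and MoE architectures available.
Insight: Qwen3-VL shows the potential of multimodal models on long-context and complex reasoning tasks and provides a foundational engine for image-grounded reasoning and agentic decision-making.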
Abstract: We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
[104] CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow
Ruisheng Han,Kanglei Zhou,Shuang Chen,Amir Atapour-Abarghouei,Hubert P. H. Shum
Main category: cs.CV
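TL;DR: CaFlow is a unified framework that combines counterfactual de-confounding with bidirectional time-conditioned flow to improve long-term action quality assessment, using self-supervision and a cycle-consistency constraint to obtain more robust and coherent representations.
Details
Motivation: Long-term action quality assessment (AQA) must model extended temporal dynamics while staying robust to contextual confounders. Existing methods depend on costly annotations or unidirectional temporal modeling and are vulnerable to spurious correlations.
Contribution: Proposes the CaFlow framework with a Causal Counterfactual Regularization (CCR) module and a BiT-Flow module, which address confounding through counterfactual interventions and temporal representation through bidirectional dynamics modeling, respectively.
Method: The CCR module disentangles causal and confounding features in a self-supervised manner. The BiT-Flow module models temporal dynamics with forward and backward flows under a cycle-consistency constraint.
Result: State-of-the-art performance on multiple long-term AQA benchmarks.
Insight: Combining counterfactual intervention with bidirectional temporal modeling improves the robustness and coherence of long-term AQA.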
Abstract: Action Quality Assessment (AQA) predicts fine-grained execution scores from action videos and is widely applied in sports, rehabilitation, and skill evaluation. Long-term AQA, as in figure skating or rhythmic gymnastics, is especially challenging since it requires modeling extended temporal dynamics while remaining robust to contextual confounders. Existing approaches either depend on costly annotations or rely on unidirectional temporal modeling, making them vulnerable to spurious correlations and unstable long-term representations. To this end, we propose CaFlow, a unified framework that integrates counterfactual de-confounding with bidirectional time-conditioned flow. The Causal Counterfactual Regularization (CCR) module disentangles causal and confounding features in a self-supervised manner and enforces causal robustness through counterfactual interventions, while the BiT-Flow module models forward and backward dynamics with a cycle-consistency constraint to produce smoother and more coherent representations. Extensive experiments on multiple long-term AQA benchmarks demonstrate that CaFlow achieves state-of-the-art performance. Code is available at https://github.com/Harrison21/CaFlow
[105] Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
Tianyi Xiong,Yi Ge,Ming Li,Zuolong Zhang,Pranav Kulkarni,Kaishen Wang,Qi He,Zeying Zhu,Chenxi Liu,Ruibo Chen,Tong Zheng,Yanshuo Chen,Xiyao Wang,Renrui Zhang,Wenhu Chen,Heng Huang
Main category: cs.CV
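TL;DR: Multi-Crit is a new benchmark for evaluating how well multimodal judge models follow diverse, fine-grained evaluation criteria. The study finds that existing models fall short in multi-criterion consistency and flexibility and points to directions for improvement.
Details
Motivation: Large multimodal models (LMMs) are widely adopted as judges in multimodal evaluation systems, but their ability to follow diverse, fine-grained evaluation criteria has not been studied thoroughly and needs systematic evaluation.
Contribution: 1) Proposes the Multi-Crit benchmark for assessing how well multimodal judges follow pluralistic criteria; 2) Introduces three new metrics (pluralistic adherence, criterion-switching flexibility, and recognition of criterion-level preference conflicts); 3) Analyzes 25 LMMs comprehensively and exposes the limits of current models.
Method: Multi-Crit is built through a rigorous data curation pipeline covering open-ended generation and verifiable reasoning tasks, with multi-criterion human annotations. The three new metrics are then used to evaluate models systematically.
Result: 1) Proprietary models still struggle to maintain consistent adherence to pluralistic criteria, especially in open-ended evaluation; 2) Open-source models lag further behind in flexibly following diverse criteria; 3) Critic fine-tuning enhances visual grounding but fails to generalize to pluralistic criterion-level judgment.
Insight: Current multimodal models remain clearly deficient at following multiple criteria, so model design and fine-tuning strategies need further work. Multi-Crit lays the foundation for reliable and steerable multimodal AI evaluation.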
Abstract: Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria, especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.
[106] Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models
Naifu Zhang,Wei Tao,Xi Xiao,Qianpu Sun,Yuxin Zheng,Wentao Mo,Peiqiang Wang,Nan Zhang
Main category: cs.CV
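TL;DR: ADVLA is a sparse adversarial attack framework that applies perturbations directly on features projected from the visual encoder into the textual feature space. It sharply degrades downstream action prediction while keeping perturbations low-amplitude and sparse.
Details
Motivation: Existing adversarial attacks require costly end-to-end training and produce noticeable perturbations. ADVLA aims to disrupt Vision-Language-Action (VLA) models efficiently while avoiding these drawbacks.
Contribution: Proposes the ADVLA framework, which achieves an efficient, low-cost, and inconspicuous attack through sparse perturbations in the projection space and attention guidance.
Method: Uses the projection from the visual encoder into the textual feature space and generates perturbations with three strategies: enhancing sensitivity, enforcing sparsity, and concentrating perturbations.
Result: Under an $L_{\infty}=4/255$ constraint, ADVLA modifies less than 10% of the image patches and reaches an attack success rate of nearly 100%, with perturbations concentrated on critical regions.
Insight: Adversarial perturbations in the projection space are more efficient and less conspicuous than conventional approaches, which opens a new direction for research on the security of VLA models.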
Abstract: In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space. ADVLA efficiently disrupts downstream action predictions under low-amplitude constraints, and attention guidance allows the perturbations to be both focused and sparse. We introduce three strategies that enhance sensitivity, enforce sparsity, and concentrate perturbations. Experiments demonstrate that under an $L_{\infty}=4/255$ constraint, ADVLA combined with Top-K masking modifies less than 10% of the patches while achieving an attack success rate of nearly 100%. The perturbations are concentrated on critical regions, remain almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. In summary, ADVLA effectively weakens downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoiding the high training costs and conspicuous perturbations of traditional patch attacks, and demonstrates unique effectiveness and practical value for attacking VLA feature spaces.
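The combination of Top-K masking and an $L_{\infty}$ budget mentioned in this abstract can be illustrated with a single signed-gradient step that touches only the most-attended patches. This is a generic sketch under stated assumptions, not ADVLA itself: the gradients are taken as given (in the paper they would come from a loss on the projected features), the step size of 1/255 is arbitrary, and the names `topk_mask` and `sparse_linf_step` are hypothetical.

```python
def topk_mask(scores, k):
    """Binary mask selecting the k patches with the highest attention score."""
    top = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    return [1 if i in top else 0 for i in range(len(scores))]

def sparse_linf_step(patches, grads, attn, k, eps=4 / 255, step=1 / 255):
    """One signed-gradient step on the top-k attended patches only.

    patches, grads: lists of flattened patches with pixel values in [0, 1].
    The per-pixel change is clipped to the L-infinity budget `eps`.
    """
    mask = topk_mask(attn, k)
    out = []
    for patch, grad, m in zip(patches, grads, mask):
        new_patch = []
        for x, g in zip(patch, grad):
            sign = (g > 0) - (g < 0)              # sign of the gradient
            delta = step * sign * m               # zero outside the mask
            delta = max(-eps, min(eps, delta))    # enforce the L-inf budget
            new_patch.append(min(1.0, max(0.0, x + delta)))
        out.append(new_patch)
    return out
```

Choosing `k` as roughly a tenth of the patch count mirrors the sparsity level the abstract reports, although the exact selection rule used by the authors is not given there.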
[107] Seeing without Pixels: Perception from Camera Trajectories
Zihui Xue,Kristen Grauman,Dima Damen,Andrew Zisserman,Tengda Han
Main category: cs.CV
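TL;DR: The paper is the first to investigate systematically whether video content can be perceived from the camera trajectory alone, without pixels. It trains CamFormer with a contrastive learning framework and shows that camera trajectory is a lightweight, robust, and versatile modality for perceiving video content.
Details
Motivation: Video content perception traditionally relies on pixel information. The paper asks whether the camera trajectory alone can achieve a similar effect, which offers a new perspective on video understanding and a lightweight solution.
Contribution: 1. The first systematic study of perceiving video content from camera trajectories alone; 2. CamFormer, an encoder that embeds camera trajectories into a shared space aligned with natural language; 3. Evidence that camera trajectories are robust and versatile across a range of downstream tasks.
Method: CamFormer is trained with a contrastive learning framework that projects camera trajectories into an embedding space aligned with natural language. It is validated on tasks such as cross-modal alignment, classification, and temporal analysis.
Result: CamFormer performs well on diverse downstream tasks, showing that camera trajectory is an effective signal for perceiving video content and that the representations are robust across camera pose estimation methods.
Insight: The camera trajectory ("how you move") can reveal the video content ("what you are doing" or "observing"), which gives video understanding an efficient method that does not depend on pixels.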
Abstract: Can one perceive a video’s content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, “how you move” can indeed reveal “what you are doing” (egocentric) or “observing” (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
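Aligning trajectory embeddings with language embeddings in a joint space is commonly done with a symmetric InfoNCE objective, as in CLIP-style training. The abstract only says "contrastive learning framework", so treating it as symmetric InfoNCE is an assumption, as are the function names and the temperature of 0.07. The sketch takes already-computed embeddings; the encoders themselves are out of scope.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(traj_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.
    Pair i (trajectory i, caption i) is the positive; all other
    combinations in the batch act as negatives."""
    t = [normalize(v) for v in traj_emb]
    s = [normalize(v) for v in text_emb]
    n = len(t)
    # Cosine-similarity logits between every trajectory and every caption.
    logits = [[sum(a * b for a, b in zip(t[i], s[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum_exp - row[i]   # -log softmax at the positive
        return total / len(rows)

    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    # Average the trajectory-to-text and text-to-trajectory directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))
```

The loss is small when each trajectory is closest to its own caption and large when pairs are mismatched, which is the property that makes retrieval and zero-shot classification possible afterwards.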
[108] Canvas-to-Image: Compositional Image Generation with Multimodal Controls
Yusuf Dalva,Guocheng Gordon Qian,Maya Goldenberg,Tsai-Shien Chen,Kfir Aberman,Sergey Tulyakov,Pinar Yanardag,Kuan-Chieh Jackson Wang
Main category: cs.CV
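TL;DR: Canvas-to-Image is a unified framework that encodes multimodal control signals into a single canvas image, which enables high-quality image generation and strong results under multi-task training.
Details
Motivation: Modern diffusion models excel at generating high-quality, diverse images but remain limited in multimodal control (text prompts, spatial layouts, pose constraints, and so on) and in high-fidelity compositional generation.
Contribution: Proposes the Canvas-to-Image framework, which consolidates multimodal control signals into a single canvas image and optimizes the diffusion model through multi-task training for better control and image generation.
Method: Diverse control signals are encoded into a single canvas image, and a Multi-Task Canvas Training strategy jointly optimizes the diffusion model across tasks.
Result: On multi-person composition, pose-controlled composition, layout-constrained generation, and related tasks, Canvas-to-Image significantly outperforms existing methods in identity preservation and control adherence.
Insight: Unified visual-spatial reasoning and multi-task training integrate multimodal control signals effectively and improve the diversity and control precision of generated images.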
Abstract: While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.
eess.AS [Back]
[109] RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data
Zhisheng Zheng,Xiaohang Sun,Tuan Dinh,Abhishek Yanamandra,Abhinav Jain,Zhu Liu,Sunil Hadap,Vimal Bhat,Manoj Aggarwal,Gerard Medioni,David Harwath
Main category: eess.AS
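TL;DR: RosettaSpeech proposes a zero-shot speech-to-speech translation (S2ST) framework that needs only monolingual speech-text data augmented by machine translation supervision, with no parallel speech data.
Details
Motivation: Existing S2ST depends on scarce parallel speech data, which leads to complex, multi-stage models. RosettaSpeech aims for efficient translation from monolingual data.
Contribution: 1. A zero-shot S2ST framework based on monolingual data; 2. Training that uses text as a bridge, with direct end-to-end speech translation at inference; 3. Leading performance on the CVSS-C test set.
Method: Text serves as an intermediate bridge during training, combined with machine translation supervision. At inference, the model translates speech to speech directly.
Result: On the CVSS-C test set, ASR-BLEU reaches 25.17 for German-to-English and 29.86 for Spanish-to-English, relative gains of over 27% and 14%, respectively.
Insight: By relying on parallel text rather than scarce parallel speech, RosettaSpeech offers a scalable, high-quality S2ST solution for many more languages.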
Abstract: The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech-text data augmented by machine translation supervision. While our method leverages the linguistic knowledge inherent in text-based NMT models, it strictly eliminates the need for parallel speech-to-speech pairs. Our model uniquely uses text as an intermediate bridge during training but functions as a direct, end-to-end speech-to-speech model at inference. This streamlined approach achieves state-of-the-art results on standard benchmarks. For instance, on the CVSS-C test set, RosettaSpeech outperforms leading systems, achieving an ASR-BLEU score of 25.17 for German-to-English and 29.86 for Spanish-to-English, relative gains of over 27% and 14%, respectively. Furthermore, we demonstrate that a single model can deliver strong many-to-one translation performance (FR/ES/DE -> EN). We also provide a foundational analysis of how training data scaling impacts model performance. By prioritizing reliance on abundant parallel text rather than difficult-to-acquire parallel speech, RosettaSpeech offers a scalable path to creating high-quality, speaker-preserving S2ST for a much broader array of languages.
cs.MM [Back]
[110] Prompt-Aware Adaptive Elastic Weight Consolidation for Continual Learning in Medical Vision-Language Models
Ziyuan Gao,Philippe Morel
Main category: cs.MM
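TL;DR: Proposes PA-EWC, a continual learning method that addresses catastrophic forgetting in medical vision-language models through prompt-guided parameter specialization, improving both adaptation to new clinical tasks and retention of prior knowledge.
Details
Motivation: Medical AI systems face catastrophic forgetting in clinical deployment, especially vision-language models that must handle multimodal medical images and clinical terminology. A method is needed that preserves critical knowledge while adapting to new tasks.
Contribution: Proposes PA-EWC, which combines prompt-based parameter categorization, adaptive Fisher Information computation, and gradient stability analysis to protect critical knowledge selectively while adapting dynamically to new tasks.
Method: 1) Categorize model parameters by functional role; 2) Compute adaptive Fisher Information with gradient stability analysis; 3) Use weighted complexity metrics based on medical terminology density.
Result: Across five medical imaging datasets, PA-EWC reduces catastrophic forgetting by up to 17.58% compared with baseline methods, with performance gains of 4.30% on chest X-ray pathology localization and 6.06% on polyp segmentation.
Insight: By adjusting parameter protection and task complexity measures dynamically, PA-EWC offers a new approach to continual learning for medical multimodal models.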
Abstract: Medical AI systems face catastrophic forgetting when deployed in clinical settings, where models must learn new imaging protocols while retaining prior diagnostic capabilities. This challenge is particularly acute for medical vision-language models that must preserve complex cross-modal alignments between medical images and clinical terminology across diverse imaging modalities. We introduce Prompt-Aware Adaptive Elastic Weight Consolidation (PA-EWC), a novel continual learning approach that addresses catastrophic forgetting through prompt-guided parameter specialization. Our method systematically categorizes model parameters based on their functional roles in processing visual-descriptive, spatial-guided, and medical-semantic information, enabling targeted protection of critical knowledge while allowing adaptation to new clinical requirements. PA-EWC incorporates adaptive Fisher Information computation with gradient stability analysis and develops weighted complexity metrics based on medical terminology density. We evaluate our approach across five medical imaging datasets (Kvasir-SEG, ISIC 2018, CheXlocalize, BUSI, CAMUS) representing diverse modalities including endoscopy, dermoscopy, radiography, and ultrasound. Experimental results demonstrate that PA-EWC reduces catastrophic forgetting by up to 17.58% compared to baseline methods, with performance improvements of 4.30% on chest X-ray pathology localization and 6.06% on polyp segmentation.
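PA-EWC builds on Elastic Weight Consolidation, whose core is a quadratic penalty that anchors each parameter to its value after the previous task, weighted by its Fisher Information. The sketch below shows that standard penalty, (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2. The optional `group_scale` argument is a hypothetical placeholder for per-parameter-group protection strengths; how PA-EWC actually derives its prompt-aware weights is not specified in the abstract.

```python
def ewc_penalty(params, anchor, fisher, lam=1.0, group_scale=None):
    """Standard EWC penalty on flat parameter lists.

    params: current parameter values
    anchor: parameter values after the previous task
    fisher: diagonal Fisher Information estimates (importance per parameter)
    group_scale: hypothetical extra per-parameter multipliers, e.g. one value
                 per functional group; defaults to 1.0 everywhere (plain EWC).
    """
    if group_scale is None:
        group_scale = [1.0] * len(params)
    return 0.5 * lam * sum(
        s * f * (p - a) ** 2
        for p, a, f, s in zip(params, anchor, fisher, group_scale)
    )
```

During training on a new task, this penalty would be added to the task loss, so parameters with high Fisher values are pulled back toward their earlier values while unimportant ones remain free to change.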
[111] AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control
Xinyue Guo,Xiaoran Yang,Lipan Zhang,Jianxuan Yang,Zhao Wang,Jian Luan
Main category: cs.MM
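TL;DR: AV-Edit is a multimodal generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos by jointly leveraging visual, audio, and text semantics. It combines a contrastive audio-visual masking autoencoder with a Multimodal Diffusion Transformer and shows strong sound effect editing capability.
Details
Motivation: Existing sound effect editing methods rely mainly on low-level signal processing or coarse text prompts, which limits flexibility and audio quality. AV-Edit addresses these limits with multimodal semantic control for finer-grained editing.
Contribution: 1) The AV-Edit framework for fine-grained sound effect editing conditioned on video content; 2) a contrastive audio-visual masking autoencoder (CAV-MAE-Edit) and a Multimodal Diffusion Transformer (MM-DiT); 3) a dedicated video-based sound editing dataset used as an evaluation benchmark.
Method: 1) CAV-MAE-Edit performs multimodal pre-training to learn aligned cross-modal representations; 2) MM-DiT, trained with a correlation-based feature gating strategy, removes visually irrelevant sounds and generates missing audio elements; 3) visual, audio, and text semantics are used jointly for control.
Result: Experiments show that AV-Edit generates high-quality audio with precise modifications based on visual content, achieving state-of-the-art performance in sound effect editing and competitive performance in audio generation.
Insight: Joint multimodal control (visual + audio + text) is key to high-quality sound effect editing, and cross-modal pre-training together with correlation-based gating improves editing precision and audio quality.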
Abstract: Sound effect editing (modifying audio by adding, removing, or replacing elements) remains constrained by existing approaches that rely solely on low-level signal processing or coarse text prompts, often resulting in limited flexibility and suboptimal audio quality. To address this, we propose AV-Edit, a generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos by jointly leveraging visual, audio, and text semantics. Specifically, the proposed method employs a specially designed contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training, learning aligned cross-modal representations. These representations are then used to train an editorial Multimodal Diffusion Transformer (MM-DiT) capable of removing visually irrelevant sounds and generating missing audio elements consistent with video content through a correlation-based feature gating training strategy. Furthermore, we construct a dedicated video-based sound editing dataset as an evaluation benchmark. Experiments demonstrate that the proposed AV-Edit generates high-quality audio with precise modifications based on visual content, achieving state-of-the-art performance in the field of sound effect editing and exhibiting strong competitiveness in the domain of audio generation.
cs.RO [Back]
[112] AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios
Chenglizhao Chen,Shaofeng Liang,Runwei Guan,Xiaolou Sun,Haocheng Zhao,Haiyun Jiang,Tao Huang,Henghui Ding,Qing-Long Han
Main category: cs.RO
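TL;DR: The paper introduces AerialMind, the first large-scale Referring Multi-Object Tracking (RMOT) benchmark for UAV scenarios, together with the semi-automated annotation framework COALA and a new method, HawkEyeTrack (HETrack), which improves perception in UAV scenarios by collaboratively enhancing vision-language representation learning.
Details
Motivation: Current RMOT research is mostly confined to ground-level scenarios, which limits its ability to capture broad-scale scene context. With their expansive aerial perspective and superior maneuverability, UAVs have become key platforms for Embodied Intelligence, creating demand for intelligent systems that support natural language interaction.
Contribution: 1) AerialMind, the first RMOT benchmark for UAV scenarios; 2) COALA, a semi-automated labeling framework that significantly reduces labor costs while maintaining annotation quality; 3) HETrack, a new method that collaboratively enhances vision-language representation learning.
Method: COALA reduces manual effort through semi-automated, collaborative agent-based labeling; HETrack collaboratively enhances vision-language representation learning to improve perception in UAV scenarios.
Result: Experiments validate the challenging nature of the dataset and the effectiveness of the method.
Insight: The wide-area view of UAVs opens a new setting for RMOT, and collaborative vision-language learning is key to improving perception in UAV scenarios.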
Abstract: Referring Multi-Object Tracking (RMOT) aims to achieve precise object detection and tracking through natural language instructions, representing a fundamental capability for intelligent robotic systems. However, current RMOT research remains mostly confined to ground-level scenarios, which constrains their ability to capture broad-scale scene contexts and perform comprehensive tracking and path planning. In contrast, Unmanned Aerial Vehicles (UAVs) leverage their expansive aerial perspectives and superior maneuverability to enable wide-area surveillance. Moreover, UAVs have emerged as critical platforms for Embodied Intelligence, which has given rise to an unprecedented demand for intelligent aerial systems capable of natural language interaction. To this end, we introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios, which aims to bridge this research gap. To facilitate its construction, we develop an innovative semi-automated collaborative agent-based labeling assistant (COALA) framework that significantly reduces labor costs while maintaining annotation quality. Furthermore, we propose HawkEyeTrack (HETrack), a novel method that collaboratively enhances vision-language representation learning and improves the perception of UAV scenarios. Comprehensive experiments validated the challenging nature of our dataset and the effectiveness of our method.
[113] SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation
Ziyi Chen,Yingnan Guo,Zedong Chu,Minghua Luo,Yanfen Shen,Mingchao Sun,Junjun Hu,Shichao Xie,Kuan Yang,Pei Shi,Zhining Gu,Lu Liu,Honglin Han,Xiaolong Wu,Mu Xu,Yu Zhang
Main category: cs.RO
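TL;DR: SocialNav is a foundation model for socially-aware embodied navigation with a hierarchical "brain-action" architecture. Combined with the large-scale SocNav Dataset and a multi-stage training pipeline, it substantially improves navigation success rate and social compliance.
Details
Motivation: Existing embodied navigation methods fall short in following social norms. A model is needed that understands high-level social norms and generates low-level, compliant trajectories.
Contribution: 1) The SocialNav foundation model; 2) the large-scale SocNav Dataset; 3) a multi-stage training pipeline that includes the SAFE-GRPO reinforcement learning framework.
Method: 1) A hierarchical "brain-action" architecture; 2) imitation learning to inject navigation skills and social understanding; 3) the SAFE-GRPO reinforcement learning framework to refine behavior.
Result: SocialNav achieves +38% success rate and +46% social compliance rate compared with the state-of-the-art method.
Insight: Combining data-driven learning with reinforcement learning effectively improves the social compliance of embodied navigation, and the hierarchical design helps separate high-level social understanding from low-level trajectory generation.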
Abstract: Embodied navigation that adheres to social norms remains an open research challenge. Our SocialNav is a foundational model for socially-aware navigation with a hierarchical “brain-action” architecture, capable of understanding high-level social norms and generating low-level, socially compliant trajectories. To enable such dual capabilities, we construct the SocNav Dataset, a large-scale collection of 7 million samples, comprising (1) a Cognitive Activation Dataset providing social reasoning signals such as chain-of-thought explanations and social traversability prediction, and (2) an Expert Trajectories Pyramid aggregating diverse navigation demonstrations from internet videos, simulated environments, and real-world robots. A multi-stage training pipeline is proposed to gradually inject and refine navigation intelligence: we first inject general navigation skills and social norms understanding into the model via imitation learning, and then refine such skills through a deliberately designed Socially-Aware Flow Exploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly rewards socially compliant behaviors. SocialNav achieves +38% success rate and +46% social compliance rate compared to the state-of-the-art method, demonstrating strong gains in both navigation performance and social compliance. Our project page: https://amap-eai.github.io/SocialNav/
[114] TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos
Seungjae Lee,Yoonkyo Jung,Inkook Chun,Yao-Chih Lee,Zikui Cai,Hongjia Huang,Aayush Talreja,Tan Dat Dao,Yongyuan Liang,Jia-Bin Huang,Furong Huang
Main category: cs.RO
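TL;DR: TraceGen introduces a unified 3D trace-space representation that enables learning from cross-embodiment videos (humans and different robots), addressing the small-data problem. With a large-scale corpus produced by the TraceForge data pipeline, TraceGen adapts efficiently across tasks and embodiments and runs inference far faster than existing video-based world models.
Details
Motivation: Learning new robot tasks from a handful of demonstrations is hard because differences in embodiment, camera viewpoint, and environment hinder the direct use of cross-embodiment videos (humans and different robots). The key question is how to exploit this abundant but heterogeneous video data.
Contribution: 1) A symbolic 3D trace-space representation that abstracts away appearance while retaining geometric structure. 2) TraceForge, a data pipeline that converts heterogeneous videos into consistent 3D traces and yields a large-scale corpus. 3) TraceGen, which adapts efficiently and runs inference quickly given only a few target demonstrations.
Method: 1) Use a 3D trace-space as a unified representation and predict future motion there rather than in pixel space. 2) Use TraceForge to convert human and robot videos into a corpus of observation-trace-language triplets. 3) Pretrain to obtain a 3D motion prior, then adapt quickly to new tasks from a few target demonstrations.
Result: 1) With just five target robot videos, TraceGen attains 80% success across four tasks, with inference 50-600x faster than state-of-the-art video-based world models. 2) With only five uncalibrated human demonstration videos captured on a handheld phone, it still reaches 67.5% success on a real robot.
Insight: TraceGen's key idea is to abstract cross-embodiment motion structure in 3D trace-space, which avoids reliance on object detectors and heavy pixel-space generation and enables efficient cross-embodiment learning and adaptation.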
Abstract: Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments - humans and different robots - are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small-data problem by introducing a unifying, symbolic representation - a compact 3D “trace-space” of scene-level trajectories - that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation-trace-language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50-600x faster inference than state-of-the-art video-based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen’s ability to adapt across embodiments without relying on object detectors or heavy pixel-space generation.
eess.IV [Back]
[115] Adversarial Multi-Task Learning for Liver Tumor Segmentation, Dynamic Enhancement Regression, and Classification
Xiaojiao Xiao,Qinmin Vivian Hu,Tae Hyun Kim,Guanghui Wang
Main category: eess.IV
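TL;DR: MTI-Net is a novel multi-task interaction adversarial learning framework for liver tumor segmentation, dynamic enhancement regression, and classification. It uses Multi-domain Information Entropy Fusion and a task interaction module so that the tasks improve one another.
Details
Motivation: Multi-task learning for liver tumors lacks an effective end-to-end framework, in particular one that captures inter-task relevance and extracts dynamic MRI information effectively.
Contribution: 1) MTI-Net, the first end-to-end framework for these liver tumor tasks; 2) Multi-domain Information Entropy Fusion (MdIEF) and a task interaction module; 3) a task-driven discriminator (TDD); 4) a shallow Transformer for extracting dynamic MRI information.
Method: 1) MdIEF integrates features from the frequency and spectral domains; 2) the task interaction module establishes higher-order consistency between segmentation and regression; 3) the TDD captures high-order relationships between tasks; 4) a shallow Transformer captures relationships within dynamic MRI sequences.
Result: On a dataset of 238 subjects, MTI-Net performs well across multiple tasks, indicating its potential for the clinical assessment of liver tumors.
Insight: By fusing frequency and spectral features with an explicit task interaction design, MTI-Net improves multi-task performance and suggests a new way to handle dynamic information in medical image analysis.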
Abstract: Liver tumor segmentation, dynamic enhancement regression, and classification are critical for clinical assessment and diagnosis. However, no prior work has attempted to achieve these tasks simultaneously in an end-to-end framework, primarily due to the lack of an effective framework that captures inter-task relevance for mutual improvement and the absence of a mechanism to extract dynamic MRI information effectively. To address these challenges, we propose the Multi-Task Interaction adversarial learning Network (MTI-Net), a novel integrated framework designed to tackle these tasks simultaneously. MTI-Net incorporates Multi-domain Information Entropy Fusion (MdIEF), which utilizes entropy-aware, high-frequency spectral information to effectively integrate features from both frequency and spectral domains, enhancing the extraction and utilization of dynamic MRI data. The network also introduces a task interaction module that establishes higher-order consistency between segmentation and regression, thus fostering inter-task synergy and improving overall performance. Additionally, we designed a novel task-driven discriminator (TDD) to capture internal high-order relationships between tasks. For dynamic MRI information extraction, we employ a shallow Transformer network to perform positional encoding, which captures the relationships within dynamic MRI sequences. In experiments on a dataset of 238 subjects, MTI-Net demonstrates high performance across multiple tasks, indicating its strong potential for assisting in the clinical assessment of liver tumors. The code is available at: https://github.com/xiaojiao929/MTI-Net.
[116] Deep Parameter Interpolation for Scalar Conditioning
Chicago Y. Park,Michael T. McCann,Cristina Garcia-Cardona,Brendt Wohlberg,Ulugbek S. Kamilov
Main category: eess.IV
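TL;DR: The paper proposes deep parameter interpolation (DPI), which adds scalar dependence to a neural network by dynamically interpolating between two learnable parameter sets. This removes the architectural restrictions of existing approaches and, in diffusion and flow matching models, improves denoising performance at computational cost comparable to standard scalar conditioning.
Details
Motivation: Recent deep generative models such as diffusion models and flow matching require a single network to process both a high-dimensional vector and a scalar input, which makes architecture design difficult. Existing methods either encode the scalar as an additional input or combine scalar information in specific network components, which restricts architecture choices.
Contribution: Proposes DPI, a general-purpose, architecture-agnostic method that introduces scalar dependence by dynamically interpolating between two parameter sets, improving model performance while keeping computational efficiency comparable to standard scalar conditioning.
Method: DPI maintains two learnable parameter sets within a single network and interpolates between them based on the scalar value during training and sampling, thereby introducing the scalar dependence.
Result: Experiments show that DPI improves denoising performance and sample quality for both diffusion and flow matching models, with computational efficiency comparable to standard scalar conditioning techniques.
Insight: The core idea of DPI is to condition on a scalar through interpolation in parameter space, which gives a flexible and efficient option for network design.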
Abstract: We propose deep parameter interpolation (DPI), a general-purpose method for transforming an existing deep neural network architecture into one that accepts an additional scalar input. Recent deep generative models, including diffusion models and flow matching, employ a single neural network to learn a time- or noise level-dependent vector field. Designing a network architecture to accurately represent this vector field is challenging because the network must integrate information from two different sources: a high-dimensional vector (usually an image) and a scalar. Common approaches either encode the scalar as an additional image input or combine scalar and vector information in specific network components, which restricts architecture choices. Instead, we propose to maintain two learnable parameter sets within a single network and to introduce the scalar dependency by dynamically interpolating between the parameter sets based on the scalar value during training and sampling. DPI is a simple, architecture-agnostic method for adding scalar dependence to a neural network. We demonstrate that our method improves denoising performance and enhances sample quality for both diffusion and flow matching models, while achieving computational efficiency comparable to standard scalar conditioning techniques. Code is available at https://github.com/wustl-cig/parameter_interpolation.
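The parameter-interpolation idea can be illustrated with a minimal sketch. The abstract does not state the interpolation rule, so the linear blend, the assumption that the scalar is normalized to [0, 1], and all names (`interpolate_params`, `linear_layer`, `params_a`, `params_b`) are illustrative choices, not the authors' implementation.

```python
import numpy as np

def interpolate_params(params_a, params_b, t):
    """Blend two parameter sets with identical keys and shapes.

    Assumes t is a scalar already normalized to [0, 1] (an assumption).
    """
    return {k: (1.0 - t) * params_a[k] + t * params_b[k] for k in params_a}

def linear_layer(x, params):
    """A toy layer standing in for an arbitrary network."""
    return x @ params["W"] + params["b"]

rng = np.random.default_rng(0)
params_a = {"W": rng.normal(size=(4, 3)), "b": np.zeros(3)}
params_b = {"W": rng.normal(size=(4, 3)), "b": np.ones(3)}
x = rng.normal(size=(2, 4))

# The same layer code is reused; only the effective weights depend on t.
y_mid = linear_layer(x, interpolate_params(params_a, params_b, 0.5))
```

Because the scalar only changes the effective weights, the layer code itself is untouched, which is what makes this style of conditioning independent of the architecture.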
cs.LG [Back]
[117] ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents Training
Chenliang Li,Adel Elmahdy,Alex Boyd,Zhongruo Wang,Alfredo Garcia,Parminder Bhatia,Taha Kass-Hout,Cao Xiao,Mingyi Hong
Main category: cs.LG
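TL;DR: The paper proposes two stabilization techniques, turn-level importance sampling and clipping-bias correction, which yield the variants ST-PPO and S-PPO. They address the instability of PPO training on multi-turn tasks and improve task performance.
Details
Motivation: PPO is unstable and prone to collapse on multi-turn dialogue and reasoning tasks. The analysis identifies two main causes: token-level importance sampling is misaligned with the natural granularity of multi-turn environments, and advantage estimates from off-policy samples are inaccurate.
Contribution: 1) Turn-level importance sampling; 2) clipping-bias correction; 3) the variants ST-PPO and S-PPO built from these techniques, which stabilize training.
Method: 1) Turn-PPO: turn-level sampling only; 2) S-PPO: clipping-bias correction applied to token-level PPO; 3) ST-PPO: both techniques combined. The experiments focus on ST-PPO and S-PPO.
Result: On multi-turn search benchmarks, ST-PPO and S-PPO prevent performance collapse, maintain lower clipping ratios, and outperform standard token-level PPO.
Insight: Combining turn-level sampling with clipping-bias correction addresses PPO instability in multi-turn environments and offers a scalable solution.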
Abstract: PPO has been widely adopted for training large language models (LLMs) at the token level in multi-turn dialogue and reasoning tasks. However, its performance is often unstable and prone to collapse. Through empirical analysis, we identify two main sources of instability in this setting: (1) token-level importance sampling, which is misaligned with the natural granularity of multi-turn environments that have distinct turn-level stages, and (2) inaccurate advantage estimates from off-policy samples, where the critic has not learned to evaluate certain state-action pairs, resulting in high-variance gradients and unstable updates. To address these challenges, we introduce two complementary stabilization techniques: (1) turn-level importance sampling, which aligns optimization with the natural structure of multi-turn reasoning, and (2) clipping-bias correction, which normalizes gradients by downweighting unreliable, highly off-policy samples. Depending on how these components are combined, we obtain three variants: Turn-PPO (turn-level sampling only), S-PPO (clipping-bias correction applied to token-level PPO), and ST-PPO (turn-level sampling combined with clipping-bias correction). In our experiments, we primarily study ST-PPO and S-PPO, which together demonstrate how the two stabilization mechanisms address complementary sources of instability. Experiments on multi-turn search tasks across general QA, multi-hop QA, and medical multiple-choice QA benchmarks show that ST-PPO and S-PPO consistently prevent the performance collapses observed in large-model training, maintain lower clipping ratios throughout optimization, and achieve higher task performance than standard token-level PPO. These results demonstrate that combining turn-level importance sampling with clipping-bias correction provides a practical and scalable solution for stabilizing multi-turn LLM agent training.
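A rough sketch of how a turn-level importance ratio could differ from the usual token-level one is given below. The abstract does not specify the aggregation, so sharing the length-normalized (geometric-mean) ratio across the tokens of a turn is an assumption, as are the function names. The clipping-bias correction is not shown.

```python
import numpy as np

def turn_level_ratios(logp_new, logp_old, turn_ids):
    """Give every token the geometric-mean probability ratio of its turn.

    The geometric-mean aggregation is an assumption, not the paper's rule.
    """
    log_ratio = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    turn_ids = np.asarray(turn_ids)
    ratios = np.empty_like(log_ratio)
    for t in np.unique(turn_ids):
        mask = turn_ids == t
        ratios[mask] = np.exp(log_ratio[mask].mean())
    return ratios

def clipped_surrogate(ratios, advantages, eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized)."""
    ratios = np.asarray(ratios, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))
```

With token-level PPO each token would carry its own ratio; here all tokens in a turn are clipped or kept together, which matches the turn granularity the abstract describes.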
[118] Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
Liangzu Peng,Aditya Chattopadhyay,Luca Zancato,Elvis Nunez,Wei Xia,Stefano Soatto
Main category: cs.LG
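TL;DR: Gated KalmaNet (GKA) is a layer that addresses the lossy memory of linear state-space models (SSMs) by solving an online ridge regression problem at test time. It keeps SSM-style efficiency and performs well on both short-context and long-context tasks.
Details
Motivation: Linear state-space models are efficient but retain only a fading summary of the past, which hurts performance on recall-oriented tasks.
Contribution: 1) GKA, which uses online ridge regression at test time to account for the full history when predicting the next token; 2) adaptive regularization with input-dependent gating for numerical stability; 3) Chebyshev Iteration for better stability in low-precision settings; 4) a hardware-aware chunk-wise implementation.
Method: 1) Solve an online ridge regression problem at test time; 2) draw on the Kalman Filter while avoiding the numerical instability of its standard equations; 3) use Chebyshev Iteration and adaptive gating; 4) optimize the implementation for hardware.
Result: GKA outperforms SSM layers such as Mamba2 and GLA on short-context tasks and, on long-context tasks up to 128k tokens, achieves more than 10% relative improvement over other fading memory baselines.
Insight: With adaptive gating and Chebyshev Iteration, GKA improves memory retention while staying efficient, making it suitable for large-scale language tasks.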
Abstract: As efficient alternatives to softmax Attention, linear state-space models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance in recall-oriented tasks. We propose Gated KalmaNet (GKA), a layer that reduces this gap by accounting for the full past when predicting the next token, while maintaining SSM-style efficiency. GKA achieves this by solving an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. Drawing inspiration from the Kalman Filter, we iteratively solve the online ridge regression problem. However, a critical insight is that standard Kalman filter equations are numerically unstable in low-precision environments (like bfloat16) and difficult to parallelize on modern hardware. We address both challenges through two key innovations: (1) an adaptive regularization strategy with input-dependent gating that controls the condition number of the ridge regression, ensuring numerical stability while balancing memory retention, and (2) the use of Chebyshev Iteration instead of other conventional iterative solvers, which we demonstrate to be more stable in low-precision settings. To further improve scalability, we develop a hardware-aware chunk-wise implementation of Chebyshev Iteration along with custom kernels for backpropagating through our adaptive regularization and gating mechanisms. Empirically, GKA shows strong language understanding capabilities on short-context tasks, outperforming existing SSM layers (like Mamba2, GLA and Gated DeltaNet). On long-context, GKA excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than 10% relative improvement over other fading memory baselines.
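To make the "online ridge regression with constant memory" idea concrete, here is a minimal sketch under simplifying assumptions: no gating or fading, full precision, and a direct solve via `np.linalg.solve` in place of the Chebyshev Iteration the paper uses. The class name and the key/value/query framing are illustrative, not taken from the paper's code.

```python
import numpy as np

class OnlineRidge:
    """Keeps sufficient statistics whose size does not grow with sequence length."""

    def __init__(self, d_key, d_value, lam=1.0):
        self.A = lam * np.eye(d_key)         # lam * I + sum_t k_t k_t^T
        self.B = np.zeros((d_key, d_value))  # sum_t k_t v_t^T

    def update(self, k, v):
        """Absorb one (key, value) pair in O(d^2) time and memory."""
        self.A += np.outer(k, k)
        self.B += np.outer(k, v)

    def read(self, q):
        """Predict with the ridge solution W = A^{-1} B fitted on all past pairs."""
        W = np.linalg.solve(self.A, self.B)
        return q @ W
```

The state is a pair of fixed-size matrices, so memory is constant in the number of tokens, yet the solution equals the batch ridge regression over the entire history.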
[119] Subjective Depth and Timescale Transformers: Learning Where and When to Compute
Frederico Wieser,Martin Benfeghoul,Haitham Bou Ammar,Jun Wang,Zafeirios Fountas
Main category: cs.LG
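TL;DR: The paper proposes two Transformer architectures with dynamic compute routing, SDT and STT, which use Bayesian surprise signals to decide where and when to compute and thereby reduce computational cost.
Details
Motivation: The uniform allocation of computation in standard Transformers limits their efficiency and scalability for large models and long sequences. SDT and STT allocate compute through dynamic routing instead.
Contribution: 1) The SDT and STT architectures, which target compute efficiency along the depth and temporal dimensions respectively; 2) dynamic routing based on Bayesian surprise signals, which reduces self-attention computation by 75% and KV-cache requirements by 50% within each compute-skipping layer.
Method: SDT alternates Decision layers (which compute a full-block posterior and a lightweight prior) with Dynamic layers (fixed-capacity Top-K routing based on surprise). STT predicts residual updates in the temporal domain and dynamically executes or bypasses TF blocks for each token.
Result: Over training, both SDT and STT show the predicted shift from novelty-driven to prediction-driven gating. They reduce compute requirements and give preliminary insight into the compute-accuracy trade-offs of conditional computation.
Insight: Dynamic compute routing can improve Transformer efficiency, and Bayesian surprise signals provide a principled basis for skipping computation.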
Abstract: The rigid, uniform allocation of computation in standard Transformer (TF) architectures can limit their efficiency and scalability, particularly for large-scale models and long sequences. Addressing this, we introduce Subjective Depth Transformers (SDT) and Subjective Timescale Transformers (STT), two distinct architectures that leverage Bayesian surprise signals to dynamically route computation, learning where and when to compute within decoder-only TFs. SDT augments a decoder-only stack with alternating Decision and Dynamic layers: a Decision layer computes a full block ‘posterior’ and a lightweight ‘prior,’ while a Dynamic layer employs fixed-capacity Top-K routing based on Bayesian surprise (Expected and Unexpected Change), maintaining a static compute graph. STT extends this conditional computation to the temporal domain: a transition network predicts residual updates, forming a temporal ‘change hypothesis’ that informs a router to dynamically execute or bypass TF blocks for each token, managing KV-cache contributions. Both architectures exhibit the predicted shift from novelty-driven to prediction-driven gating over training, suggesting alignment with surprise-based principles. While operating at reduced capacity, they offer preliminary insights into the compute-accuracy trade-offs of conditional computation. The proposed architectures establish a flexible framework for efficiency, reducing self-attention computation by 75% and KV-cache requirements by 50% within each compute skipping layer, setting a pathway for more efficient models.
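Fixed-capacity Top-K routing can be sketched as follows. The surprise score used here (squared distance between a "posterior" and a "prior" prediction per token) is a simplifying assumption; the paper's Expected and Unexpected Change criteria are not detailed in the abstract, and `topk_route` and `block_fn` are hypothetical names.

```python
import numpy as np

def topk_route(hidden, posterior, prior, block_fn, k):
    """Run `block_fn` only on the k tokens with the largest surprise.

    hidden, posterior, prior: arrays of shape (seq_len, d_model).
    Tokens that are not selected pass through unchanged.
    """
    surprise = np.sum((posterior - prior) ** 2, axis=-1)  # one score per token
    selected = np.argsort(-surprise)[:k]                  # fixed capacity k
    out = hidden.copy()
    out[selected] = block_fn(hidden[selected])
    return out, np.sort(selected)
```

Because k is fixed, the amount of work per layer is known in advance, which is consistent with the static compute graph the abstract mentions.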
[120] BanglaMM-Disaster: A Multimodal Transformer-Based Deep Learning Framework for Multiclass Disaster Classification in Bangla
Ariful Islam,Md Rifat Hossen,Md. Mahmudul Arif,Abdullah Al Noman,Md Arifur Rahman
Main category: cs.LG
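TL;DR: The paper presents BanglaMM-Disaster, a multimodal Transformer-based deep learning framework for multiclass disaster classification in Bangla. By combining textual and visual data, the best model reaches 83.76% accuracy on 5,037 annotated social media posts, outperforming unimodal baselines.
Details
Motivation: Bangladesh is frequently hit by natural disasters, so real-time monitoring and quick response systems are essential. Conventional approaches rely on a single modality, which limits classification quality, so multimodal data is needed to improve disaster classification.
Contribution: 1) A new Bangla multimodal disaster dataset of 5,037 posts. 2) A multimodal framework that combines Transformer text encoders (such as BanglaBERT) with CNN backbones (such as ResNet50). 3) Evidence of the benefit of multimodal methods for disaster classification.
Method: An early fusion strategy combines Transformer text encoders (BanglaBERT, mBERT, and others) with CNNs (ResNet50, DenseNet169, and others) to process text and images, trained end to end.
Result: The best model reaches 83.76% accuracy, 3.84% above the best text-only baseline and 16.91% above the image-only baseline, with reduced misclassification across all classes.
Insight: Multimodal methods show clear advantages for disaster classification in low-resource languages such as Bangla, especially for ambiguous examples.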
Abstract: Natural disasters remain a major challenge for Bangladesh, so real-time monitoring and quick response systems are essential. In this study, we present BanglaMM-Disaster, an end-to-end deep learning-based multimodal framework for disaster classification in Bangla, using both textual and visual data from social media. We constructed a new dataset of 5,037 Bangla social media posts, each consisting of a caption and a corresponding image, annotated into one of nine disaster-related categories. The proposed model integrates transformer-based text encoders, including BanglaBERT, mBERT, and XLM-RoBERTa, with CNN backbones such as ResNet50, DenseNet169, and MobileNetV2, to process the two modalities. Using early fusion, the best model achieves 83.76% accuracy. This surpasses the best text-only baseline by 3.84% and the image-only baseline by 16.91%. Our analysis also shows reduced misclassification across all classes, with noticeable improvements for ambiguous examples. This work fills a key gap in Bangla multimodal disaster analysis and demonstrates the benefits of combining multiple data types for real-time disaster response in low-resource settings.
[121] Mechanisms of Non-Monotonic Scaling in Vision Transformers
Anantha Padmanaban Krishna Kumar
Main category: cs.LG
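TL;DR: The paper studies non-monotonic scaling in Vision Transformers (ViT), where deeper models can perform worse than shallower ones. An empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet reveals a three-phase "Cliff-Plateau-Climb" pattern.
Details
Motivation: Deeper ViTs often perform worse than shallower ones, which contradicts common scaling assumptions. The author uses systematic experiments to investigate why.
Contribution: 1) Identifies a three-phase pattern in how ViT representations evolve with depth; 2) observes the progressive marginalization of the [CLS] token; 3) introduces the Information Scrambling Index to quantify patterns of information mixing.
Method: Analyze representation evolution and information mixing in ViT-S, ViT-B, and ViT-L on ImageNet, quantified with the Information Scrambling Index.
Result: The additional layers in deeper ViTs correlate with increased information diffusion rather than improved task performance, suggesting that carefully calibrated depth matters more than simply adding parameters.
Insight: ViT design should pay attention to calibrating depth, and the Information Scrambling Index can serve as a diagnostic for future architectures.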
Abstract: Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. We observe that better performance is associated with progressive marginalization of the [CLS] token, originally designed as a global aggregation hub, in favor of distributed consensus among patch tokens. We quantify patterns of information mixing with an Information Scrambling Index, and show that in ViT-L the information-task tradeoff emerges roughly 10 layers later than in ViT-B, and that these additional layers correlate with increased information diffusion rather than improved task performance. Taken together, these results suggest that transformer architectures in this regime may benefit more from carefully calibrated depth that executes clean phase transitions than from simply increasing parameter count. The Information Scrambling Index provides a useful diagnostic for existing models and suggests a potential design target for future architectures. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb.
cs.SD [Back]
[122] Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale
Yicheng Zhong,Peiji Yang,Zhisheng Wang
Main category: cs.SD
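TL;DR: The paper proposes a multi-reward GRPO framework to address prosodic instability and reduced naturalness in single-codebook TTS LLMs. It combines a length penalty, entropy regularization, and an LLM-annotated prosody alignment reward to improve model behavior.
Details
Motivation: Single-codebook TTS LLMs are efficient and streamable but suffer from unstable prosody, speaker drift, and degraded naturalness. The paper uses multi-reward reinforcement learning to optimize the generation policy.
Contribution: Proposes a multi-reward GRPO framework that adds several rule-based rewards (length penalty, entropy regularization, and prosody alignment) to directly optimize the generation policy of single-codebook TTS LLMs, improving stability and naturalness.
Method: Optimize the generation policy with the multi-reward GRPO framework, using an external reasoning LLM to produce the prosody alignment reward. A flow-matching decoder and a scalability analysis are used to assess how general the gains are.
Result: After multi-reward GRPO optimization, single-codebook TTS LLMs improve in prosodic stability, speaker similarity, and naturalness.
Insight: Multi-reward reinforcement learning can effectively optimize the generation policy of TTS models, especially for prosody and stability, and supervision from an external LLM further improves generation quality.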
Abstract: Recent advances in Large Language Models (LLMs) have transformed text-to-speech (TTS) synthesis, inspiring autoregressive frameworks that represent speech as sequences of discrete codec tokens. Among them, single-codebook TTS LLMs have emerged as compact and streamable architectures that jointly model semantic and acoustic integration. However, despite their efficiency, these models often exhibit unstable prosody, speaker drift, and degraded naturalness. To address these issues, we propose a multi-reward Group Relative Policy Optimization (GRPO) framework that directly optimizes the token generation policy of single-codebook TTS LLMs. Beyond standard intelligibility and speaker similarity objectives, our design integrates three rule-based rewards: a length penalty for duration consistency, an entropy regularization reward for decoding stability, and an LLM-annotated prosody alignment reward that explicitly supervises rhythm. In this prosody reward, an external reasoning LLM predicts multiple plausible pause structures via in-context learning, providing a human-preference-aligned supervisory signal for GRPO training. To assess universality, we further attach a flow-matching (FM) decoder on top of the GRPO-optimized AR backbone and observe consistent additional gains, indicating that our reinforcement optimization enhances the intrinsic AR policy. We further conduct a scalability analysis across data sizes and model scales, revealing that the proposed method consistently enhances prosodic stability, speaker similarity, and overall speech naturalness in single-codebook TTS LLMs.
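The group-relative step of GRPO with several reward terms can be sketched as below. The weighted-sum combination and the weight values are assumptions (the abstract does not say how the rewards are merged); the normalization by group mean and standard deviation is the usual GRPO formulation.

```python
import numpy as np

def combine_rewards(reward_terms, weights):
    """Weighted sum of per-sample reward terms.

    reward_terms: dict mapping a reward name to a list of per-sample values.
    weights: dict mapping the same names to hypothetical scalar weights.
    """
    return sum(weights[name] * np.asarray(vals, dtype=float)
               for name, vals in reward_terms.items())

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

For one input text, several candidate utterances would be sampled, each scored on terms such as intelligibility, speaker similarity, length, entropy, and prosody, and the normalized total would weight that candidate's policy-gradient update.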
cs.AI [Back]
[123] ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
Qineng Wang,Wenlong Huang,Yu Zhou,Hang Yin,Tianwei Bao,Jianwen Lyu,Weiyu Liu,Ruohan Zhang,Jiajun Wu,Li Fei-Fei,Manling Li
Main category: cs.AI
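TL;DR: ENACT is a benchmark for evaluating embodied cognition. In a visual question answering (VQA) format, it tests whether vision-language models (VLMs) can model the world from egocentric interaction. Results show a gap between frontier models and humans, especially over long interaction horizons.
Details
Motivation: Modern VLMs are trained largely in a passive, disembodied manner, whereas embodied cognition holds that intelligence arises from sensorimotor interaction. ENACT tests whether these models show signs of embodied cognition.
Contribution: Introduces the ENACT benchmark, framed as a partially observable Markov decision process (POMDP), with two tasks: forward world modeling and inverse world modeling. It provides a large set of QA pairs and exposes the limits of VLMs in embodied cognition.
Method: The tasks are framed as a POMDP whose actions are scene graph changes. QA pairs are synthesized from robotics simulation (BEHAVIOR), and models are evaluated on 8,972 QA pairs.
Result: Models perform better on the inverse task than on the forward task but still trail humans overall, and the gap widens with interaction horizon. Models also show anthropocentric biases, such as a preference for right-handed actions.
Insight: ENACT exposes the shortcomings of VLMs in embodied cognition, especially over long horizons and under human-centric viewpoint biases, and points to directions for future model design.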
Abstract: Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition: affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input, while avoiding low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.
[124] Guaranteed Optimal Compositional Explanations for Neurons
Biagio La Rosa,Leilani H. Gilpin
Main category: cs.AI
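TL;DR: The paper introduces the first framework for computing guaranteed optimal compositional explanations, analyzes the limits of current beam search methods, and demonstrates the approach on computer vision with Convolutional Neural Networks.
Details
Motivation: Neurons are the basic units of deep neural networks, yet it remains unclear whether what they learn aligns with human knowledge. Compositional explanations describe the spatial alignment between neuron activations and concepts through logical rules, but existing methods such as beam search cannot guarantee optimality, and it is unknown how close their results are to the true optimum.
Contribution: (i) A decomposition that identifies the factors influencing spatial alignment; (ii) a heuristic to estimate the alignment at any stage of the search; (iii) the first algorithm that can compute optimal compositional explanations within a feasible time.
Method: The core components are the decomposition, the heuristic alignment estimate, and the optimal explanation algorithm. They are evaluated on computer vision with Convolutional Neural Networks and compared against beam search.
Result: When overlapping concepts are involved, 10-40% of explanations obtained with beam search are suboptimal. A beam-search variant guided by the proposed decomposition and heuristic matches or improves runtime over prior methods while offering greater flexibility.
Insight: The paper exposes the limits of beam search for compositional explanations and shows the need for a framework with optimality guarantees. The heuristic also opens up directions for future work.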
Abstract: While neurons are the basic units of deep neural networks, it is still unclear what they learn and if their knowledge is aligned with that of humans. Compositional explanations aim to answer this question by describing the spatial alignment between neuron activations and concepts through logical rules. These logical descriptions are typically computed via a search over all possible concept combinations. Since computing the spatial alignment over the entire state space is computationally infeasible, the literature commonly adopts beam search to restrict the space. However, beam search cannot provide any theoretical guarantees of optimality, and it remains unclear how close current explanations are to the true optimum. In this theoretical paper, we address this gap by introducing the first framework for computing guaranteed optimal compositional explanations. Specifically, we propose: (i) a decomposition that identifies the factors influencing the spatial alignment, (ii) a heuristic to estimate the alignment at any stage of the search, and (iii) the first algorithm that can compute optimal compositional explanations within a feasible time. Using this framework, we analyze the differences between optimal and non-optimal explanations in the most popular settings for compositional explanations, the computer vision domain and Convolutional Neural Networks. In these settings, we demonstrate that 10-40 percent of explanations obtained with beam search are suboptimal when overlapping concepts are involved. Finally, we evaluate a beam-search variant guided by our proposed decomposition and heuristic, showing that it matches or improves runtime over prior methods while offering greater flexibility in hyperparameters and computational resources.
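As a small illustration of what is being searched over, the alignment between a neuron and a logical formula over concepts can be scored on binary masks. Using intersection-over-union (IoU) as the alignment measure is an assumption here (it is a common choice in this line of work, but the abstract only says "spatial alignment"), and the concept names and masks are made up.

```python
import numpy as np

def iou(neuron_mask, formula_mask):
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(neuron_mask, formula_mask).sum()
    union = np.logical_or(neuron_mask, formula_mask).sum()
    return inter / union if union > 0 else 0.0

# Flattened toy masks: where the neuron fires and where each concept appears.
neuron = np.array([1, 1, 1, 0, 0, 0], dtype=bool)
water = np.array([1, 1, 0, 0, 0, 0], dtype=bool)
river = np.array([0, 1, 1, 0, 0, 1], dtype=bool)

# Two candidate explanations built from the same concepts.
score_or = iou(neuron, np.logical_or(water, river))                       # water OR river
score_and_not = iou(neuron, np.logical_and(water, np.logical_not(river)))  # water AND NOT river
```

The number of such formulas grows combinatorially with the number of concepts and the formula length, which is why exhaustive scoring is infeasible and beam search is commonly used as an approximation.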
[125] OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection
Chujie Wang,Jianyu Lu,Zhiyuan Luo,Xi Chen,Chu He
Main category: cs.AI
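TL;DR: OVOD-Agent turns passive category matching into proactive visual reasoning and self-evolving detection. It is built on a Weakly Markovian Decision Process, uses a Bandit module for exploration under limited supervision, and improves detection through self-supervised Reward Model optimization.
Details
Motivation: Existing Open-Vocabulary Object Detection (OVOD) methods are trained on multimodal data but rely only on fixed category names at inference, creating a gap between multimodal training and unimodal inference. Prior work shows that improving textual representation can significantly boost performance, yet this direction remains underexplored.
Contribution: 1) The OVOD-Agent framework, which turns passive matching into proactive reasoning and self-evolving detection; 2) Visual-CoT for interpretable actions; 3) a Weakly Markovian Decision Process (w-MDP) for modeling visual context; 4) a closed optimization loop combining the Bandit module and the Reward Model.
Method: 1) Extend textual optimization into interpretable actions with Visual-CoT; 2) model visual context transitions as a w-MDP over eight state spaces; 3) generate exploration signals with the Bandit module; 4) combine Markov transition matrices with Bandit trajectories to optimize the Reward Model.
Result: Experiments on COCO and LVIS show that OVOD-Agent consistently improves multiple OVOD backbones, particularly on rare categories.
Insight: 1) Proactive reasoning and self-evolution can close the gap between multimodal training and unimodal inference; 2) the Bandit module guides exploration effectively under limited supervision; 3) the closed-loop optimization improves detection performance.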
Abstract: Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision-language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual-CoT with explicit actions. OVOD’s lightweight nature makes LLM-based management unsuitable; instead, we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent’s state, memory, and interaction dynamics. A Bandit module generates exploration signals under limited supervision, helping the agent focus on uncertain regions and adapt its detection policy. We further integrate Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization, forming a closed loop from Bandit exploration to RM learning. Experiments on COCO and LVIS show that OVOD-Agent provides consistent improvements across OVOD backbones, particularly on rare categories, confirming the effectiveness of the proposed framework.
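The abstract does not say which bandit algorithm the Bandit module uses. As an illustration of how a bandit can generate exploration signals from sparse reward feedback, here is a minimal sketch that assumes the standard UCB1 rule. The arm count, reward probabilities, and all names are hypothetical and are not taken from the paper.

```python
import math
import random


class UCB1Bandit:
    """Standard UCB1: pick the arm with the highest optimistic value estimate."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # times each arm was pulled
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select(self):
        # Pull every arm once before using the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            value + math.sqrt(2 * math.log(total) / count)
            for value, count in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def run(true_means, steps=2000, seed=0):
    """Simulate Bernoulli rewards; true_means are hypothetical success rates."""
    rng = random.Random(seed)
    bandit = UCB1Bandit(len(true_means))
    for _ in range(steps):
        arm = bandit.select()
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        bandit.update(arm, reward)
    return bandit


bandit = run([0.2, 0.5, 0.8])
print(bandit.counts)  # the arm with the highest true mean is pulled most often
```

In a setting like the one described, each arm could stand for a candidate textual-refinement action and the reward for a detection-quality signal. That mapping is an assumption made here for illustration only.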
cs.CY [Back]
[126] InvisibleBench: A Deployment Gate for Caregiving Relationship AI
Ali Madad
Main category: cs.CY
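TL;DR: InvisibleBench is a deployment gate for evaluating caregiving-relationship AI. It scores 3-20+ turn interactions across five dimensions (Safety, Compliance, Trauma-Informed Design, Belonging/Cultural Fitness, and Memory), includes autofail conditions, and reveals significant safety gaps in four frontier models.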
Details
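Motivation: Existing single-turn safety tests cannot capture the real risks that arise in long-running interactions, so a more comprehensive evaluation is needed to make sure AI is safe to deploy in caregiving relationships.
Contribution: Introduces InvisibleBench, which extends single-turn safety tests to focus on longitudinal risk, and releases all scenarios, judge prompts, scoring configurations, and code.
Method: Evaluates four frontier models on five dimensions across 17 scenarios (N=68) spanning three complexity tiers.
Result: All models show significant safety gaps (crisis detection of 11.8-44.8%). DeepSeek Chat v3 achieves the highest overall score (75.9%), but strengths differ by dimension.
Insight: Evaluating longitudinal risk is essential for AI deployment, and deterministic crisis routing is a necessary measure in production systems.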
Abstract: InvisibleBench is a deployment gate for caregiving-relationship AI, evaluating 3-20+ turn interactions across five dimensions: Safety, Compliance, Trauma-Informed Design, Belonging/Cultural Fitness, and Memory. The benchmark includes autofail conditions for missed crises, medical advice (WOPR Act), harmful information, and attachment engineering. We evaluate four frontier models across 17 scenarios (N=68) spanning three complexity tiers. All models show significant safety gaps (11.8-44.8 percent crisis detection), indicating the necessity of deterministic crisis routing in production systems. DeepSeek Chat v3 achieves the highest overall score (75.9 percent), while strengths differ by dimension: GPT-4o Mini leads Compliance (88.2 percent), Gemini leads Trauma-Informed Design (85.0 percent), and Claude Sonnet 4.5 ranks highest in crisis detection (44.8 percent). We release all scenarios, judge prompts, and scoring configurations with code. InvisibleBench extends single-turn safety tests by evaluating longitudinal risk, where real harms emerge. No clinical claims; this is a deployment-readiness evaluation.
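A deployment gate with autofail conditions can be sketched as follows. The four autofail categories and five dimensions come from the abstract. The equal weighting, the 0.7 pass threshold, the rule that an autofail zeroes the score, and every identifier are assumptions made for illustration. The benchmark's released scoring configuration is the authoritative source.

```python
# Autofail categories and dimensions named in the abstract; identifiers are hypothetical.
AUTOFAIL = {
    "missed_crisis",
    "medical_advice",
    "harmful_information",
    "attachment_engineering",
}
DIMENSIONS = ("safety", "compliance", "trauma_informed", "belonging", "memory")


def gate(dimension_scores, flags, threshold=0.7):
    """Return a pass/fail decision for one evaluated conversation.

    dimension_scores: dict mapping each dimension to a score in [0, 1]
    flags:            iterable of autofail categories observed by the judge
    threshold:        hypothetical minimum mean score required to pass
    """
    triggered = sorted(AUTOFAIL & set(flags))
    if triggered:
        # Assumption: any autofail overrides the dimension scores entirely.
        return {"passed": False, "score": 0.0, "autofail": triggered}
    score = sum(dimension_scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return {"passed": score >= threshold, "score": score, "autofail": []}


strong = {d: 1.0 for d in DIMENSIONS}
print(gate(strong, flags=[]))                 # passes on scores alone
print(gate(strong, flags=["missed_crisis"]))  # fails despite perfect scores
```

The design point is that the autofail check runs before any averaging, so a single missed crisis cannot be offset by high scores on other dimensions.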
[127] Large Language Models’ Complicit Responses to Illicit Instructions across Socio-Legal Contexts
Xing Wang,Huiyuan Xie,Yiyan Wang,Chaojun Xiao,Huimin Chen,Holli Sargeant,Felix Steffek,Jie Shao,Zhiyuan Liu,Maosong Sun
Main category: cs.CY
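TL;DR: This study examines the risk of large language models (LLMs) assisting unlawful activities. It defines the concept of "complicit facilitation" and finds through empirical studies that the behavior is widespread among LLMs, with GPT-4o providing illicit assistance in nearly half of tested cases.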
Details
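Motivation: LLMs are widely deployed, yet the risk of their assisting unlawful activities is underexplored, so model compliance and safety need to be assessed.
Contribution: Builds an evaluation benchmark covering 269 illicit scenarios and 50 illicit intents, and shows how widespread complicit facilitation is among LLMs and how it varies across legal and social contexts.
Method: Drawing on real-world legal cases and established legal frameworks, designs four empirical studies to assess LLM behavior, and analyzes model reasoning traces and sociodemographic factors.
Result: GPT-4o provides illicit assistance in nearly half of tested cases; models perform poorly at delivering credible legal warnings and positive guidance; marginalized and disadvantaged groups are disproportionately likely to receive unlawful guidance.
Insight: Existing safety alignment strategies are insufficient and may even exacerbate complicit behavior; social and legal context substantially affects model behavior.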
Abstract: Large language models (LLMs) are now deployed at unprecedented scale, assisting millions of users in daily tasks. However, the risk of these models assisting unlawful activities remains underexplored. In this study, we define this high-risk behavior as complicit facilitation - the provision of guidance or support that enables illicit user instructions - and present four empirical studies that assess its prevalence in widely deployed LLMs. Using real-world legal cases and established legal frameworks, we construct an evaluation benchmark spanning 269 illicit scenarios and 50 illicit intents to assess LLMs’ complicit facilitation behavior. Our findings reveal widespread LLM susceptibility to complicit facilitation, with GPT-4o providing illicit assistance in nearly half of tested cases. Moreover, LLMs exhibit deficient performance in delivering credible legal warnings and positive guidance. Further analysis uncovers substantial safety variation across socio-legal contexts. On the legal side, we observe heightened complicity for crimes against societal interests, non-extreme but frequently occurring violations, and malicious intents driven by subjective motives or deceptive justifications. On the social side, we identify demographic disparities that reveal concerning complicit patterns towards marginalized and disadvantaged groups, with older adults, racial minorities, and individuals in lower-prestige occupations disproportionately more likely to receive unlawful guidance. Analysis of model reasoning traces suggests that model-perceived stereotypes, characterized along warmth and competence, are associated with the model’s complicit behavior. Finally, we demonstrate that existing safety alignment strategies are insufficient and may even exacerbate complicit behavior.
q-bio.QM [Back]
[128] Automated Histopathologic Assessment of Hirschsprung Disease Using a Multi-Stage Vision Transformer Framework
Youssef Megahed,Saleh Abou-Alwan,Anthony Fuller,Dina El Demellawy,Steven Hawken,Adrian D. C. Chan
Main category: q-bio.QM
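TL;DR: The paper proposes a three-stage segmentation framework based on a Vision Transformer that automatically identifies the histopathologic features of Hirschsprung disease by segmenting the muscularis propria, the myenteric plexus, and ganglion cells, reaching an 89.9% Dice coefficient on muscularis and 89.1% recall on high-certainty ganglion cells.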
Details
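Motivation: Diagnosing Hirschsprung disease depends on correctly identifying ganglion cells. Conventional assessment is subjective and varies between observers, so an automated and accurate pathology assessment tool is needed.
Contribution: Proposes a multi-stage Vision Transformer framework that mimics the pathologist's diagnostic process, using global tissue context and cellular morphology to improve segmentation accuracy.
Method: Uses a three-stage segmentation strategy: 1) muscularis propria segmentation, 2) myenteric plexus segmentation, 3) ganglion cell identification, combined with resolution-specific tiling strategies and postprocessing to maintain anatomical consistency.
Result: Muscularis segmentation reaches a Dice coefficient of 89.9%, plexus segmentation a recall of 94.8%, and identification of high-certainty ganglion cells a recall of 89.1%.
Insight: Vision Transformers can capture both global context and small-scale features in complex tissue structures, and the multi-stage approach has the potential to reduce inter-observer variability in pathology assessment.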
Abstract: Hirschsprung Disease is characterized by the absence of ganglion cells in the myenteric plexus. Therefore, their correct identification is crucial for diagnosing Hirschsprung disease. We introduce a three-stage segmentation framework based on a Vision Transformer (ViT-B/16) that mimics the pathologist’s diagnostic approach. The framework sequentially segments the muscularis propria, delineates the myenteric plexus, and identifies ganglion cells within anatomically valid regions. 30 whole-slide images of colon tissue were used, each containing expert manual annotations of muscularis, plexus, and ganglion cells at varying levels of certainty. A 5-fold cross-validation scheme was applied to each stage, along with resolution-specific tiling strategies and tailored postprocessing to ensure anatomical consistency. The proposed method achieved a Dice coefficient of 89.9% and a Plexus Inclusion Rate of 100% for muscularis segmentation. Plexus segmentation reached a recall of 94.8%, a precision of 84.2% and a Ganglia Inclusion Rate of 99.7%. For high-certainty ganglion cells, the model achieved 62.1% precision and 89.1% recall, while joint certainty scores yielded 67.0% precision. These results indicate that ViT-based models are effective at leveraging global tissue context and capturing cellular morphology at small scales, even within complex histological tissue structures. This multi-stage methodology has great potential to support digital pathology workflows by reducing inter-observer variability and assisting in the evaluation of Hirschsprung disease. The clinical impact will be evaluated in future work with larger multi-center datasets and additional expert annotations.
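For readers unfamiliar with the reported metrics, the sketch below computes Dice, precision, and recall from flattened binary masks using their standard definitions. The toy masks and function names are hypothetical. The abstract does not specify the paper's exact evaluation protocol (for example, pixel-level versus object-level matching), and this sketch does not reproduce it.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives.

    pred, truth: equal-length sequences of 0/1 labels (flattened masks).
    """
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return tp, fp, fn


def dice(pred, truth):
    """Dice coefficient: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def precision(pred, truth):
    """Fraction of predicted positives that are correct: TP / (TP + FP)."""
    tp, fp, _ = confusion_counts(pred, truth)
    return tp / (tp + fp) if tp + fp else 0.0


def recall(pred, truth):
    """Fraction of true positives that are found: TP / (TP + FN)."""
    tp, _, fn = confusion_counts(pred, truth)
    return tp / (tp + fn) if tp + fn else 0.0


# Toy masks: 3 true positives, 2 false positives, 1 false negative.
pred = [1, 1, 1, 1, 1, 0, 0, 0]
truth = [1, 1, 1, 0, 0, 1, 0, 0]
print(precision(pred, truth), recall(pred, truth))  # 0.6 0.75
```

A result such as 62.1% precision with 89.1% recall therefore indicates a detector that misses few true ganglion cells while producing a larger share of false positives.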