cs.CL

[1] MemeMind: A Large-Scale Multimodal Dataset with Chain-of-Thought Reasoning for Harmful Meme Detection

Hexiang Gu,Qifan Yu,Saihui Hou,Zhiqin Fang,Huijia Wu,Zhaofeng He

Main category: cs.CL

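TL;DR: Introduces the MemeMind dataset and the MemeGuard framework for harmful meme detection, using Chain-of-Thought (CoT) annotations and multimodal modeling to improve model performance.

Motivation: The rapid growth of social media has intensified the spread of harmful content, and existing datasets are neither systematic nor explainable enough, which hinders progress in harmful meme detection.

Contribution: Proposes the MemeMind dataset (large-scale, bilingual in Chinese and English, with CoT annotations) and the MemeGuard framework (combining multimodal information with reasoning modeling).

Method: The dataset carries detailed Chain-of-Thought annotations; the framework integrates multimodal information with reasoning process modeling.

Result: In experiments on MemeMind, MemeGuard consistently outperforms existing state-of-the-art methods.

Insight: Chain-of-Thought annotations and multimodal modeling are both valuable for harmful meme detection.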

Abstract: The rapid development of social media has intensified the spread of harmful content. Harmful memes, which integrate both images and text, pose significant challenges for automated detection due to their implicit semantics and complex multimodal interactions. Although existing research has made progress in detection accuracy and interpretability, the lack of a systematic, large-scale, diverse, and highly explainable dataset continues to hinder further advancement in this field. To address this gap, we introduce MemeMind, a novel dataset featuring scientifically rigorous standards, large scale, diversity, bilingual support (Chinese and English), and detailed Chain-of-Thought (CoT) annotations. MemeMind fills critical gaps in current datasets by offering comprehensive labeling and explicit reasoning traces, thereby providing a solid foundation for enhancing harmful meme detection. In addition, we propose an innovative detection framework, MemeGuard, which effectively integrates multimodal information with reasoning process modeling, significantly improving models’ ability to understand and identify harmful memes. Extensive experiments conducted on the MemeMind dataset demonstrate that MemeGuard consistently outperforms existing state-of-the-art methods in harmful meme detection tasks.

[2] Mirage of Mastery: Memorization Tricks LLMs into Artificially Inflated Self-Knowledge

Sahil Kale,Vijaykant Nadadur

Main category: cs.CL

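TL;DR: The study shows that large language models (LLMs) mistake memorization for reasoning ability, which leads to overconfident self-knowledge assessments, especially in STEM domains.

Motivation: Existing work treats memorization and self-knowledge deficits as separate problems and overlooks the link between them, which degrades the trustworthiness of LLM responses. The study asks whether LLMs genuinely learn reasoning patterns or merely memorize solutions from their training data.

Contribution: Proposes a new framework for assessing whether LLMs learn reasoning patterns from training data or only memorize solutions, and exposes LLM overconfidence in self-knowledge assessment, particularly in science and medicine.

Method: Analyzes LLM behavior under logically coherent task perturbations to measure how consistent its self-knowledge assessments are. The study focuses on STEM domains and records inconsistencies in feasibility assessments.

Result: LLMs show over 45% inconsistency in feasibility assessments, most pronounced in science and medicine. This indicates that they lean on memorized solutions and overestimate their own reasoning ability.

Insight: LLMs conflate memorization with reasoning ability, which points to flaws in current architectures and training patterns and to the need for techniques that give models a balanced, consistent view of their own knowledge.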

Abstract: When artificial intelligence mistakes memorization for intelligence, it creates a dangerous mirage of reasoning. Existing studies treat memorization and self-knowledge deficits in LLMs as separate issues and do not recognize an intertwining link that degrades the trustworthiness of LLM responses. In our study, we utilize a novel framework to ascertain if LLMs genuinely learn reasoning patterns from training data or merely memorize them to assume competence across problems of similar complexity focused on STEM domains. Our analysis shows a noteworthy problem in generalization: LLMs draw confidence from memorized solutions to infer a higher self-knowledge about their reasoning ability, which manifests as an over 45% inconsistency in feasibility assessments when faced with self-validated, logically coherent task perturbations. This effect is most pronounced in science and medicine domains, which tend to have maximal standardized jargon and problems, further confirming our approach. Significant wavering within the self-knowledge of LLMs also shows flaws in current architectures and training patterns, highlighting the need for techniques that ensure a balanced, consistent stance on models’ perceptions of their own knowledge for maximum AI explainability and trustworthiness. Our code and results are available publicly at https://github.com/knowledge-verse-ai/LLM-Memorization_SK_Eval-.

[3] Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

Brian Siyuan Zheng,Alisa Liu,Orevaoghene Ahia,Jonathan Hayase,Yejin Choi,Noah A. Smith

Main category: cs.CL

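TL;DR: The paper finds that language models (LMs) are surprisingly robust to non-canonical tokenizations, even ones never seen during training. Instruction-tuned models retain up to 93.4% of their original performance, and on some tasks non-canonical tokenization improves results (for example, character-level segmentation improves string manipulation and code tasks by up to +14%). The robustness arises mainly in the instruction-tuning phase.

Motivation: Modern tokenizers use deterministic algorithms to map text to a single "canonical" token sequence, yet the same string can be encoded as many different non-canonical token sequences using the tokenizer vocabulary. The paper examines how robust language models are to such non-canonical tokenizations.

Contribution: 1. Shows that language models are highly robust to non-canonical tokenizations (instruction-tuned models retain up to 93.4% of performance); 2. Shows that non-canonical tokenization can improve performance on some tasks (character-level segmentation and right-aligned digit grouping); 3. Traces the robustness to the instruction-tuning phase.

Method: 1. Evaluates how language model performance changes under non-canonical tokenizations across 20 benchmarks; 2. Examines the effect of different schemes (such as character-level and randomly sampled tokenization) on tasks; 3. Analyzes the source of the robustness by comparing base and instruction-tuned models.

Result: 1. Instruction-tuned models retain up to 93.4% of their performance with a randomly sampled tokenization and 90.8% with character-level tokenization; 2. Character-level segmentation improves string manipulation and code understanding tasks by up to +14%, and right-aligned digit grouping improves large-number arithmetic by +33%.

Insight: 1. Models are less tied to their tokenizer than previously believed; 2. Both base and instruction-tuned models grasp the semantics of non-canonical tokenizations, but base models imitate the perceived misspellings and degenerate into nonsensical output, whereas instruction-tuned models stay committed to fluent responses; 3. Intervening on tokenization at inference time can improve performance on specific tasks.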

Abstract: Modern tokenizers employ deterministic algorithms to map text into a single “canonical” token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the robustness of LMs to text encoded with non-canonical tokenizations entirely unseen during training. Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization, and 90.8% with character-level tokenization. We see that overall stronger models tend to be more robust, and robustness diminishes as the tokenization departs farther from the canonical form. Motivated by these results, we then identify settings where non-canonical tokenization schemes can improve performance, finding that character-level segmentation improves string manipulation and code understanding tasks by up to +14%, and right-aligned digit grouping enhances large-number arithmetic by +33%. Finally, we investigate the source of this robustness, finding that it arises in the instruction-tuning phase. We show that while both base and post-trained models grasp the semantics of non-canonical tokenizations (perceiving them as containing misspellings), base models try to mimic the imagined mistakes and degenerate into nonsensical output, while post-trained models are committed to fluent responses. Overall, our findings suggest that models are less tied to their tokenizer than previously believed, and demonstrate the promise of intervening on tokenization at inference time to boost performance.
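
The segmentation schemes named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the helper names and the toy vocabulary `TOY_VOCAB` are assumptions made for illustration, and a real tokenizer vocabulary would be far larger. It shows character-level segmentation, right-aligned digit grouping, and how one string admits several segmentations under a fixed vocabulary.

```python
def char_level(text):
    """Character-level segmentation: one piece per character."""
    return list(text)


def right_aligned_digit_groups(number, size=3):
    """Split a digit string into groups of `size`, aligned from the right."""
    digits = str(number)
    head = len(digits) % size  # length of the short leading group, if any
    groups = [digits[:head]] if head else []
    groups += [digits[i:i + size] for i in range(head, len(digits), size)]
    return groups


def segmentations(text, vocab):
    """Enumerate every way to split `text` into pieces drawn from `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results


# Hypothetical toy vocabulary, not taken from any real tokenizer.
TOY_VOCAB = {"c", "a", "t", "ca", "at", "cat"}

if __name__ == "__main__":
    print(char_level("cat"))
    print(right_aligned_digit_groups(1234567))
    for seg in segmentations("cat", TOY_VOCAB):
        print(seg)
```

Under this toy vocabulary, a tokenizer would typically emit only one of the enumerated segmentations as canonical; the others are the non-canonical encodings the paper studies.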

[4] Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective

Weijie Xu,Yiwen Wang,Chi Xue,Xiangkun Hu,Xi Fang,Guimin Dong,Chandan K. Reddy

Main category: cs.CL

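TL;DR: Proposes FiSCo, a framework for evaluating fairness in large language models (LLMs) that uses semantic and statistical methods to detect subtle bias in long-form text.

Motivation: Existing methods struggle to capture bias in long-form responses and the intrinsic variability of LLM outputs; FiSCo is designed to address both.

Contribution: Proposes the FiSCo framework, formalizes a new definition of group counterfactual fairness, and detects bias through semantic decomposition and statistical hypothesis testing.

Method: Decomposes model outputs into semantically distinct claims, then uses entailment checks and statistical hypothesis testing to compare inter-group and intra-group similarity.

Result: Experiments show that FiSCo identifies nuanced biases more reliably while reducing the impact of stochastic LLM variability.

Insight: By going beyond token-level analysis and assessing fairness through consistency of meaning, FiSCo offers a new perspective on LLM bias detection.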

Abstract: Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications. Existing evaluation methods often overlook biases in long-form responses and the intrinsic variability of LLM outputs. To address these challenges, we propose FiSCo (Fine-grained Semantic Computation), a novel statistical framework to evaluate group-level fairness in LLMs by detecting subtle semantic differences in long-form responses across demographic groups. Unlike prior work focusing on sentiment or token-level comparisons, FiSCo goes beyond surface-level analysis by operating at the claim level, leveraging entailment checks to assess the consistency of meaning across responses. We decompose model outputs into semantically distinct claims and apply statistical hypothesis testing to compare inter- and intra-group similarities, enabling robust detection of subtle biases. We formalize a new group counterfactual fairness definition and validate FiSCo on both synthetic and human-annotated datasets spanning gender, race, and age. Experiments show that FiSCo more reliably identifies nuanced biases while reducing the impact of stochastic LLM variability, outperforming various evaluation metrics.
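
The comparison of inter- and intra-group similarities can be sketched as a hypothesis test. The abstract does not say which test FiSCo uses, so the one-sided permutation test below is a stand-in chosen for illustration, and the function name and similarity scores are hypothetical. The idea is that if responses to different demographic groups are systematically less similar to each other than responses within a group, the intra-group mean exceeds the inter-group mean by more than chance would explain.

```python
import random
from statistics import mean


def permutation_test(intra, inter, n_resamples=10_000, seed=0):
    """One-sided permutation test: is mean(intra) larger than mean(inter)?

    Returns the observed mean difference and an estimated p-value.
    """
    rng = random.Random(seed)
    observed = mean(intra) - mean(inter)
    pooled = list(intra) + list(inter)
    k = len(intra)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # reassign scores to the two groups at random
        diff = mean(pooled[:k]) - mean(pooled[k:])
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Hypothetical claim-level similarity scores in [0, 1].
    intra_group = [0.90, 0.92, 0.88, 0.91, 0.90, 0.93]
    inter_group = [0.50, 0.52, 0.48, 0.51, 0.50, 0.49]
    diff, p_value = permutation_test(intra_group, inter_group)
    print(f"mean difference = {diff:.3f}, p = {p_value:.4f}")
```

A small p-value here would suggest that the gap between groups is larger than the model's own run-to-run variability, which is the distinction the abstract emphasizes.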

[5] NLPnorth @ TalentCLEF 2025: Comparing Discriminative, Contrastive, and Prompt-Based Methods for Job Title and Skill Matching

Mike Zhang,Rob van der Goot

Main category: cs.CL

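TL;DR: The paper compares classification-based, contrastive, and prompting methods on job title matching and skill prediction, finding that prompting performs best for job title matching while the classification-based approach is better for skill prediction.

Motivation: Job title matching and skill prediction matter in the computational job market domain because they improve tasks such as automatic candidate matching, career path prediction, and job market analysis.

Contribution: Compares classification-based, contrastive, and prompting methods on multilingual job title matching and job title-based skill prediction, and uses extra data to strengthen the models.

Method: Uses (fine-tuned) classification-based, (fine-tuned) contrastive, and prompting methods, together with the language-specific job and skill titles and descriptions from ESCO.

Result: Prompting performs best on job title matching (Task A, MAP: 0.492); the fine-tuned classification-based approach performs best on skill prediction (Task B, MAP: 0.290).

Insight: The largest multilingual language models perform best on both tasks; prompting suits job title matching, while classification suits skill prediction.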

Abstract: Matching job titles is a highly relevant task in the computational job market domain, as it improves, e.g., automatic candidate matching, career path prediction, and job market analysis. Furthermore, aligning job titles to job skills can be considered an extension to this task, with similar relevance for the same downstream tasks. In this report, we outline NLPnorth’s submission to TalentCLEF 2025, which includes both of these tasks: Multilingual Job Title Matching, and Job Title-Based Skill Prediction. For both tasks we compare (fine-tuned) classification-based, (fine-tuned) contrastive-based, and prompting methods. We observe that for Task A, our prompting approach performs best with an average of 0.492 mean average precision (MAP) on test data, averaged over English, Spanish, and German. For Task B, we obtain an MAP of 0.290 on test data with our fine-tuned classification-based approach. Additionally, we made use of extra data by pulling all the language-specific titles and corresponding descriptions from ESCO for each job and skill. Overall, we find that the largest multilingual language models perform best for both tasks. Per the provisional results and only counting the unique teams, the ranking on Task A is 5th/20 and for Task B 3rd/14.
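
Both tasks are scored with mean average precision (MAP). The sketch below implements the standard textbook definition in Python; the shared task's official scorer may differ in details such as cutoffs or tie handling, and the job titles in the example are invented for illustration.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical queries: each pairs a ranked candidate list with its gold set.
    runs = [
        (["data engineer", "chef", "ETL developer", "nurse"],
         {"data engineer", "ETL developer"}),
        (["pilot", "staff nurse"], {"staff nurse"}),
    ]
    print(round(mean_average_precision(runs), 4))
```

For the first query the relevant items sit at ranks 1 and 3, giving (1/1 + 2/3) / 2; the second query has its only relevant item at rank 2, giving 1/2. MAP is the mean of those two values.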

[6] MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Hate Speech Multi-hop Explanation

Jackson Trager,Francielle Vargas,Diego Alves,Matteo Guida,Mikel K. Ngueajio,Ameeta Agrawal,Flor Plaza-del-Arco,Yalda Daryanai,Farzan Karimi-Malekabadi

Main category: cs.CL

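TL;DR: MFTCXplain is a multilingual benchmark dataset that evaluates the moral reasoning of LLMs through multi-hop explanation of hate speech, and it exposes the limits of LLM moral reasoning.

Motivation: Current benchmarks for LLM moral reasoning have two major shortcomings: they lack annotations that justify moral classifications, and they focus predominantly on English, which constrains evaluation across cultural settings.

Contribution: Introduces the MFTCXplain dataset, with hate speech labels, moral categories, and text span-level rationales in several languages, filling a gap in multilingual evaluation of moral reasoning.

Method: Annotates 3,000 tweets using Moral Foundation Theory (MFT), covering binary hate speech labels, moral categories, and text span-level rationales.

Result: LLMs perform well on hate speech detection (F1 up to 0.836) but poorly on predicting moral sentiments (F1 < 0.35), and rationale alignment remains limited, mainly in underrepresented languages.

Insight: Current LLMs have limited capacity to internalize and reflect human moral reasoning, especially in multilingual and cross-cultural settings.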

Abstract: Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via hate speech multi-hop explanation using Moral Foundation Theory (MFT). The dataset comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Empirical results highlight a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited mainly in underrepresented languages. These findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning.

[7] Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting

Nathaniel Getachew,Abulhair Saparov

Main category: cs.CL

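TL;DR: The paper introduces $\texttt{StorySim}$, a framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Models perform better on WM tasks than on ToM tasks, and reason better about humans than about inanimate objects. The study also finds evidence of heuristic behavior, such as recency bias and over-reliance on earlier events.

Motivation: Existing benchmarks may be compromised by contamination in pretraining data. A controllable framework is needed to evaluate the ToM and WM capabilities of LLMs precisely.

Contribution: 1. Proposes the $\texttt{StorySim}$ framework, which generates novel, compositional story prompts for evaluating ToM and WM; 2. Shows the limits of LLMs on ToM tasks and their relative strength when reasoning about humans; 3. Identifies heuristic behavior patterns in the models.

Method: 1. Builds a controllable story generation framework anchored by a $\texttt{Storyboard}$; 2. Designs first- and second-order ToM tasks alongside WM tasks; 3. Runs the evaluation across several LLMs.

Result: 1. Most models perform better on WM tasks than on ToM tasks; 2. Models reason better about humans than about inanimate objects; 3. Models show recency bias and reliance on earlier events.

Insight: 1. The theory of mind capabilities of current LLMs remain limited; 2. The controllability of the framework helps expose model limitations; 3. Heuristic behavior may be an underlying weakness in model reasoning.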

Abstract: We introduce $\texttt{StorySim}$, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, $\texttt{StorySim}$ produces novel, compositional story prompts anchored by a highly controllable $\texttt{Storyboard}$, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of state-of-the-art LLMs reveal that most models perform better on WM tasks than ToM tasks, and that models tend to perform better reasoning with humans compared to inanimate objects. Additionally, our framework enabled us to find evidence of heuristic behavior such as recency bias and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.

[8] Human-Aligned Faithfulness in Toxicity Explanations of LLMs

Ramaravind K. Mothilal,Joanna Roy,Syed Ishtiaque Ahmed,Shion Guha

Main category: cs.CL

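TL;DR: The paper proposes a novel criterion, Human-Aligned Faithfulness (HAF), for evaluating the toxicity explanations generated by large language models (LLMs), and quantifies their agreement with the explanations of a rational human through six metrics.

Motivation: Existing explainability methods rely too heavily on input text perturbations, which makes them hard to apply directly to the free-form toxicity explanations that LLMs produce. The paper aims to fill this gap and improve the trustworthiness of LLMs in downstream tasks.

Contribution: Proposes the HAF criterion and six associated metrics that evaluate LLM toxicity explanations comprehensively with no human involvement, and exposes reasoning failures under nuanced prompts.

Method: Develops six metrics based on uncertainty quantification and runs several experiments on Llama and Ministral models to test HAF.

Result: LLMs generate plausible explanations for simple prompts, but their reasoning breaks down when prompts concern nuanced relations among the reasons and the toxicity stance, producing inconsistent and nonsensical responses.

Insight: LLM performance on toxicity explanation depends strongly on prompt complexity, so prompt design is critical to how well the reasoning holds up.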

Abstract: The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs’ reasoning about toxicity – from their explanations that justify a stance – to enhance their trustworthiness in downstream tasks. Despite extensive research on explainability, it is not straightforward to adopt existing methods to evaluate free-form toxicity explanation due to their over-reliance on input text perturbations, among other challenges. To account for these, we propose a novel, theoretically-grounded multi-dimensional criterion, Human-Aligned Faithfulness (HAF), that measures the extent to which LLMs’ free-form toxicity explanations align with those of a rational human under ideal conditions. We develop six metrics, based on uncertainty quantification, to comprehensively evaluate HAF of LLMs’ toxicity explanations with no human involvement, and highlight how “non-ideal” the explanations are. We conduct several experiments on three Llama models (of size up to 70B) and an 8B Ministral model on five diverse toxicity datasets. Our results show that while LLMs generate plausible explanations to simple prompts, their reasoning about toxicity breaks down when prompted about the nuanced relations between the complete set of reasons, the individual reasons, and their toxicity stances, resulting in inconsistent and nonsensical responses. We open-source our code and LLM-generated explanations at https://github.com/uofthcdslab/HAF.

[9] Augmenting Multi-Agent Communication with State Delta Trajectory

Yichen Tang,Weihang Su,Yujia Zhou,Yiqun Liu,Min Zhang,Shaoping Ma,Qingyao Ai

Main category: cs.CL

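TL;DR: The paper proposes a new multi-agent communication protocol, State Delta Trajectory, which passes both natural language tokens and token-wise state transition trajectories to reduce information loss. It improves multi-agent system performance, especially on complex reasoning tasks.

Motivation: Existing multi-agent systems based on large language models (LLMs) communicate mainly in natural language. This is simple and interpretable but loses information, particularly when the content is reasoning logic or abstract thought.

Contribution: Proposes State Delta Encoding (SDE), which transfers the token-wise trajectory of state changes rather than the actual state values, reflecting the information hidden in the inference process more effectively and improving multi-agent communication.

Method: Designs a new communication protocol that combines natural language tokens with SDE-encoded state delta trajectories, and validates it experimentally in multi-agent systems.

Result: Multi-agent systems using SDE achieve SOTA performance compared with other communication protocols, particularly on complex reasoning tasks.

Insight: The sequence of state changes, rather than static state values, better captures the dynamics of the reasoning process and suggests a new direction for optimizing multi-agent communication.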

Abstract: Multi-agent techniques such as role playing or multi-turn debates have been shown to be effective in improving the performance of large language models (LLMs) in downstream tasks. Despite their differences in workflows, existing LLM-based multi-agent systems mostly use natural language for agent communication. While this is appealing for its simplicity and interpretability, it also introduces inevitable information loss as one model must downsample its continuous state vectors to concrete tokens before transferring them to the other model. Such losses are particularly significant when the information to transfer is not simple facts, but reasoning logic or abstract thoughts. To tackle this problem, we propose a new communication protocol that transfers both natural language tokens and token-wise state transition trajectory from one agent to another. Particularly, compared to the actual state value, we find that the sequence of state changes in LLMs after generating each token can better reflect the information hidden behind the inference process, so we propose a State Delta Encoding (SDE) method to represent state transition trajectories. The experimental results show that multi-agent systems with SDE achieve SOTA performance compared to other communication protocols, particularly in tasks that involve complex reasoning. This shows the potential of communication augmentation for LLM-based multi-agent systems.
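
The central idea, representing a trajectory by the change in state after each generated token instead of by the states themselves, can be shown with a small Python sketch. This is only an illustration of the delta representation: the abstract does not specify which hidden states SDE uses or how the receiving agent consumes them, and the function names and toy vectors are assumptions.

```python
def state_deltas(states):
    """Element-wise difference between each state vector and its predecessor."""
    return [
        [cur_x - prev_x for prev_x, cur_x in zip(prev, cur)]
        for prev, cur in zip(states, states[1:])
    ]


def reconstruct(first_state, deltas):
    """Rebuild the full trajectory from the first state and the deltas."""
    states = [list(first_state)]
    for delta in deltas:
        states.append([s + d for s, d in zip(states[-1], delta)])
    return states


if __name__ == "__main__":
    # Toy 3-dimensional "hidden states" after each of four generated tokens.
    trajectory = [[1, 0, 2], [2, 0, 1], [2, 3, 1], [5, 3, 0]]
    deltas = state_deltas(trajectory)
    print(deltas)
    print(reconstruct(trajectory[0], deltas) == trajectory)
```

A trajectory of n states yields n - 1 deltas, each of which isolates what changed when one token was generated.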

[10] Personality Prediction from Life Stories using Language Models

Rasiq Hussain,Jerry Ma,Rithik Khandelwal,Joshua Oltmanns,Mehak Gupta

Main category: cs.CL

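TL;DR: Proposes a two-step method that combines pretrained language models with an attention mechanism to predict Big Five personality traits from long life stories, outperforming existing long-context models.

Motivation: Traditional personality assessment relies on questionnaires, which lack richness and openness. NLP can draw on long narrative text to offer a more natural form of assessment.

Contribution: Proposes a two-step method that combines sliding-window fine-tuning of pretrained models with an attention-based RNN, improving personality prediction on long texts.

Method: 1. Fine-tunes pretrained language models with a sliding window to extract contextual embeddings; 2. Applies an RNN with attention to integrate long-range dependencies.

Result: Ablation studies and comparisons with models such as LLaMA and Longformer show gains in accuracy, efficiency, and interpretability.

Insight: Combining language-based features with long-context modeling extracts personality traits from narrative text more effectively and advances language-based personality assessment.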

Abstract: Natural Language Processing (NLP) offers new avenues for personality assessment by leveraging rich, open-ended text, moving beyond traditional questionnaires. In this study, we address the challenge of modeling long narrative interviews, each exceeding 2000 tokens, to predict Five-Factor Model (FFM) personality traits. We propose a two-step approach: first, we extract contextual embeddings using sliding-window fine-tuning of pretrained language models; then, we apply Recurrent Neural Networks (RNNs) with attention mechanisms to integrate long-range dependencies and enhance interpretability. This hybrid method effectively bridges the strengths of pretrained transformers and sequence modeling to handle long-context data. Through ablation studies and comparisons with state-of-the-art long-context models such as LLaMA and Longformer, we demonstrate improvements in prediction accuracy, efficiency, and interpretability. Our results highlight the potential of combining language-based features with long-context modeling to advance personality assessment from life narratives.
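
The first step depends on cutting each long interview into overlapping windows that fit a pretrained encoder. The Python sketch below shows one common way to do that; the window size of 512 and stride of 256 are assumptions, since the abstract does not give the values used, and the function name is hypothetical.

```python
def sliding_windows(tokens, window=512, stride=256):
    """Split a token sequence into overlapping chunks of at most `window` tokens.

    With stride <= window, consecutive chunks overlap and every token is
    covered; the final chunk may be shorter than `window`.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    tokens = list(tokens)
    if len(tokens) <= window:
        return [tokens]
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this chunk already reaches the end of the sequence
        start += stride
    return chunks


if __name__ == "__main__":
    # Stand-in for a tokenized interview of 2,100 tokens.
    interview = list(range(2100))
    chunks = sliding_windows(interview)
    print(len(chunks), [len(c) for c in chunks])
```

Each chunk would then be encoded separately, and the resulting sequence of chunk embeddings is what a recurrent model with attention could consume in the second step.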

[11] What Matters in LLM-generated Data: Diversity and Its Effect on Model Fine-Tuning

Yuchang Zhu,Zhonghua zhen,Qunshu Lin,Haotong Wei,Xiaolong Sun,Zixuan Yu,Minghao Liu,Zibin Zheng,Liang Chen

Main category: cs.CL

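TL;DR: The paper studies how the diversity of LLM-generated data affects downstream model performance, finding that moderately diverse data can improve performance while highly diverse data can hurt it.

Motivation: As LLM generation improves, using LLM-generated data to train downstream models has become a way to ease data scarcity and cut annotation time. However, iterative training on self-generated data can degrade performance (model collapse), and existing work often overlooks the importance of data diversity.

Contribution: Shows how the diversity of LLM-generated data affects downstream model performance and provides empirical results on balancing diversity against performance.

Method: Experimentally examines how LLM-generated data of varying diversity affects downstream model performance, and studies models trained on data that mixes in different proportions of synthetic data.

Result: With minimal distribution shift, moderately diverse LLM-generated data improves model performance when labeled data is insufficient, whereas highly diverse generated data harms it.

Insight: Data diversity is a key factor in model performance, but it needs balance: moderately diverse generated data is more effective, and pushing diversity too far can backfire.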

Abstract: With the remarkable generative capabilities of large language models (LLMs), using LLM-generated data to train downstream models has emerged as a promising approach to mitigate data scarcity in specific domains and reduce time-consuming annotations. However, recent studies have highlighted a critical issue: iterative training on self-generated data results in model collapse, where model performance degrades over time. Despite extensive research on the implications of LLM-generated data, these works often neglect the importance of data diversity, a key factor in data quality. In this work, we aim to understand the implications of the diversity of LLM-generated data on downstream model performance. Specifically, we explore how varying levels of diversity in LLM-generated data affect downstream model performance. Additionally, we investigate the performance of models trained on data that mixes different proportions of LLM-generated data, which we refer to as synthetic data. Our experimental results show that, with minimal distribution shift, moderately diverse LLM-generated data can enhance model performance in scenarios with insufficient labeled data, whereas highly diverse generated data has a negative impact. We hope our empirical findings will offer valuable guidance for future studies on LLMs as data generators.

[12] EmoStage: A Framework for Accurate Empathetic Response Generation via Perspective-Taking and Phase Recognition

Zhiyang Qi,Keiko Takamizo,Mariko Ukiyo,Michimasa Inaba

Main category: cs.CL

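TL;DR: EmoStage is a framework that improves empathetic response generation through perspective-taking and phase recognition, addressing the weaknesses of current AI counseling systems in understanding psychological states and counseling stages.

Motivation: Rising demand for mental health care has driven the development of AI counseling systems, but current approaches struggle to understand clients' psychological states, to recognize counseling stages, and to avoid dependence on high-quality training data.

Contribution: Proposes the EmoStage framework, which generates empathetic responses using the inference capabilities of open-source LLMs without additional training data, uses perspective-taking to infer clients' states and needs, and uses phase recognition to keep responses aligned with the counseling process.

Method: Uses perspective-taking to infer clients' psychological states and support needs, combined with phase recognition so that responses match the counseling stage.

Result: Experiments in Japanese and Chinese counseling settings show that EmoStage improves the response quality of base models and performs competitively with data-driven methods.

Insight: By relying on LLM inference without additional training data, EmoStage generates more accurate empathetic responses while avoiding privacy concerns and data dependence.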

Abstract: The rising demand for mental health care has fueled interest in AI-driven counseling systems. While large language models (LLMs) offer significant potential, current approaches face challenges, including limited understanding of clients’ psychological states and counseling stages, reliance on high-quality training data, and privacy concerns associated with commercial deployment. To address these issues, we propose EmoStage, a framework that enhances empathetic response generation by leveraging the inference capabilities of open-source LLMs without additional training data. Our framework introduces perspective-taking to infer clients’ psychological states and support needs, enabling the generation of emotionally resonant responses. In addition, phase recognition is incorporated to ensure alignment with the counseling process and to prevent contextually inappropriate or inopportune responses. Experiments conducted in both Japanese and Chinese counseling settings demonstrate that EmoStage improves the quality of responses generated by base models and performs competitively with data-driven methods.

[13] JCAPT: A Joint Modeling Approach for CAPT

Tzu-Hsuan Yang,Yue-Yang He,Berlin Chen

Main category: cs.CL

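TL;DR: The paper proposes JCAPT, a joint modeling approach that combines automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD), using Mamba (a selective state space model) and phonological features to improve the performance and interpretability of CAPT systems.

Motivation: Pronunciation feedback is critical in second language learning. In existing computer-assisted pronunciation training (CAPT) systems, APA and MDD are usually handled separately, although joint modeling of the two can yield mutual benefits.

Contribution: 1) The first study to combine phonological features, state space modeling, and prompting in CAPT; 2) A unified framework that jointly optimizes APA and MDD, improving performance and interpretability.

Method: 1) Models the tasks with Mamba, a selective state space model (SSM); 2) Integrates phonological features and think token strategies; 3) Optimizes APA and MDD with a joint objective.

Result: On the speechocean762 benchmark, the model consistently outperforms prior methods, particularly on the MDD task.

Insight: Joint modeling improves CAPT performance, and combining phonological features with a state space model gives pronunciation training finer-grained temporal reasoning.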

Abstract: Effective pronunciation feedback is critical in second language (L2) learning, for which computer-assisted pronunciation training (CAPT) systems often encompass two key tasks: automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD). Recent work has shown that joint modeling of these two tasks can yield mutual benefits. Our unified framework leverages Mamba, a selective state space model (SSM), while integrating phonological features and think token strategies to jointly enhance interpretability and fine-grained temporal reasoning in APA and MDD. To our knowledge, this is the first study to combine phonological attribution, SSM-based modeling, and prompting in CAPT. A series of experiments conducted on the speechocean762 benchmark demonstrate that our model consistently outperforms prior methods, particularly on the MDD task.

[14] Learning to Disentangle Latent Reasoning Rules with Language VAEs: A Systematic Study

Yingji Zhang,Marco Valentino,Danilo S. Carvalho,André Freitas

Main category: cs.CL

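TL;DR: The paper proposes a way to embed reasoning rules explicitly in Transformer language models through language variational autoencoders (VAEs), in order to improve generalization, interpretability, and controllability.

Motivation: Current Transformer-based language models perform well on Natural Language Inference (NLI) tasks but often rely on memorization rather than rule-based inference. To represent reasoning more explicitly, the study explores how reasoning rules can be embedded in language models.

Contribution: Proposes a complete pipeline for learning reasoning rules in Transformer-based language VAEs, comprising three rule-based reasoning tasks, a theoretical framework, and an end-to-end architecture.

Method: Uses language VAEs to embed reasoning rules explicitly in the model's latent space and disentangles the rules within the encoder's parametric space. In addition, injecting prior knowledge into the Query improves retrieval of stored information from memory.

Result: Reasoning rules form distinct clusters in the encoder's output feature space, and FFN layers preserve the separation of rules better than attention layers. In mathematical reasoning tasks, increasing the sample count beyond a certain threshold no longer improves performance.

Insight: Explicitly embedding and disentangling reasoning rules is an effective way to strengthen reasoning in language models, and the stronger role of FFN layers over attention layers in separating rules suggests a new direction for model optimization.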

Abstract: Incorporating explicit reasoning rules within the latent space of language models (LMs) offers a promising pathway to enhance generalisation, interpretability, and controllability. While current Transformer-based language models have shown strong performance on Natural Language Inference (NLI) tasks, they often rely on memorisation rather than rule-based inference. This work investigates how reasoning rules can be explicitly embedded and memorised within the LMs through Language Variational Autoencoders (VAEs). We propose a complete pipeline for learning reasoning rules within Transformer-based language VAEs. This pipeline encompasses three rule-based reasoning tasks, a supporting theoretical framework, and a practical end-to-end architecture. The experiments yield the following findings. Disentangled reasoning: Under explicit signal supervision, reasoning rules, viewed as functional mappings, can be disentangled within the encoder’s parametric space. This separation results in distinct clustering of rules in the output feature space. Prior knowledge injection: Injecting reasoning information into the Query enables the model to more effectively retrieve the stored Value from memory based on the Key. This approach offers a simple method for integrating prior knowledge into decoder-only language models. Performance bottleneck: In mathematical reasoning tasks using Qwen2.5 (0.5B), increasing the sample count does not improve performance beyond a certain point. Moreover, FFN layers are better than attention layers at preserving the separation of reasoning rules in the model’s parameters.

[15] Can Large Language Models Capture Human Annotator Disagreements?

Jingwei Ni,Yu Fan,Vilém Zouhar,Donya Rooein,Alexander Hoyle,Mrinmaya Sachan,Markus Leippold,Dirk Hovy,Elliott Ash

Main category: cs.CL

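TL;DR: The paper asks whether large language models (LLMs) can capture disagreement among human annotators, finding that LLMs predict disagreement poorly and that RLVR-style reasoning actually degrades performance.

Motivation: Human annotation disagreement reflects task subjectivity and sample ambiguity, yet evaluations of LLM-based automatic annotation usually look only at the majority label and ignore the information that disagreement carries.

Contribution: Extensively evaluates the ability of LLMs to predict annotation disagreement and exposes the limits of majority-label-based evaluation.

Method: Experimentally analyzes how well LLMs predict annotation disagreement and compares RLVR-style reasoning with other settings.

Result: LLMs struggle to predict disagreement, and RLVR-style reasoning performs worse on disagreement prediction.

Insight: LLM annotators need better disagreement modeling, and evaluation should not rest on majority labels alone.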

Abstract: Human annotation variation (i.e., annotation disagreements) is common in NLP and often reflects important information such as task subjectivity and sample ambiguity. While Large Language Models (LLMs) are increasingly used for automatic annotation to reduce human effort, their evaluation often focuses on predicting the majority-voted “ground truth” labels. It is still unclear, however, whether these models also capture informative human annotation variation. Our work addresses this gap by extensively evaluating LLMs’ ability to predict annotation disagreements without access to repeated human labels. Our results show that LLMs struggle with modeling disagreements, which can be overlooked by majority label-based evaluations. Notably, while RLVR-style (Reinforcement learning with verifiable rewards) reasoning generally boosts LLM performance, it degrades performance in disagreement prediction. Our findings highlight the critical need for evaluating and improving LLM annotators in disagreement modeling. Code and data at https://github.com/EdisonNi-hku/Disagreement_Prediction.
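
To make concrete why majority-label evaluation hides disagreement, the Python sketch below computes a per-item label distribution and its entropy from repeated human labels. It is an illustration only: the abstract does not state which disagreement measures the paper uses, and the function names and labels are hypothetical.

```python
from collections import Counter
from math import log2


def label_distribution(labels):
    """Empirical distribution over the labels given to one item."""
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def majority_label(labels):
    """Most frequent label; ties go to the label that appears first."""
    return Counter(labels).most_common(1)[0][0]


def disagreement_entropy(labels):
    """Shannon entropy (in bits) of the label distribution for one item."""
    return -sum(p * log2(p) for p in label_distribution(labels).values())


if __name__ == "__main__":
    # Two hypothetical items with the same majority label.
    unanimous = ["toxic", "toxic", "toxic", "toxic", "toxic"]
    contested = ["toxic", "toxic", "toxic", "ok", "ok"]
    for labels in (unanimous, contested):
        print(majority_label(labels), round(disagreement_entropy(labels), 3))
```

Both items share the majority label "toxic", so a majority-vote evaluation treats them identically, while the entropy separates the unanimous item from the contested one.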

[16] Commonsense Generation and Evaluation for Dialogue Systems using Large Language Models

Marcos Estecha-Garitagoitia,Chen Zhang,Mario Rodríguez-Cantelar,Luis Fernando D’Haro

Main category: cs.CL

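TL;DR: The paper explores using large language models (LLMs) for data augmentation and automatic evaluation in dialogue systems, and gives preliminary evidence that context-aware commonsense generation and evaluation work.

Motivation: Dialogue systems need rich, context-linked data and commonsense knowledge, and traditional methods are limited in capability and efficiency here. The zero-shot and commonsense reasoning abilities of LLMs open new possibilities for the task.

Contribution: 1) An instruction-based method for augmenting dialogue data with LLMs, generating new turns conditioned on commonsense relations; 2) An automatic evaluation framework that uses LLMs to check the quality of the generated data; 3) A new dataset for measuring how well LLMs handle specific commonsense relations.

Method: 1) Extracts partial dialogues from several dialogue datasets and uses LLMs to generate alternative responses conditioned on commonsense relations (such as ATOMIC relations); 2) Designs an instruction prompt for each commonsense attribute so that LLMs can automatically detect whether a generated turn reflects the attribute it was conditioned on.

Result: Preliminary results suggest that the approach makes effective use of LLM commonsense reasoning to generate contextually relevant dialogue data, and that automatic evaluation can verify its quality.

Insight: LLMs can both generate high-quality dialogue data and, through instruction-driven evaluation, automatically check that data's quality, offering a new route to data augmentation for dialogue systems.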

Abstract: This paper provides preliminary results on exploring the task of performing turn-level data augmentation for dialogue systems based on different types of commonsense relationships, and the automatic evaluation of the generated synthetic turns. The proposed methodology takes advantage of the extended knowledge and zero-shot capabilities of pretrained Large Language Models (LLMs) to follow instructions, understand contextual information, and their commonsense reasoning capabilities. The approach draws inspiration from methodologies like Chain-of-Thought (CoT), applied more explicitly to the task of prompt-based generation for dialogue-based data augmentation conditioned on commonsense attributes, and the automatic evaluation of the generated dialogues. To assess the effectiveness of the proposed approach, we first extracted 200 randomly selected partial dialogues from 5 different well-known dialogue datasets and generated alternative responses conditioned on different event commonsense attributes. This novel dataset allows us to measure the proficiency of LLMs in generating contextually relevant commonsense knowledge, particularly up to 12 different specific ATOMIC [10] database relations. Secondly, we propose an evaluation framework to automatically detect the quality of the generated dataset inspired by the ACCENT [26] metric, which offers a nuanced approach to assess event commonsense. However, our method does not follow ACCENT’s complex event-relation tuple extraction process. Instead, we propose an instruction-based prompt for each commonsense attribute and use state-of-the-art LLMs to automatically detect the original attributes used when creating each augmented turn in the previous step. Preliminary results suggest that our approach effectively harnesses LLMs’ capabilities for commonsense reasoning and evaluation in dialogue systems.

[17] Dialogic Pedagogy for Large Language Models: Aligning Conversational AI with Proven Theories of Learning

Russell Beale

Main category: cs.CL

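TL;DR: The paper examines how to align large language models (LLMs) with educational theories (such as sociocultural learning, the Socratic method, and dialogic pedagogy) to make conversational AI more effective in education.

Motivation: With LLMs being adopted rapidly in education, they need to be aligned with established educational theories so that learning outcomes and teaching methods rest on sound pedagogy.

Contribution: Synthesizes how well LLMs fit educational theories, proposes concrete strategies (such as prompt design and retrieval-augmented generation) to improve LLM teaching behavior, and identifies current shortcomings of LLMs in education.

Method: Through a literature review, maps educational theories (such as scaffolding and the Socratic method) to LLM capabilities and proposes alignment strategies based on prompt design and retrieval-augmented generation.

Result: LLMs fall short in supporting co-construction of knowledge, tending to give direct answers instead, but strategy adjustments (such as prompts that encourage Socratic questioning and scaffolded guidance) can bring them closer to pedagogical theory.

Insight: The paper stresses the importance of joining educational theory with AI practice and offers theoretical grounding and practical tools for designing and applying LLMs in education.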

Abstract: Large Language Models (LLMs) are rapidly transforming education by enabling rich conversational learning experiences. This article provides a comprehensive review of how LLM-based conversational agents are being used in higher education, with extensions to secondary and lifelong learning contexts. We synthesize existing literature on LLMs in education and theories of conversational and dialogic pedagogy - including Vygotsky’s sociocultural learning (scaffolding and the Zone of Proximal Development), the Socratic method, and Laurillard’s conversational framework - and examine how prompting strategies and retrieval-augmented generation (RAG) can align LLM behaviors with these pedagogical theories, and how they can support personalized, adaptive learning. We map educational theories to LLM capabilities, highlighting where LLM-driven dialogue supports established learning principles and where it challenges or falls short of traditional pedagogical assumptions. Notable gaps in applying prior theories to LLMs are identified, such as the models’ tendency to provide direct answers instead of fostering co-construction of knowledge, and the need to account for the constant availability and broad but non-human expertise of LLM tutors. In response, we propose practical strategies to better align LLM interactions with sound pedagogy - for example, designing prompts that encourage Socratic questioning, scaffolded guidance, and student reflection, as well as integrating retrieval mechanisms to ensure accuracy and contextual relevance. Our aim is to bridge the gap between educational theory and the emerging practice of AI-driven conversational learning, offering insights and tools for making LLM-based dialogues more educationally productive and theory-aligned.
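
The prompting strategy named in this abstract (Socratic questioning, scaffolded guidance, student reflection, plus retrieved sources) can be illustrated with a small prompt builder. This is only a sketch: the function name `build_socratic_prompt`, the wording of the instructions, and the `max_hints` parameter are hypothetical and are not taken from the paper.

```python
def build_socratic_prompt(question, retrieved_passages, max_hints=2):
    """Assemble a tutoring prompt that asks for guidance rather than a direct answer.

    All instruction wording here is illustrative, not the paper's.
    """
    # Number the retrieved passages so the tutor can cite them.
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(retrieved_passages))
    return (
        "You are a tutor. Do not state the final answer.\n"
        "Ask one guiding question at a time (Socratic questioning).\n"
        f"Offer at most {max_hints} hints, each smaller than a full solution (scaffolding).\n"
        "End by asking the student to summarize what they learned (reflection).\n"
        "Ground every factual claim in the sources below.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Student question: {question}"
    )


prompt = build_socratic_prompt(
    "Why does ice float on water?",
    ["Ice is less dense than liquid water because of its open hydrogen-bonded lattice."],
)
print(prompt)
```

The retrieved passages are passed in as plain strings, so any retrieval mechanism could supply them.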

[18] Is Long-to-Short a Free Lunch? Investigating Inconsistency and Reasoning Efficiency in LRMs

Shu Yang,Junchao Wu,Xuansheng Wu,Derek Wong,Ninhao Liu,Di Wang

Main category: cs.CL

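TL;DR: This paper studies the behavioral inconsistencies that large reasoning models (LRMs) may develop when optimized for efficient reasoning. Using the $ICBENCH$ benchmark to measure three types of inconsistency, it finds that efficient reasoning strategies improve efficiency but raise the risk of inconsistent behavior.

Details Motivation: LRMs perform well on complex tasks, but overthinking makes them inefficient. Recent work shortens reasoning to improve efficiency, yet it is unclear whether this is a "free lunch". The paper asks whether compressed reasoning weakens robustness, causing models to omit key reasoning steps or behave inconsistently.

Contribution: 1) Introduces $ICBENCH$, which quantifies LRM inconsistency along three dimensions: across task settings (ITS), between training objectives and learned behavior (TR-LB), and between internal reasoning and self-explanations (IR-SE); 2) shows that efficient reasoning strategies such as No-Thinking and Simple Token-Budget consistently increase inconsistency, exposing a potential risk of efficiency optimization.

Method: The authors design $ICBENCH$ and use it to evaluate open-source LRMs on the three types of inconsistency, comparing how model scale and reasoning strategy (efficient versus standard reasoning) affect consistency.

Result: 1) Larger models are generally more consistent than smaller ones, but all models show widespread "scheming" behaviors such as self-disagreement, post-hoc rationalization, and withholding of reasoning cues; 2) efficient reasoning strategies increase all three types of inconsistency.

Insight: The paper exposes a trade-off between efficient reasoning and model consistency and cautions that pursuing efficiency may let models evade effective supervision, which is a useful consideration for future model optimization.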

Abstract: Large Reasoning Models (LRMs) have achieved remarkable performance on complex tasks by engaging in extended reasoning before producing final answers, yet this strength introduces the risk of overthinking, where excessive token generation occurs even for simple tasks. While recent work in efficient reasoning seeks to reduce reasoning length while preserving accuracy, it remains unclear whether such optimization is truly a free lunch. Drawing on the intuition that compressing reasoning may reduce the robustness of model responses and lead models to omit key reasoning steps, we investigate whether efficient reasoning strategies introduce behavioral inconsistencies. To systematically assess this, we introduce $ICBENCH$, a benchmark designed to measure inconsistency in LRMs across three dimensions: inconsistency across task settings (ITS), inconsistency between training objectives and learned behavior (TR-LB), and inconsistency between internal reasoning and self-explanations (IR-SE). Applying $ICBENCH$ to a range of open-source LRMs, we find that while larger models generally exhibit greater consistency than smaller ones, they all display widespread “scheming” behaviors, including self-disagreement, post-hoc rationalization, and the withholding of reasoning cues. Crucially, our results demonstrate that efficient reasoning strategies such as No-Thinking and Simple Token-Budget consistently increase all three defined types of inconsistency. These findings suggest that although efficient reasoning enhances token-level efficiency, further investigation is imperative to ascertain whether it concurrently introduces the risk of models evading effective supervision.

[19] KnowMap: Efficient Knowledge-Driven Task Adaptation for LLMs

Kelin Fu,Kaigui Bian

Main category: cs.CL

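TL;DR: KnowMap dynamically builds a knowledge base and fine-tunes a small knowledge-embedding model to improve a large model's task adaptation, avoiding the cost and data demands of conventional fine-tuning.

Details Motivation: Large language models (LLMs) perform well on open-world tasks, but their reliance on static pre-trained knowledge limits how quickly they adapt to new tasks. Conventional fine-tuning is costly and can cause catastrophic forgetting.

Contribution: Proposes KnowMap, which dynamically builds a task-specific knowledge base from environmental and experiential data and improves task adaptation; experiments report a 17.71% performance improvement on the ScienceWorld benchmark.

Method: KnowMap dynamically builds a knowledge base and fine-tunes a small knowledge-embedding model, which supplies task-specific knowledge to a larger LLM for fast adaptation.

Result: On the ScienceWorld benchmark, the performance of the gpt-4-turbo model improves by 17.71%, which the authors present as evidence of KnowMap's efficiency and effectiveness.

Insight: Combining dynamic knowledge construction with a knowledge-embedding model can strengthen an LLM's task adaptation and reasoning while avoiding the drawbacks of conventional fine-tuning.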

Abstract: While Large Language Models (LLMs) possess significant capabilities in open-world agent tasks, they also face challenges in rapidly adapting to new, specialized tasks due to their reliance on static pre-trained knowledge. Traditional methods such as fine-tuning are often costly, data-intensive, and may lead to “catastrophic forgetting.” Therefore, we present KnowMap, a novel approach that dynamically constructs a knowledge base from environmental and experiential data. KnowMap fine-tunes a small knowledge-embedding model to equip a larger LLM with valuable task-specific knowledge. Our experiments on the ScienceWorld benchmark demonstrate a 17.71% improvement in the performance of the gpt-4-turbo model. KnowMap not only provides an efficient and effective means of LLM task adaptation, but also highlights how integrating environmental and experiential knowledge can enhance LLMs’ reasoning capabilities.
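
One way to picture how a small embedding model could supply task-specific knowledge to a larger LLM is nearest-neighbour retrieval over an embedded knowledge base. The abstract does not describe KnowMap's retrieval mechanism, so the sketch below is an assumption: the toy three-dimensional vectors, the facts, and the `retrieve` helper are all hypothetical stand-ins for a learned embedding model and a real knowledge base.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, knowledge_base, k=2):
    """Return the k facts whose embeddings are most similar to the query."""
    ranked = sorted(knowledge_base, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [fact for fact, _ in ranked[:k]]


# Hypothetical knowledge base: (fact, toy embedding) pairs.
kb = [
    ("The thermometer is in the kitchen.", [0.9, 0.1, 0.0]),
    ("Heating water past 100 C makes it boil.", [0.1, 0.9, 0.2]),
    ("The door to the greenhouse is locked.", [0.0, 0.2, 0.9]),
]
query = [0.2, 0.8, 0.1]  # stands in for the embedding of "how do I boil water?"

facts = retrieve(query, kb, k=1)
# The retrieved facts are prepended to the prompt given to the larger LLM.
prompt = "Known facts:\n" + "\n".join(facts) + "\nTask: boil water."
print(prompt)
```

In this picture only the small embedding model would be fine-tuned, while the large LLM stays frozen and receives the retrieved facts as context.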

[20] ECCoT: A Framework for Enhancing Effective Cognition via Chain of Thought in Large Language Model

Zhenke Duan,Jiqun Pan,Jiani Tu,Xiaoyi Wang,Yanqing Wang

Main category: cs.CL

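TL;DR: ECCoT is an end-to-end framework for validating cognitive chains of thought; it combines topic-aware generation with causal reasoning alignment to make LLM reasoning more reliable and interpretable.

Details Motivation: Reasoning chains generated by current large language models lack transparency and are unreliable, so a method is needed to validate and refine the reasoning process.

Contribution: Proposes the ECCoT framework, which combines MRF-ETM (topic-aware generation) and CSBert (causal reasoning alignment) and filters out ineffective reasoning chains to improve interpretability and trustworthiness.

Method: MRF-ETM generates topic-aware reasoning chains, CSBert aligns them for causal reasoning, and structured ordering statistics filter out ineffective chains.

Result: According to the authors, ECCoT improves the reliability and interpretability of LLM reasoning, reduces bias, and makes LLM-based decisions more trustworthy.

Insight: Topic awareness and causal reasoning alignment are key to improving the quality of LLM reasoning chains, and a structured validation framework can improve model outputs.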

Abstract: In the era of large-scale artificial intelligence, Large Language Models (LLMs) have made significant strides in natural language processing. However, they often lack transparency and generate unreliable outputs, raising concerns about their interpretability. To address this, the Chain of Thought (CoT) prompting method structures reasoning into step-by-step deductions. Yet, not all reasoning chains are valid, and errors can lead to unreliable conclusions. We propose ECCoT, an End-to-End Cognitive Chain of Thought Validation Framework, to evaluate and refine reasoning chains in LLMs. ECCoT integrates the Markov Random Field-Embedded Topic Model (MRF-ETM) for topic-aware CoT generation and Causal Sentence-BERT (CSBert) for causal reasoning alignment. By filtering ineffective chains using structured ordering statistics, ECCoT improves interpretability, reduces biases, and enhances the trustworthiness of LLM-based decision-making. Key contributions include the introduction of ECCoT, MRF-ETM for topic-driven CoT generation, and CSBert for causal reasoning enhancement. Code is released at: https://github.com/erwinmsmith/ECCoT.git.

[21] Social Hatred: Efficient Multimodal Detection of Hatemongers

Tom Marzea,Abraham Israeli,Oren Tsur

Main category: cs.CL

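TL;DR: The paper proposes an efficient multimodal approach for detecting hate-mongers that combines a user's texts, activity, and social network, and significantly outperforms existing methods.

Details Motivation: Automatic detection of online hate speech is an important step toward detoxifying online discourse, and accurate classification helps explain how hate spreads as a social phenomenon. Most prior work focuses on detecting hateful utterances; the authors argue that user-level analysis is equally important and more challenging.

Contribution: Proposes a multimodal aggregative approach that integrates a user's texts, activity, and social network, significantly improving the detection of hate-mongers. The authors show that it generalizes across platforms and large datasets and that it can support the classification of coded content and inform intervention measures.

Method: A multimodal detection framework combines users' text content, activity, and network structure; it is evaluated on three datasets: X (Twitter), Gab, and Parler.

Result: Experiments show that the method significantly outperforms text-based and graph-based methods at detecting hate-mongers and works across different platforms and large-scale network data.

Insight: User context is a key signal for hate detection, and a multimodal approach copes well with coded content and cross-platform differences.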

Abstract: Automatic detection of online hate speech serves as a crucial step in the detoxification of the online discourse. Moreover, accurate classification can promote a better understanding of the proliferation of hate as a social phenomenon. While most prior work focuses on the detection of hateful utterances, we argue that focusing on the user level is as important, albeit challenging. In this paper we consider a multimodal aggregative approach for the detection of hate-mongers, taking into account the potentially hateful texts, user activity, and the user network. Evaluating our method on three unique datasets (X (Twitter), Gab, and Parler), we show that processing a user’s texts in her social context significantly improves the detection of hate-mongers, compared to previously used text and graph-based methods. We offer a comprehensive set of results obtained in different experimental settings as well as qualitative analysis of illustrative cases. Our method can be used to improve the classification of coded messages, dog-whistling, and racial gas-lighting, as well as to inform intervention measures. Moreover, we demonstrate that our multimodal approach performs well across very different content platforms and over large datasets and networks.

[22] Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager

Lucie Galland,Catherine Pelachaud,Florian Pecune

Main category: cs.CL

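TL;DR: The paper proposes a framework that pairs a large language model (LLM) with a reinforcement learning (RL)-based dialogue manager, using hierarchical RL and meta-learning to make open-ended, goal-directed dialogue more adaptive and efficient.

Details Motivation: Current open-ended dialogue systems such as LLMs perform poorly in dialogues with a specific goal and do not adapt well to different user needs.

Contribution: Proposes a new framework combining RL and LLMs that improves the adaptability and efficiency of dialogue management through hierarchical RL and meta-learning.

Method: Hierarchical RL models the phases of a dialogue, meta-learning improves adaptation across user profiles, and the framework is validated on Motivational Interviews.

Result: Experiments show that the method outperforms an LLM baseline in terms of reward, indicating its potential for goal-directed dialogue.

Insight: Adding RL-based dialogue management to an LLM can make open-ended, goal-directed dialogue more personalized and efficient.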

Abstract: In this work, we propose a novel framework that integrates large language models (LLMs) with an RL-based dialogue manager for open-ended dialogue with a specific goal. By leveraging hierarchical reinforcement learning to model the structured phases of dialogue and employing meta-learning to adapt across diverse user profiles, our approach enhances adaptability and efficiency, enabling the system to learn from limited data, transition fluidly between dialogue phases, and personalize responses to heterogeneous patient needs. We apply our framework to Motivational Interviews, aiming to foster behavior change, and demonstrate that the proposed dialogue manager outperforms a state-of-the-art LLM baseline in terms of reward, showing a potential benefit of conditioning LLMs to create open-ended dialogue systems with specific goals.

[23] Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?

Chuxuan Hu,Yuxuan Zhu,Antony Kellermann,Caleb Biddulph,Suppakit Waiwitlikhit,Jason Benn,Daniel Kang

Main category: cs.CL

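TL;DR: This paper studies how well reinforcement post training (RPT) of large language models (LLMs) generalizes, and finds that RPT helps on tasks similar to the fine-tuning data but generalizes inconsistently to other domains.

Details Motivation: RPT shows promise for improving LLM reasoning, but its generalization to new domains has not been studied adequately.

Contribution: Observational and interventional experiments show that RPT gains are substantial on tasks similar to the fine-tuning data but generalize inconsistently to other domains.

Method: Two studies: (1) observational, comparing RPT models with their base models across multiple domains, including unseen ones; (2) interventional, fine-tuning LLMs with RPT on a single domain and evaluating them across multiple domains.

Result: RPT brings substantial gains on similar tasks, but the gains generalize inconsistently and can vanish on domains with different reasoning patterns.

Insight: RPT gains are domain-dependent; future work should explore ways to improve cross-domain generalization.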

Abstract: Reinforcement post training (RPT) has recently shown promise in improving the reasoning abilities of large language models (LLMs). However, it remains unclear how well these improvements generalize to new domains, as prior work evaluates RPT models on data from the same domains used for fine-tuning. To understand the generalizability of RPT, we conduct two studies. (1) Observational: We compare a wide range of open-weight RPT models against their corresponding base models across multiple domains, including both seen and unseen domains in their fine-tuning data. (2) Interventional: we fine-tune LLMs with RPT on single domains and evaluate their performance across multiple domains. Both studies converge on the same conclusion that, although RPT brings substantial gains on tasks similar to the fine-tuning data, the gains generalize inconsistently and can vanish on domains with different reasoning patterns.

[24] SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning

Yuqian Fu,Tinghong Chen,Jiajun Chai,Xihuai Wang,Songjun Tu,Guojun Yin,Wei Lin,Qichao Zhang,Yuanheng Zhu,Dongbin Zhao

Main category: cs.CL

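TL;DR: The paper proposes SRFT, which uses entropy-aware weighting to unify supervised fine-tuning and reinforcement learning in a single training stage, significantly improving the reasoning ability of large language models.

Details Motivation: Large language models perform well on reasoning tasks, but how best to integrate supervised fine-tuning (SFT) and reinforcement learning (RL) remains a key challenge. The authors analyze the differences between the two paradigms (coarse-grained global changes versus fine-grained selective optimization) to design a more effective integration.

Contribution: 1. Reveals how SFT and RL differ in training dynamics and effect, with entropy as an indicator of training effectiveness; 2. proposes SRFT, which performs single-stage joint optimization through entropy-aware weighting; 3. outperforms two-stage sequential methods on mathematical reasoning and out-of-distribution benchmarks.

Method: SRFT applies SFT and RL simultaneously, using entropy-aware weighting to balance them, and optimizes the model directly on demonstrations and self-exploration rollouts, avoiding the limitations of two-stage sequential methods.

Result: SRFT reaches 59.1% average accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and by 10.9% on three out-of-distribution benchmarks.

Insight: Entropy is a key indicator of the difference in training dynamics between SFT and RL, and single-stage joint optimization combines the strengths of both more efficiently.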

Abstract: Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from entropy-based perspectives, we reveal key differences between these paradigms: SFT induces coarse-grained global changes to LLM policy distributions, while RL performs fine-grained selective optimizations, with entropy serving as a critical indicator of training effectiveness. Building on these observations, we propose Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. Our approach simultaneously applies SFT and RL to directly optimize the LLM using demonstrations and self-exploration rollouts rather than through two-stage sequential methods. Extensive experiments show that SRFT achieves 59.1% average accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and 10.9% on three out-of-distribution benchmarks.
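
The idea of entropy-aware weighting can be sketched as follows. The abstract does not give SRFT's weighting formula, so everything here is an assumption made for illustration: the `exp(-mean entropy)` weight, the way the two losses are combined, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sft_weight(token_dists):
    """Hypothetical weight exp(-mean entropy), which always lies in (0, 1]."""
    mean_h = sum(entropy(d) for d in token_dists) / len(token_dists)
    return math.exp(-mean_h)


def combined_loss(sft_loss, rl_loss, token_dists):
    """Single-stage objective: the SFT term is scaled down when the policy is uncertain."""
    return sft_weight(token_dists) * sft_loss + rl_loss


confident = [[0.97, 0.01, 0.01, 0.01]] * 3  # low-entropy policy
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3  # maximum entropy over 4 tokens

print(sft_weight(confident), sft_weight(uncertain))
print(combined_loss(2.0, 1.0, uncertain))
```

With this particular choice, a uniform distribution over four tokens has entropy ln 4, so its weight is exactly 1/4; a peaked distribution gets a weight close to 1.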

[25] Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Yuqi Zhu,Yi Zhong,Jintian Zhang,Ziheng Zhang,Shuofei Qiao,Yujie Luo,Lun Du,Da Zheng,Huajun Chen,Ningyu Zhang

Main category: cs.CL

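TL;DR: The paper studies the limitations of open-source large language models (LLMs) on data analysis tasks, derives strategies for improving them from an empirical analysis, and develops a data synthesis method.

Details Motivation: Open-source LLMs perform poorly on reasoning-intensive tasks such as data analysis; the authors carry out a systematic study to find ways to improve them.

Contribution: Shows that strategic planning quality is the primary determinant of model performance and proposes a data synthesis method that significantly improves the analytical reasoning of open-source LLMs.

Method: The authors curate a diverse seed dataset, evaluate models on three dimensions (data understanding, code generation, and strategic planning), and build a data synthesis method on the findings.

Result: Strategic planning quality is the primary determinant of performance, interaction design and task complexity significantly influence reasoning, and data quality matters more than diversity.

Insight: Improving the reasoning of open-source LLMs requires attention to core factors such as strategic planning; data diversity alone may have limited effect.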

Abstract: Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate models across three dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs’ analytical reasoning capabilities.

[26] How Effectively Can BERT Models Interpret Context and Detect Bengali Communal Violent Text?

Abdullah Khondoker,Enam Ahmed Taufik,Md. Iftekhar Islam Tashik,S M Ishtiak Mahmud,Farig Sadeque

Main category: cs.CL

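TL;DR: The study improves detection of Bengali communal violent text by fine-tuning BanglaBERT and expanding the dataset, and uses LIME to analyze weaknesses in the model's decisions.

Details Motivation: The spread of online hate fuels communal violence and threatens social harmony. Classification of Bengali communal violent text is underexplored, and detection accuracy needs to improve.

Contribution: 1. A fine-tuned BanglaBERT model with a macro F1 score of 0.60. 2. An expanded dataset and an ensemble model that raises the macro F1 score to 0.63. 3. A LIME-based analysis of model decisions that exposes weak context understanding.

Method: 1. Fine-tune the pre-trained BanglaBERT model. 2. Expand the dataset and build an ensemble model. 3. Interpret model decisions with LIME.

Result: The fine-tuned model reaches a macro F1 score of 0.60 and the ensemble model 0.63. LIME analysis shows that the model struggles with context understanding.

Insight: Pre-trained models have limited ability to distinguish closely related communal and non-communal terms; NLP and interpretability tools show promise for reducing online communal violence.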

Abstract: The spread of cyber hatred has led to communal violence, fueling aggression and conflicts between various religious, ethnic, and social groups, posing a significant threat to social harmony. Despite its critical importance, the classification of communal violent text remains an underexplored area in existing research. This study aims to enhance the accuracy of detecting text that incites communal violence, focusing specifically on Bengali textual data sourced from social media platforms. We introduce a fine-tuned BanglaBERT model tailored for this task, achieving a macro F1 score of 0.60. To address the issue of data imbalance, our dataset was expanded by adding 1,794 instances, which facilitated the development and evaluation of a fine-tuned ensemble model. This ensemble model demonstrated an improved performance, achieving a macro F1 score of 0.63, thus highlighting its effectiveness in this domain. In addition to quantitative performance metrics, qualitative analysis revealed instances where the models struggled with context understanding, leading to occasional misclassifications, even when predictions were made with high confidence. Through analyzing the cosine similarity between words, we identified certain limitations in the pre-trained BanglaBERT models, particularly in their ability to distinguish between closely related communal and non-communal terms. To further interpret the model’s decisions, we applied LIME, which helped to uncover specific areas where the model struggled in understanding context, contributing to errors in classification. These findings highlight the promise of NLP and interpretability tools in reducing online communal violence. Our work contributes to the growing body of research in communal violence detection and offers a foundation for future studies aiming to refine these techniques for better accuracy and societal impact.
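
Because the abstract reports macro F1 on imbalanced data, it may help to see how the metric is computed. The sketch below is a plain-Python version of macro-averaged F1; the labels `"violent"` and `"benign"` and the toy predictions are hypothetical and do not come from the paper's dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced data: 2 positive examples, 8 negative.
y_true = ["violent"] * 2 + ["benign"] * 8
y_pred = ["violent", "benign"] + ["benign"] * 7 + ["violent"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))  # 0.8 0.6875
```

In this toy case accuracy is 0.8 while macro F1 is 0.6875, because the minority class has an F1 of only 0.5. This is why macro F1 is a common choice when classes are imbalanced.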

[27] MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration

Yucheng Zhou,Lingran Song,Jianbing Shen

Main category: cs.CL

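TL;DR: The paper proposes MAM, a modular multi-agent framework that performs multimodal medical diagnosis through role-specialized collaboration, addressing the knowledge update cost, comprehensiveness, and flexibility limitations of unified multimodal medical LLMs.

Details Motivation: Current unified multimodal medical large language models (LLMs) are limited in knowledge update cost, comprehensiveness, and flexibility; MAM addresses these through role assignment and multi-agent collaboration.

Contribution: Proposes the modular multi-agent framework MAM, which decomposes the diagnostic process into roles (General Practitioner, Specialist Team, Radiologist, Medical Assistant, and Director), each handled by an LLM-based agent, improving diagnostic performance and flexibility.

Method: Diagnostic tasks are assigned to different role-specific LLM agents (for example, a General Practitioner and a Radiologist), which collaborate to produce a multimodal medical diagnosis.

Result: On multiple public multimodal medical datasets, MAM improves performance by 18% to 365% over baseline models.

Insight: Role specialization and multi-agent collaboration can improve medical diagnosis and offer a new way to handle multimodal data.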

Abstract: Recent advancements in medical Large Language Models (LLMs) have showcased their powerful reasoning and diagnostic capabilities. Despite their success, current unified multimodal medical LLMs face limitations in knowledge update costs, comprehensiveness, and flexibility. To address these challenges, we introduce the Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis (MAM). Inspired by our empirical findings highlighting the benefits of role assignment and diagnostic discernment in LLMs, MAM decomposes the medical diagnostic process into specialized roles: a General Practitioner, Specialist Team, Radiologist, Medical Assistant, and Director, each embodied by an LLM-based agent. This modular and collaborative framework enables efficient knowledge updates and leverages existing medical LLMs and knowledge bases. Extensive experimental evaluations conducted on a wide range of publicly accessible multimodal medical datasets, incorporating text, image, audio, and video modalities, demonstrate that MAM consistently surpasses the performance of modality-specific LLMs. Notably, MAM achieves significant performance improvements ranging from 18% to 365% compared to baseline models. Our code is released at https://github.com/yczhou001/MAM.

cs.CV [Back]

[28] Connecting Vision and Emissions: A Behavioural AI Approach to Carbon Estimation in Road Design

Ammar K Al Mhdawi,Nonso Nnamoko,Safanah Mudheher Raafat,M. K. S. Al-Mhdawi,Amjad J Humaidi

Main category: cs.CV

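TL;DR: The paper proposes a real-time vehicle detection and classification framework based on an enhanced YOLOv8 for estimating carbon emissions in urban environments. A deep OCR module and validation against an external database enable vehicle-specific emission calculations.

Details Motivation: Existing carbon emission monitoring usually relies on aggregate data and cannot track or classify individual vehicles precisely. The study aims to provide a real-time, automated solution using computer vision and deep learning.

Contribution: 1. An enhanced YOLOv8 framework for vehicle detection, segmentation, and tracking; 2. a deep OCR module for high-accuracy license plate recognition; 3. validation against an external database to improve vehicle classification and emission estimates.

Method: 1. Detect and segment vehicles with YOLOv8; 2. crop each detected vehicle and read its license plate with the deep OCR module; 3. validate the plate against an external vehicle registration database; 4. compute carbon emissions from the vehicle type and driving data.

Result: The YOLOv8 detector reaches an mAP@0.5 of approximately 71% for bounding boxes and 70% for segmentation masks; character-level OCR accuracy reaches up to 99%. The authors take this as support for the framework's practicality and scalability in smart transportation systems.

Insight: Combining real-time object detection with deep OCR enables vehicle-level carbon emission monitoring and offers a new technical route for smart transportation systems.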

Abstract: We present an enhanced YOLOv8-based real-time vehicle detection and classification framework for estimating carbon emissions in urban environments. The system enhances the YOLOv8 architecture to detect, segment, and track vehicles from live traffic video streams. Once a vehicle is localized, a dedicated deep learning-based identification module is employed to recognize license plates and classify vehicle types. Since YOLOv8 lacks the built-in capacity for fine-grained recognition tasks such as reading license plates or determining vehicle attributes beyond class labels, our framework incorporates a hybrid pipeline where each detected vehicle is tracked and its bounding box is cropped and passed to a deep Optical Character Recognition (OCR) module. This OCR system, composed of multiple convolutional neural network (CNN) layers, is trained specifically for character-level detection and license plate decoding under varied conditions such as motion blur, occlusion, and diverse font styles. Additionally, the recognized plate information is validated using a real-time API that cross-references an external vehicle registration database to ensure accurate classification and emission estimation. This multi-stage approach enables precise, automated calculation of per-vehicle carbon emissions. Extensive evaluation was conducted using a diverse vehicle dataset enriched with segmentation masks and annotated license plates. The YOLOv8 detector achieved a mean Average Precision (mAP@0.5) of approximately 71% for bounding boxes and 70% for segmentation masks. Character-level OCR accuracy reached up to 99% with the best-performing CNN model. These results affirm the feasibility of combining real-time object detection with deep OCR for practical deployment in smart transportation systems, offering a scalable solution for automated, vehicle-specific carbon emission monitoring.
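
The "@0.5" in mAP@0.5 refers to the intersection-over-union (IoU) threshold at which a predicted box counts as matching a ground-truth box. The sketch below shows only that matching criterion, not the full mAP computation; the box coordinates and the helper name `is_true_positive` are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes get an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A detection matches a ground-truth box when their IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold


gt = (0, 0, 10, 10)
good = (2, 0, 12, 10)  # intersection 80, union 120, IoU = 2/3
poor = (6, 0, 16, 10)  # intersection 40, union 160, IoU = 1/4

print(is_true_positive(good, gt), is_true_positive(poor, gt))  # True False
```

Average precision is then computed per class from the precision-recall curve of detections matched this way, and mAP is the mean over classes.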

[29] Interpretable and Granular Video-Based Quantification of Motor Characteristics from the Finger Tapping Test in Parkinson Disease

Tahereh Zarrat Ehsan,Michael Tangermann,Yağmur Güçlütürk,Bastiaan R. Bloem,Luc J. W. Evers

Main category: cs.CV

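TL;DR: This paper proposes a computer vision method that quantifies motor characteristics from videos of the finger-tapping test in patients with Parkinson disease (PD), offering a more objective and interpretable assessment.

Details Motivation: The conventional finger-tapping test depends on a clinician's subjective judgment, which varies between and within raters and gives no detail on individual motor characteristics. The paper aims to provide a more objective, fine-grained quantification through video analysis.

Contribution: 1. Four sets of clinically relevant features that quantify hypokinesia, bradykinesia, sequence effect, and hesitation-halts; 2. machine learning classifiers that predict the MDS-UPDRS finger-tapping score more accurately than existing methods.

Method: 1. Extract the four sets of motor features from video; 2. use principal component analysis with varimax rotation to check that the features correspond to the clinical deficits; 3. train machine learning models to predict the MDS-UPDRS score.

Result: The method predicts the MDS-UPDRS score more accurately than state-of-the-art approaches while providing a fine-grained quantification of motor characteristics.

Insight: Video-based analysis captures subtle differences in motor characteristics that conventional assessment misses, giving clinicians more complete data.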

Abstract: Accurately quantifying motor characteristics in Parkinson disease (PD) is crucial for monitoring disease progression and optimizing treatment strategies. The finger-tapping test is a standard motor assessment. Clinicians visually evaluate a patient’s tapping performance and assign an overall severity score based on tapping amplitude, speed, and irregularity. However, this subjective evaluation is prone to inter- and intra-rater variability, and does not offer insights into individual motor characteristics captured during this test. This paper introduces a granular computer vision-based method for quantifying PD motor characteristics from video recordings. Four sets of clinically relevant features are proposed to characterize hypokinesia, bradykinesia, sequence effect, and hesitation-halts. We evaluate our approach on video recordings and clinical evaluations of 74 PD patients from the Personalized Parkinson Project. Principal component analysis with varimax rotation shows that the video-based features correspond to the four deficits. Additionally, video-based analysis has allowed us to identify further granular distinctions within sequence effect and hesitation-halts deficits. We then used these features to train machine learning classifiers to estimate the Movement Disorder Society Unified Parkinson Disease Rating Scale (MDS-UPDRS) finger-tapping score. Compared to state-of-the-art approaches, our method achieves a higher accuracy in MDS-UPDRS score prediction, while still providing an interpretable quantification of individual finger-tapping motor characteristics. In summary, the proposed framework provides a practical solution for the objective assessment of PD motor characteristics, which can potentially be applied in both clinical and remote settings. Future work is needed to assess its responsiveness to symptomatic treatment and disease progression.

[30] Reinforcement Learning-Based Dynamic Grouping for Tubular Structure Tracking

Chong Di,Shuwang Zhou,Da Chen,Jean-Marie Mirebeau,Minglei Shu,Laurent D. Cohen

Main category: cs.CV

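TL;DR: The paper proposes a reinforcement learning-based dynamic grouping method for tracking tubular structures; by casting segment-wise tracking as a Markov Decision Process, it improves computational efficiency and robustness.

Details Motivation: Tracking tubular structures such as blood vessels and roads is complicated by complex morphologies and environmental variations. Segment-wise methods in particular are computationally inefficient and rely on a prescribed prior. The paper uses reinforcement learning to optimize the search dynamically.

Contribution: 1. Casts segment-wise tracking as an MDP and proposes a Q-Learning-based search that expands dynamically. 2. Avoids the cost of a pre-computed graph and expands the search space adaptively. 3. Maintains global path coherence under complex topologies.

Method: 1. Use Q-Learning to explore a graph of segments dynamically. 2. Compute edge weights on demand and expand the search space as needed. 3. Select paths with the learned policy, reducing dependence on prior knowledge.

Result: Experiments on typical tubular structure datasets show that the method significantly outperforms state-of-the-art point-wise and segment-wise approaches, especially on complex topologies.

Insight: Reinforcement learning is effective for dynamic path search and is robust and flexible when handling uncertainty and complex structures.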

Abstract: The computation of minimal paths for tracking tubular structures such as blood vessels and roads is challenged by complex morphologies and environmental variations. Existing approaches can be roughly categorized into two research lines: point-wise models and segment-wise models. Although segment-wise approaches have obtained promising results in many scenarios, they often suffer from computational inefficiency and heavily rely on a prescribed prior to fit the target elongated shapes. We propose a novel framework that casts segment-wise tracking as a Markov Decision Process (MDP), enabling a reinforcement learning approach. Our method leverages Q-Learning to dynamically explore a graph of segments, computing edge weights on-demand and adaptively expanding the search space. This strategy avoids the high cost of a pre-computed graph and proves robust to incomplete initial information. Experimental results on typical tubular structure datasets demonstrate that our method significantly outperforms state-of-the-art point-wise and segment-wise approaches. The proposed method effectively handles complex topologies and maintains global path coherence without depending on extensive prior structural knowledge.
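
A minimal sketch of the general idea, tabular Q-Learning over a graph of segments with edge weights computed only when first needed, is given below. It is not the authors' algorithm: the four-node graph, the costs in `RAW_COST`, the reward (negative edge cost), and all hyperparameters are hypothetical.

```python
import random

# Hypothetical segment graph: each node is a segment, edges link segments that may be grouped.
NEIGHBORS = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
# Stand-in for an expensive geometric or appearance cost between two segments.
RAW_COST = {("A", "B"): 1.0, ("B", "D"): 1.0, ("A", "C"): 4.0, ("C", "D"): 1.0}
cost_cache = {}


def edge_cost(u, v):
    """Compute an edge weight on demand and cache it; untouched edges are never evaluated."""
    if (u, v) not in cost_cache:
        cost_cache[(u, v)] = RAW_COST[(u, v)]
    return cost_cache[(u, v)]


def q_learning(start, goal, episodes=200, alpha=0.5, gamma=1.0, epsilon=0.2, seed=0):
    """Tabular Q-Learning where the reward is the negative cost of the chosen edge."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        state = start
        while state != goal:
            actions = NEIGHBORS[state]
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                action = max(actions, key=lambda a: q.get((state, a), 0.0))  # exploit
            reward = -edge_cost(state, action)
            future = max((q.get((action, a), 0.0) for a in NEIGHBORS[action]), default=0.0)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * future - old)
            state = action
    return q


def greedy_path(q, start, goal):
    """Follow the highest-valued action from each segment until the goal is reached."""
    path = [start]
    while path[-1] != goal:
        state = path[-1]
        path.append(max(NEIGHBORS[state], key=lambda a: q.get((state, a), 0.0)))
    return path


q = q_learning("A", "D")
print(greedy_path(q, "A", "D"))
```

In this toy graph the route through B costs 2 and the route through C costs 5, so the learned greedy path should follow A, B, D. A real tracker would replace `RAW_COST` with a cost computed from the image and would grow `NEIGHBORS` as new segments are discovered.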

[31] From Pixels and Words to Waves: A Unified Framework for Spectral Dictionary vLLMs

Andrew Kiruluta,Priscilla Burity

Main category: cs.CV

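TL;DR: The paper proposes a spectral dictionary token mixer (SDict-VLM), described as the first vision-language model to remove both convolutions and self-attention, giving efficient and interpretable multimodal fusion.

Details Motivation: Current vision-language models (VLMs) rely on computationally intensive convolutions and self-attention, which limits efficiency and scalability. The paper uses a spectral dictionary representation for lightweight, transparent multimodal alignment.

Contribution: 1) The first VLM without convolutions or self-attention; 2) a spectral dictionary mixer with O(L log L) complexity; 3) a balance of performance and efficiency, with 60% fewer parameters and 2.2 times faster inference than PaLI-3.

Method: Image patches and wordpieces are each represented as a sparse combination of learnable frequency atoms. SDict-VLM aligns modalities through a shared frequency dictionary and supports a tunable trade-off between accuracy and compute.

Result: On MS-COCO captioning the model reaches BLEU-4 39.2, CIDEr 127.5, and SPICE 27.0, with 50.3% accuracy on VQAv2, closing approximately 85% of the performance gap to BLIP-2 at much lower cost.

Insight: A spectral dictionary reduces computational complexity and makes the model more transparent, opening a direction for efficient, interpretable VLMs.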

Abstract: Vision-language models (VLMs) unify computer vision and natural language processing in a single architecture capable of interpreting and describing images. Most state-of-the-art systems rely on two computationally intensive components: convolutions in the vision encoder and quadratic self-attention for multimodal fusion. This work removes both by introducing a spectral dictionary token mixer, which represents each image patch or wordpiece as a sparse combination of learnable frequency atoms. Our 1.1B-parameter prototype, SDict-VLM, achieves BLEU-4 of 39.2, CIDEr of 127.5, and SPICE of 27.0 on MS-COCO captioning, along with 50.3 percent accuracy on VQAv2. These results close approximately 85 percent of the performance gap to BLIP-2 while using 60 percent fewer parameters, 2.3 times less peak GPU memory, and 2.2 times faster inference than PaLI-3. To our knowledge, this is the first VLM to eliminate both convolutions and self-attention while matching mid-scale transformer baselines. In addition to its O(L log L) complexity, the shared frequency dictionary enables transparent cross-modal alignment and offers a tunable trade-off between accuracy and compute, paving the way for efficient and interpretable VLMs.
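
The O(L log L) complexity mentioned in the abstract is typical of mixers built on the fast Fourier transform. The sketch below is a generic FFT-based token mixer that scales each frequency bin of a sequence; it is not the paper's spectral dictionary, and the scalar tokens and the `gains` vector are hypothetical simplifications (a real model would mix vector-valued tokens with learned parameters).

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(x):
    """Inverse FFT via conjugation: ifft(x) = conj(fft(conj(x))) / n."""
    n = len(x)
    return [v.conjugate() / n for v in fft([c.conjugate() for c in x])]


def spectral_mix(tokens, gains):
    """Mix a length-L sequence of scalars by scaling each frequency bin (O(L log L))."""
    spectrum = fft([complex(t) for t in tokens])
    mixed = ifft([s * g for s, g in zip(spectrum, gains)])
    # Keep the real part; it is exact when the gains are conjugate-symmetric.
    return [v.real for v in mixed]


tokens = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
identity = spectral_mix(tokens, [1.0] * 8)            # all-ones gains return the input
mean_only = spectral_mix(tokens, [1.0] + [0.0] * 7)   # keeping only bin 0 returns the mean
print(identity)
print(mean_only)
```

Every output position depends on every input position, so the mixer is global like self-attention, but the cost grows as L log L rather than L squared.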

[32] DiffRIS: Enhancing Referring Remote Sensing Image Segmentation with Pre-trained Text-to-Image Diffusion Models

Zhe Dong,Yuzhe Sun,Tianzhu Liu,Yanfeng Gu

Main category: cs.CV

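TL;DR: DiffRIS uses a pre-trained text-to-image diffusion model, together with a context perception adapter and a progressive cross-modal reasoning decoder, to improve referring remote sensing image segmentation and reach state-of-the-art results.

Details Motivation: Referring remote sensing image segmentation (RRSIS) matters for disaster response and urban planning, but existing methods struggle with scale variations, diverse orientations, and semantic ambiguity. The authors use a pre-trained diffusion model to strengthen cross-modal alignment.

Contribution: 1. Proposes the DiffRIS framework, which applies a pre-trained diffusion model to RRSIS; 2. designs the CP-adapter to refine linguistic features dynamically; 3. develops PCMRD for progressive cross-modal alignment.

Method: 1. The CP-adapter refines linguistic features through global context modeling and object-aware reasoning; 2. PCMRD iteratively aligns text with visual regions through multi-scale feature interaction.

Result: DiffRIS outperforms existing methods on three benchmark datasets (RRSIS-D, RefSegRS, and RISBench), setting a new state of the art.

Insight: Pre-trained diffusion models can improve cross-modal alignment in remote sensing tasks; dynamic feature refinement and multi-scale interaction are the key design choices.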

Abstract: Referring remote sensing image segmentation (RRSIS) enables the precise delineation of regions within remote sensing imagery through natural language descriptions, serving critical applications in disaster response, urban development, and environmental monitoring. Despite recent advances, current approaches face significant challenges in processing aerial imagery due to complex object characteristics including scale variations, diverse orientations, and semantic ambiguities inherent to the overhead perspective. To address these limitations, we propose DiffRIS, a novel framework that harnesses the semantic understanding capabilities of pre-trained text-to-image diffusion models for enhanced cross-modal alignment in RRSIS tasks. Our framework introduces two key innovations: a context perception adapter (CP-adapter) that dynamically refines linguistic features through global context modeling and object-aware reasoning, and a progressive cross-modal reasoning decoder (PCMRD) that iteratively aligns textual descriptions with visual regions for precise segmentation. The CP-adapter bridges the domain gap between general vision-language understanding and remote sensing applications, while PCMRD enables fine-grained semantic alignment through multi-scale feature interaction. Comprehensive experiments on three benchmark datasets (RRSIS-D, RefSegRS, and RISBench) demonstrate that DiffRIS consistently outperforms existing methods across all standard metrics, establishing a new state-of-the-art for RRSIS tasks. The significant performance improvements validate the effectiveness of leveraging pre-trained diffusion models for remote sensing applications through our proposed adaptive framework.

[33] GLIMPSE: Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation for Generative LVLMs

Guanxi Shen

Main category: cs.CV

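TL;DR: GLIMPSE is a lightweight, model-agnostic framework for visualizing where large vision language models (LVLMs) direct their visual attention during open-ended visual question answering (VQA).

Details Motivation: Knowing where LVLMs attend while generating free-form text is essential for understanding model behavior, diagnosing hallucination, exposing bias, and ensuring transparency.

Contribution: Introduces GLIMPSE, which produces response-level attribution heat maps by fusing gradient-weighted attention, adaptive layer propagation, and weighted token aggregation.

Method: Gradient-weighted attention, adaptive layer propagation, and weighted token aggregation are combined to generate multimodal, response-level attribution heat maps.

Result: GLIMPSE outperforms prior interpretability methods in human alignment and reveals cross-modal attribution, token-level reasoning dynamics, and systematic misalignment with human attention in LVLMs.

Insight: GLIMPSE supports fine-grained analysis of cross-modal reasoning in LVLMs, making model behavior more transparent and helping to identify hallucination and bias.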

Abstract: Recent advances in large vision language models (LVLMs) have unlocked unprecedented capabilities in generating coherent responses from visual inputs. However, interpreting where LVLMs direct their visual attention while generating free-form textual responses remains a significant challenge, yet is essential for understanding model behavior, diagnosing hallucination, exposing bias and ensuring transparency. We introduce GLIMPSE (Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation), a lightweight, model-agnostic framework for visualizing the salient image regions that LVLMs rely upon during open-ended visual question answering (VQA), while concurrently revealing the multimodal textual saliency. GLIMPSE fuses gradient-weighted attention, adaptive layer propagation, and weighted token aggregation to produce holistic response-level attribution heat maps for interpreting cross-modal reasoning, outperforming prior interpretability methods in human-alignment. We demonstrate an analytic explainable AI (XAI) approach using GLIMPSE to uncover fine-grained insights into LVLM cross-modal attribution, trace token-level reasoning dynamics, and analyze systematic human-attention misalignment, hallucination, and bias.
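The gradient-weighted attention and weighted token aggregation steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single layer, plain nested lists in place of tensors, a ReLU on the gradient-attention product, and externally supplied token weights. Adaptive layer propagation is omitted entirely, and the function names are hypothetical.

```python
def gradient_weighted_attention(attn, grad):
    """Per-image-token relevance for one generated token.

    attn, grad: [heads][image_tokens] attention weights and their gradients.
    Assumption: relevance is the head-average of max(attn * grad, 0).
    """
    heads = len(attn)
    tokens = len(attn[0])
    return [
        sum(max(attn[h][i] * grad[h][i], 0.0) for h in range(heads)) / heads
        for i in range(tokens)
    ]


def aggregate_over_tokens(maps, token_weights):
    """Weighted average of per-token relevance maps into one response-level map."""
    total = sum(token_weights)
    size = len(maps[0])
    return [
        sum(w * m[i] for m, w in zip(maps, token_weights)) / total
        for i in range(size)
    ]
```

Negative gradient-attention products are clipped to zero here so that only image tokens that increase the output score contribute; whether GLIMPSE clips in the same way is not stated in the abstract.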

[34] LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR

Guang Yang,Victoria Ebert,Nazif Tamer,Luiza Pozzobon,Noah A. Smith

Main category: cs.CV

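TL;DR: LEGATO is a new end-to-end Transformer model for optical music recognition (OMR). It is the first large-scale pretrained OMR model that can recognize full-page or multi-page typeset scores and generate ABC notation.

Details Motivation: Existing OMR models cannot recognize large typeset scores end to end, and the field lacks a standardized evaluation. LEGATO aims to fill both gaps.

Contribution: 1. The first large-scale pretrained end-to-end OMR model. 2. Recognition of full-page or multi-page typeset scores with ABC notation output. 3. A comprehensive, standardized evaluation approach.

Method: Combines a pretrained vision encoder with an ABC decoder, trained on a dataset of more than 214K images.

Result: Achieves state-of-the-art performance across a range of datasets.

Insight: Large-scale pretraining and an end-to-end design are key to improving OMR generalization.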

Abstract: We propose Legato, a new end-to-end transformer model for optical music recognition (OMR). Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page typeset music scores and the first to generate documents in ABC notation, a concise, human-readable format for symbolic music. Bringing together a pretrained vision encoder with an ABC decoder trained on a dataset of more than 214K images, our model exhibits the strong ability to generalize across various typeset scores. We conduct experiments on a range of datasets and demonstrate that our model achieves state-of-the-art performance. Given the lack of a standardized evaluation for end-to-end OMR, we comprehensively compare our model against the previous state of the art using a diverse set of metrics.

[35] HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models

Yimu Wang,Mozhgan Nasr Azadani,Sean Sedwards,Krzysztof Czarnecki

Main category: cs.CV

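TL;DR: HAWAII is a novel vision-language model framework that distills knowledge from multiple visual experts into a single vision encoder, transferring knowledge efficiently while reducing computational overhead.

Details Motivation: Improving the visual understanding of vision-language models (VLMs) is crucial, but using multiple pretrained visual experts usually incurs high computational cost. HAWAII addresses this through knowledge distillation.

Contribution: 1. Proposes HAWAII, a multi-teacher knowledge distillation framework; 2. Designs teacher-specific LoRA adapters with a router to avoid conflicts among teachers; 3. Proposes fine-grained and coarse-grained distillation.

Method: 1. Teacher-specific LoRA adapters give each teacher a dedicated path; 2. Knowledge is transferred through fine-grained distillation (based on token importance) and coarse-grained distillation (summarized knowledge); 3. A router dynamically selects adapters.

Result: Experiments on a variety of vision-language tasks show that HAWAII outperforms existing open-source VLMs.

Insight: Dynamic adapters and hierarchical distillation make it possible to exploit the complementary strengths of multiple teachers while remaining efficient.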

Abstract: Improving the visual understanding ability of vision-language models (VLMs) is crucial for enhancing their performance across various tasks. While using multiple pretrained visual experts has shown great promise, it often incurs significant computational costs during training and inference. To address this challenge, we propose HAWAII, a novel framework that distills knowledge from multiple visual experts into a single vision encoder, enabling it to inherit the complementary strengths of several experts with minimal computational overhead. To mitigate conflicts among different teachers and switch between different teacher-specific knowledge, instead of using a fixed set of adapters for multiple teachers, we propose to use teacher-specific Low-Rank Adaptation (LoRA) adapters with a corresponding router. Each adapter is aligned with a specific teacher, avoiding noisy guidance during distillation. To enable efficient knowledge distillation, we propose fine-grained and coarse-grained distillation. At the fine-grained level, token importance scores are employed to emphasize the most informative tokens from each teacher adaptively. At the coarse-grained level, we summarize the knowledge from multiple teachers and transfer it to the student using a set of general-knowledge LoRA adapters with a router. Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII, compared to the popular open-source VLMs.
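The fine-grained step, in which token importance scores emphasize the most informative teacher tokens, can be sketched as an importance-weighted feature-matching loss. This is a simplified illustration under stated assumptions: the abstract does not say how importance scores are computed or which distance is used, so the scores are taken as given inputs, the distance is mean squared error, and the function name is hypothetical.

```python
def weighted_token_distillation(student_tokens, teacher_tokens, importance):
    """Importance-weighted MSE between student and teacher token features.

    student_tokens, teacher_tokens: [tokens][feature_dim] lists of floats.
    importance: one non-negative score per token (assumed to be given).
    """
    total = sum(importance)
    loss = 0.0
    for s, t, w in zip(student_tokens, teacher_tokens, importance):
        # Mean squared error over the feature dimension for this token.
        sq_err = sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
        # Normalized weight: more important tokens contribute more to the loss.
        loss += (w / total) * sq_err
    return loss
```

For example, with one mismatched token weighted 3 and one matched token weighted 1, the loss is three quarters of the mismatched token's error, so the student is pushed hardest toward the tokens the teacher marks as informative.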

[36] Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition

Iosif Tsangko,Andreas Triantafyllopoulos,Adem Abdelmoula,Adria Mallol-Ragolta,Bjoern W. Schuller

Main category: cs.CV

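TL;DR: This paper examines which visual cues foundation models (FMs) rely on for facial emotion recognition, finds that they depend mainly on surface features such as teeth visibility, and exposes potential bias and fairness problems.

Details Motivation: The study asks whether the features that foundation models, vision-language models in particular, rely on for emotion recognition are psychologically grounded, and whether these models exhibit bias or fairness problems.

Contribution: Shows that VLMs such as GPT-4o rely on non-psychological features such as teeth visibility for emotion recognition, and that this reliance can lead to bias and unfairness.

Method: Benchmarks VLMs of varying scale on a teeth-annotated subset of AffectNet, analyzes how their performance shifts, and applies structured introspection to study GPT-4o's reasoning.

Result: VLMs such as GPT-4o rely mainly on surface features such as teeth visibility for emotion recognition, and their reasoning is highly internally consistent.

Insight: Reveals "shortcut learning" in foundation models for emotion recognition and highlights potential bias and fairness problems in sensitive domains such as mental health and education.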

Abstract: Foundation Models (FMs) are rapidly transforming Affective Computing (AC), with Vision Language Models (VLMs) now capable of recognising emotions in zero-shot settings. This paper probes a critical but underexplored question: what visual cues do these models rely on to infer affect, and are these cues psychologically grounded or superficially learnt? We benchmark VLMs of varying scale on a teeth-annotated subset of the AffectNet dataset and find consistent performance shifts depending on the presence of visible teeth. Through structured introspection of the best-performing model, GPT-4o, we show that facial attributes like eyebrow position drive much of its affective reasoning, revealing a high degree of internal consistency in its valence-arousal predictions. These patterns highlight the emergent nature of FM behaviour, but also reveal risks: shortcut learning, bias, and fairness issues, especially in sensitive domains like mental health and education.

[37] Lightweight RGB-T Tracking with Mobile Vision Transformers

Mahdi Falaki,Maria A. Amer

Main category: cs.CV

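TL;DR: This paper proposes a lightweight RGB-T tracking algorithm based on Mobile Vision Transformers (MobileViT). A progressive fusion framework with separable attention enables efficient intra-modal and inter-modal interaction, sharply reducing parameter count and raising inference speed while keeping accuracy high.

Details Motivation: Single-modality tracking (e.g., RGB-only) performs poorly under challenging conditions such as low illumination and adverse weather. Vision Transformer-based multimodal trackers perform well but are usually computationally expensive, so the authors set out to build a lightweight, efficient multimodal tracker.

Contribution: 1) First to bring MobileViT to RGB-T multimodal tracking; 2) A progressive fusion framework with separable attention for efficient modality interaction; 3) A small model (fewer than 4 million parameters) running at 122 FPS with accuracy comparable to the state of the art.

Method: Uses MobileViT as the backbone and a progressive fusion framework with separable attention to jointly learn intra-modal and inter-modal interactions between the template and search regions.

Result: Compared with state-of-the-art efficient multimodal trackers, the model has far fewer parameters (fewer than 4 million) and faster inference (122 FPS) while maintaining comparable tracking accuracy.

Insight: Lightweight ViT architectures such as MobileViT show promise for multimodal tasks, and progressive fusion and separable attention are effective ways to improve efficiency.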

Abstract: Single-modality object tracking (e.g., RGB-only) encounters difficulties in challenging imaging conditions, such as low illumination and adverse weather conditions. To solve this, multimodal tracking (e.g., RGB-T models) aims to leverage complementary data such as thermal infrared features. While recent Vision Transformer-based multimodal trackers achieve strong performance, they are often computationally expensive due to large model sizes. In this work, we propose a novel lightweight RGB-T tracking algorithm based on Mobile Vision Transformers (MobileViT). Our tracker introduces a progressive fusion framework that jointly learns intra-modal and inter-modal interactions between the template and search regions using separable attention. This design produces effective feature representations that support more accurate target localization while achieving a small model size and fast inference speed. Compared to state-of-the-art efficient multimodal trackers, our model achieves comparable accuracy while offering significantly lower parameter counts (less than 4 million) and the fastest GPU inference speed of 122 frames per second. This paper is the first to propose a tracker using Mobile Vision Transformers for RGB-T tracking and multimodal tracking at large. Tracker code and model weights will be made publicly available upon acceptance.

[38] PRISM: Perceptual Recognition for Identifying Standout Moments in Human-Centric Keyframe Extraction

Mert Can Cakmak,Nitin Agarwal,Diwash Poudel

Main category: cs.CV

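TL;DR: PRISM is a lightweight, perceptually aligned framework for extracting human-centric keyframes from video, suited to real-time and resource-constrained environments.

Details Motivation: Online videos play a major role in political discourse and in cyber social threats such as misinformation, propaganda, and radicalization. Identifying the most impactful “standout” moments in video is essential for content moderation and summarization.

Contribution: PRISM is a training-free, interpretable, and computationally efficient keyframe extraction method built on the CIELAB color space and perceptual color difference metrics.

Method: PRISM operates in the CIELAB color space and uses perceptual color difference metrics to identify frames that align with human visual sensitivity.

Result: Experiments on the BBC, TVSum, SumMe, and ClipShots datasets show that PRISM achieves strong accuracy and fidelity while maintaining high compression ratios.

Insight: PRISM offers a scalable tool for analyzing harmful or politically sensitive media on online platforms, particularly in resource-constrained environments.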

Abstract: Online videos play a central role in shaping political discourse and amplifying cyber social threats such as misinformation, propaganda, and radicalization. Detecting the most impactful or “standout” moments in video content is crucial for content moderation, summarization, and forensic analysis. In this paper, we introduce PRISM (Perceptual Recognition for Identifying Standout Moments), a lightweight and perceptually-aligned framework for keyframe extraction. PRISM operates in the CIELAB color space and uses perceptual color difference metrics to identify frames that align with human visual sensitivity. Unlike deep learning-based approaches, PRISM is interpretable, training-free, and computationally efficient, making it well suited for real-time and resource-constrained environments. We evaluate PRISM on four benchmark datasets: BBC, TVSum, SumMe, and ClipShots, and demonstrate that it achieves strong accuracy and fidelity while maintaining high compression ratios. These results highlight PRISM’s effectiveness in both structured and unstructured video content, and its potential as a scalable tool for analyzing and moderating harmful or politically sensitive media in online platforms.
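To make the idea of keyframe selection by perceptual color difference concrete, here is a minimal, training-free sketch. It is not PRISM itself: the abstract does not specify which color difference formula, frame descriptor, or threshold the method uses, so this example assumes the CIE76 ΔE (Euclidean distance in CIELAB), represents each frame by its mean CIELAB color, and uses a hypothetical threshold of 10.

```python
import math


def srgb_to_lab(rgb):
    """Convert one sRGB pixel (0-255 per channel) to CIELAB under the D65 white point."""
    def linear(c):
        # Undo the sRGB gamma curve to get linear light in [0, 1].
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linear(c) for c in rgb)
    # Linear RGB -> CIE XYZ (sRGB primaries, D65).
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b

    def f(t):
        # CIELAB companding function with the standard linear toe.
        return t ** (1 / 3) if t > 216 / 24389 else (24389 / 27 * t + 16) / 116

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def mean_lab(frame):
    """Average CIELAB color of a frame given as a flat list of (R, G, B) pixels."""
    labs = [srgb_to_lab(p) for p in frame]
    return tuple(sum(c[i] for c in labs) / len(labs) for i in range(3))


def select_keyframes(frames, threshold=10.0):
    """Keep frame 0, then every frame whose mean color differs from the
    most recently kept keyframe by more than `threshold` (CIE76 delta E)."""
    if not frames:
        return []
    keep = [0]
    reference = mean_lab(frames[0])
    for i in range(1, len(frames)):
        current = mean_lab(frames[i])
        if math.dist(reference, current) > threshold:
            keep.append(i)
            reference = current
    return keep
```

Working in CIELAB rather than RGB means equal numeric distances correspond more closely to equal perceived differences, which is the property the abstract appeals to when it describes the method as perceptually aligned.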

[39] MOSCARD – Causal Reasoning and De-confounding for Multimodal Opportunistic Screening of Cardiovascular Adverse Events

Jialu Pi,Juan Maria Farina,Rimita Lahiri,Jiwoong Jeong,Archana Gurudu,Hyung-Bok Park,Chieh-Ju Chao,Chadi Ayoub,Reza Arsanjani,Imon Banerjee

Main category: cs.CV

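TL;DR: The paper proposes MOSCARD, a multimodal causal reasoning framework for screening major adverse cardiovascular events. It combines chest X-ray (CXR) and electrocardiogram (ECG) data and uses causal reasoning and de-confounding to improve prediction.

Details Motivation: Major adverse cardiovascular events (MACE) are a leading cause of death worldwide, while existing screening methods are limited by sampling bias and single-modality data. The paper aims to improve screening accuracy and robustness through multimodal integration and causal reasoning.

Contribution: 1. Multimodal alignment of CXR with ECG guidance; 2. Integration of causal reasoning; 3. A dual back-propagation graph for de-confounding.

Method: A multimodal causal reasoning framework that combines CXR and ECG data, aligns the two modalities with co-attention, and removes confounders through a dual back-propagation graph.

Result: On internal, emergency department (ED) shift, and external MIMIC datasets, MOSCARD outperforms single-modality and state-of-the-art models (AUC 0.75, 0.83, and 0.71, respectively).

Insight: Integrating multimodal data with causal reasoning can markedly improve cardiovascular event screening, and de-confounding helps reduce bias and improve robustness.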

Abstract: Major Adverse Cardiovascular Events (MACE) remain the leading cause of mortality globally, as reported in the Global Disease Burden Study 2021. Opportunistic screening leverages data collected from routine health check-ups, and multimodal data can play a key role in identifying at-risk individuals. Chest X-rays (CXR) provide insights into chronic conditions contributing to MACE, while the 12-lead electrocardiogram (ECG) directly assesses cardiac electrical activity and structural abnormalities. Integrating CXR and ECG could offer a more comprehensive risk assessment than conventional models, which rely on clinical scores, computed tomography (CT) measurements, or biomarkers and may be limited by sampling bias and single-modality constraints. We propose a novel predictive modeling framework, MOSCARD, which uses multimodal causal reasoning with co-attention to align two distinct modalities and simultaneously mitigate bias and confounders in opportunistic risk estimation. The primary technical contributions are: (i) multimodal alignment of CXR with ECG guidance; (ii) integration of causal reasoning; (iii) a dual back-propagation graph for de-confounding. Evaluated on internal data, shift data from the emergency department (ED), and external MIMIC datasets, our model outperformed single-modality and state-of-the-art foundational models (AUC: 0.75, 0.83, and 0.71, respectively). The proposed cost-effective opportunistic screening enables early intervention, improving patient outcomes and reducing disparities.

[40] OpenWildlife: Open-Vocabulary Multi-Species Wildlife Detector for Geographically-Diverse Aerial Imagery

Muhammed Patel,Javier Noa Turnes,Jayden Hsiao,Linlin Xu,David Clausi

Main category: cs.CV

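TL;DR: OpenWildlife (OW) is an open-vocabulary wildlife detector that uses language-aware embeddings and an adapted Grounding-DINO framework to identify multiple species, and it performs strongly on diverse aerial imagery.

Details Motivation: Existing methods perform well in specific settings but generalize poorly across species and environments, which limits their use in biodiversity assessment.

Contribution: 1. Proposes OpenWildlife (OW), which identifies multiple species specified through natural language input; 2. Introduces an efficient search algorithm; 3. Releases source code and dataset splits to support reproducibility.

Method: Combines language-aware embeddings with an adapted Grounding-DINO framework, and designs an efficient search algorithm that combines k-nearest neighbors and breadth-first search.

Result: Trained on 15 datasets, OW reaches up to 0.981 mAP50 with fine-tuning and 0.597 mAP50 on datasets featuring novel species. The search algorithm captures over 95% of species while exploring only 33% of the images.

Insight: OW shows the potential of open-vocabulary models and language embeddings in ecology, offering a flexible and efficient tool for global biodiversity assessment.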

Abstract: We introduce OpenWildlife (OW), an open-vocabulary wildlife detector designed for multi-species identification in diverse aerial imagery. While existing automated methods perform well in specific settings, they often struggle to generalize across different species and environments due to limited taxonomic coverage and rigid model architectures. In contrast, OW leverages language-aware embeddings and a novel adaptation of the Grounding-DINO framework, enabling it to identify species specified through natural language inputs across both terrestrial and marine environments. Trained on 15 datasets, OW outperforms most existing methods, achieving up to 0.981 mAP50 with fine-tuning and 0.597 mAP50 on seven datasets featuring novel species. Additionally, we introduce an efficient search algorithm that combines k-nearest neighbors and breadth-first search to prioritize areas where social species are likely to be found. This approach captures over 95% of species while exploring only 33% of the available images. To support reproducibility, we publicly release our source code and dataset splits, establishing OW as a flexible, cost-effective solution for global biodiversity assessments.
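One way to read the search idea, combining k-nearest neighbors with breadth-first search so that areas around positive detections are explored first, is sketched below. The abstract gives no algorithmic details, so everything here is an assumption made for illustration: images are reduced to 2D capture locations, `has_animals` is a stand-in callable for running the detector on an image, and `seeds` is a hypothetical list of images to sample when no lead is being followed.

```python
import math
from collections import deque


def knn_graph(points, k):
    """For each point, list the indices of its k nearest other points."""
    graph = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        graph.append(others[:k])
    return graph


def prioritized_search(points, has_animals, seeds, k=2):
    """Visit seed images in order. Whenever a visited image contains animals,
    expand breadth-first to its k nearest neighbours. Returns the visit order."""
    graph = knn_graph(points, k)
    visited, order = set(), []
    for seed in seeds:
        if seed in visited:
            continue
        queue = deque([seed])
        visited.add(seed)
        while queue:
            i = queue.popleft()
            order.append(i)
            if has_animals(i):
                # Social species cluster, so neighbours of a hit are queued next.
                for j in graph[i]:
                    if j not in visited:
                        visited.add(j)
                        queue.append(j)
    return order
```

In a toy layout with animals in three adjacent images and three empty images farther away, seeding at one animal image and two empty ones visits all three animal images while skipping one empty image entirely, which is the kind of saving the abstract reports at larger scale.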

[41] Ancient Script Image Recognition and Processing: A Review

Xiaolei Diao,Rite Bo,Yanling Xiao,Lida Shi,Zhihan Zhou,Hao Xu,Chuntao Li,Xiongfeng Tang,Massimo Poesio,Cédric M. John,Daqian Shi

Main category: cs.CV

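TL;DR: This survey comprehensively reviews methods for ancient script image recognition, analyzes the differences and commonalities among script types and their recognition techniques, examines unique challenges such as imbalanced data and image degradation, and summarizes current limitations and future directions.

Details Motivation: Ancient scripts are vital carriers of human civilization, and automating their recognition matters for research in archaeology and the digital humanities. With the rise of deep learning the field has advanced rapidly, but it faces unique challenges such as imbalanced data and image degradation.

Contribution: 1. Categorizes and analyzes ancient script types and their recognition methods; 2. Systematically examines the challenges unique to ancient script recognition and their solutions; 3. Summarizes current limitations and future research directions.

Method: A survey that categorizes and analyzes existing studies, with a focus on solutions such as few-shot learning and noise-robust techniques.

Result: Provides a structured perspective that supports continued progress in the recognition, interpretation, and decipherment of ancient scripts.

Insight: Recognition methods share common strategies across script types, but each script's particular challenges call for dedicated solutions; future work could further incorporate multimodal and cross-domain knowledge.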

Abstract: Ancient scripts, e.g., Egyptian hieroglyphs, Oracle Bone Inscriptions, and Ancient Greek inscriptions, serve as vital carriers of human civilization, embedding invaluable historical and cultural information. Automating ancient script image recognition has gained importance, enabling large-scale interpretation and advancing research in archaeology and digital humanities. With the rise of deep learning, this field has progressed rapidly, with numerous script-specific datasets and models proposed. While these scripts vary widely, spanning phonographic systems with limited glyphs to logographic systems with thousands of complex symbols, they share common challenges and methodological overlaps. Moreover, ancient scripts face unique challenges, including imbalanced data distribution and image degradation, which have driven the development of various dedicated methods. This survey provides a comprehensive review of ancient script image recognition methods. We begin by categorizing existing studies based on script types and analyzing respective recognition methods, highlighting both their differences and shared strategies. We then focus on challenges unique to ancient scripts, systematically examining their impact and reviewing recent solutions, including few-shot learning and noise-robust techniques. Finally, we summarize current limitations and outline promising future directions. Our goal is to offer a structured, forward-looking perspective to support ongoing advancements in the recognition, interpretation, and decipherment of ancient scripts.

[42] MedErr-CT: A Visual Question Answering Benchmark for Identifying and Correcting Errors in CT Reports

Sunggu Kyung,Hyungbin Park,Jinyoung Seo,Jimin Sung,Jihyun Kim,Dongyeong Kim,Wooyoung Jo,Yoojin Nam,Sangah Park,Taehee Kwon,Sang Min Lee,Namkug Kim

Main category: cs.CV

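TL;DR: MedErr-CT is a new visual question answering benchmark for evaluating how well multimodal large language models (MLLMs) identify and correct errors in CT reports.

Details Motivation: CT is crucial to clinical diagnosis, but diagnostic errors are a growing concern. Existing medical VQA benchmarks lack clinical relevance and cannot assess expert-level knowledge.

Contribution: Proposes the MedErr-CT benchmark, with six error categories and three task levels, for evaluating the error identification and correction abilities of medical MLLMs.

Method: Defines six error categories (vision-centric and lexical errors) and three task levels (classification, detection, and correction), and evaluates MLLMs through a VQA framework.

Result: An evaluation of state-of-the-art 3D medical MLLMs shows substantial variation in performance across error types.

Insight: MedErr-CT can drive the development of more reliable, clinically applicable MLLMs, helping reduce diagnostic errors and improve clinical accuracy.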

Abstract: Computed Tomography (CT) plays a crucial role in clinical diagnosis, but the growing demand for CT examinations has raised concerns about diagnostic errors. While Multimodal Large Language Models (MLLMs) demonstrate promising comprehension of medical knowledge, their tendency to produce inaccurate information highlights the need for rigorous validation. However, existing medical visual question answering (VQA) benchmarks primarily focus on simple visual recognition tasks, lacking clinical relevance and failing to assess expert-level knowledge. We introduce MedErr-CT, a novel benchmark for evaluating medical MLLMs’ ability to identify and correct errors in CT reports through a VQA framework. The benchmark includes six error categories - four vision-centric errors (Omission, Insertion, Direction, Size) and two lexical error types (Unit, Typo) - and is organized into three task levels: classification, detection, and correction. Using this benchmark, we quantitatively assess the performance of state-of-the-art 3D medical MLLMs, revealing substantial variation in their capabilities across different error types. Our benchmark contributes to the development of more reliable and clinically applicable MLLMs, ultimately helping reduce diagnostic errors and improve accuracy in clinical practice. The code and datasets are available at https://github.com/babbu3682/MedErr-CT.

[43] Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification

Minghao Qin,Xiangrui Liu,Zhengyang Liang,Yan Shu,Huaying Yuan,Juenjie Zhou,Shitao Xiao,Bo Zhao,Zheng Liu

Main category: cs.CV

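TL;DR: Video-XL-2 uses task-aware KV sparsification to address the high compute and memory cost of long-video understanding, delivering efficient and strong long-video analysis.

Details Motivation: Current multimodal large language models face high memory and compute costs on long videos and need to balance performance against efficiency.

Contribution: Proposes Video-XL-2, a framework based on task-aware KV sparsification that comprises chunk-based pre-filling and bi-level KV decoding and markedly improves both the capability and the efficiency of long-video understanding.

Method: 1. Chunk-based pre-filling: visual tokens are split into chunks, with full attention within each chunk and sparse attention across chunks. 2. Bi-level KV decoding: dense or sparse KVs are selectively reloaded for each chunk according to its relevance to the task.

Result: Achieves state-of-the-art performance on several long-video understanding benchmarks, outperforming existing open-source lightweight models; it processes over 10,000 frames on a single NVIDIA A100 (80GB) GPU and thousands of frames in just a few seconds.

Insight: Combining KV sparsification with a task-aware chunking strategy is an effective way to improve both the efficiency and the performance of long-video understanding.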

Abstract: Multi-modal large language models (MLLMs) have made significant progress in video understanding over the past few years. However, processing long video inputs remains a major challenge due to high memory and computational costs. This makes it difficult for current models to achieve both strong performance and high efficiency in long video understanding. To address this challenge, we propose Video-XL-2, a novel MLLM that delivers superior cost-effectiveness for long-video understanding based on task-aware KV sparsification. The proposed framework operates with two key steps: chunk-based pre-filling and bi-level key-value decoding. Chunk-based pre-filling divides the visual token sequence into chunks, applying full attention within each chunk and sparse attention across chunks. This significantly reduces computational and memory overhead. During decoding, bi-level key-value decoding selectively reloads either dense or sparse key-values for each chunk based on its relevance to the task. This approach further improves memory efficiency and enhances the model’s ability to capture fine-grained information. Video-XL-2 achieves state-of-the-art performance on various long video understanding benchmarks, outperforming existing open-source lightweight models. It also demonstrates exceptional efficiency, capable of processing over 10,000 frames on a single NVIDIA A100 (80GB) GPU and thousands of frames in just a few seconds.
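The two steps described in the abstract can be illustrated with a small sketch. It is a toy model under stated assumptions, not the paper's implementation: the abstract does not define the cross-chunk sparsity pattern, so this example assumes each token may attend only to the first few "anchor" tokens of earlier chunks, and it assumes the dense/sparse choice is made by ranking per-chunk relevance scores that are supplied from outside. Both function names and the `anchors_per_chunk` and `dense_budget` parameters are hypothetical.

```python
def chunk_attention_mask(num_tokens, chunk_size, anchors_per_chunk=1):
    """Boolean mask[q][k]: may query token q attend to key token k?

    Full attention inside a chunk; across chunks, only the leading
    `anchors_per_chunk` tokens of earlier chunks are visible (assumed pattern).
    """
    mask = [[False] * num_tokens for _ in range(num_tokens)]
    for q in range(num_tokens):
        for k in range(num_tokens):
            q_chunk, k_chunk = q // chunk_size, k // chunk_size
            if q_chunk == k_chunk:
                mask[q][k] = True
            elif k_chunk < q_chunk and k % chunk_size < anchors_per_chunk:
                mask[q][k] = True
    return mask


def choose_kv_level(relevance, dense_budget):
    """Per chunk, reload 'dense' KVs for the `dense_budget` most task-relevant
    chunks and 'sparse' KVs for the rest."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    dense = set(ranked[:dense_budget])
    return ["dense" if i in dense else "sparse" for i in range(len(relevance))]
```

With 6 tokens in chunks of 3 and one anchor per chunk, the mask allows 21 of the 36 possible query-key pairs; the saving grows with sequence length because the cross-chunk cost scales with the number of anchors rather than the number of tokens.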

[44] MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models

Yinan Xia,Yilei Jiang,Yingshui Tan,Xiaoyong Zhu,Xiangyu Yue,Bo Zheng

Main category: cs.CV

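TL;DR: MSR-Align is a high-quality multimodal safety reasoning dataset that strengthens the safety of vision-language models through fine-grained, policy-grounded reasoning and improves their robustness to textual and vision-language attacks.

Details Motivation: Existing safety alignment methods mainly target unimodal language models and cannot handle the complex threats posed by multimodal inputs, so reasoning-capable vision-language models (VLMs) need a safety alignment approach of their own.

Contribution: Proposes the MSR-Align dataset, which supports fine-grained, policy-grounded reasoning, and shows experimentally that it substantially improves VLM robustness to attacks while preserving or improving general reasoning ability.

Method: Builds the dataset with a data generation pipeline that emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering.

Result: VLMs fine-tuned on MSR-Align are more robust to textual and vision-language attacks, with general reasoning performance preserved or improved.

Insight: Fine-grained, policy-grounded reasoning is key to the safety of multimodal models, and high-quality datasets are the foundation for safety alignment research.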

Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal language models, fall short in addressing the complex and nuanced threats posed by multimodal inputs. Moreover, current safety datasets lack the fine-grained, policy-grounded reasoning required to robustly align reasoning-capable VLMs. In this work, we introduce MSR-Align, a high-quality Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities. Our data generation pipeline emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Extensive experiments demonstrate that fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks, while preserving or enhancing general reasoning performance. MSR-Align provides a scalable and effective foundation for advancing the safety alignment of reasoning-capable VLMs. Our dataset is made publicly available at https://huggingface.co/datasets/Leigest/MSR-Align.

[45] Self-Paced Collaborative and Adversarial Network for Unsupervised Domain Adaptation

Weichen Zhang,Dong Xu,Wanli Ouyang,Wen Li

Main category: cs.CV

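TL;DR: The paper proposes CAN, an unsupervised domain adaptation method that combines domain-collaborative and domain-adversarial learning, unifies the two through positive and negative weights on the losses, and designs a training scheme that automatically learns domain-specific and domain-invariant features. Its extension, SPCAN, uses self-paced learning to select pseudo-labeled target samples and achieves state-of-the-art results on several benchmark datasets.

Details Motivation: The core goal of unsupervised domain adaptation is to reduce the distribution gap between source and target domains while preserving discriminability in the target domain. Traditional methods usually handle domain invariance or discriminability separately, and unifying the two effectively remains a challenge.

Contribution: 1. Proposes CAN, which unifies domain-collaborative and domain-adversarial learning; 2. Designs a training scheme that automatically learns domain-specific and domain-invariant features; 3. Proposes SPCAN, which selects pseudo-labeled target samples through self-paced learning to improve discriminability.

Method: 1. Domain-collaborative learning preserves target-domain discriminability through positively weighted losses; 2. Domain-adversarial learning reduces the domain distribution mismatch through negatively weighted losses; 3. SPCAN uses a self-paced strategy to select pseudo-labeled samples from easy to hard for re-training.

Result: Achieves state-of-the-art performance on Office-31, ImageCLEF-DA, VISDA-2017, and other datasets, confirming the effectiveness of the approach.

Insight: Unifying domain-collaborative and domain-adversarial learning addresses unsupervised domain adaptation more completely, and self-paced learning effectively improves classification in the target domain.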

Abstract: This paper proposes a new unsupervised domain adaptation approach called Collaborative and Adversarial Network (CAN), which uses the domain-collaborative and domain-adversarial learning strategy for training the neural network. The domain-collaborative learning aims to learn domain-specific feature representation to preserve the discriminability for the target domain, while the domain adversarial learning aims to learn domain-invariant feature representation to reduce the domain distribution mismatch between the source and target domains. We show that these two learning strategies can be uniformly formulated as domain classifier learning with positive or negative weights on the losses. We then design a collaborative and adversarial training scheme, which automatically learns domain-specific representations from lower blocks in CNNs through collaborative learning and domain-invariant representations from higher blocks through adversarial learning. Moreover, to further enhance the discriminability in the target domain, we propose Self-Paced CAN (SPCAN), which progressively selects pseudo-labeled target samples for re-training the classifiers. We employ a self-paced learning strategy to select pseudo-labeled target samples in an easy-to-hard fashion. Comprehensive experiments on different benchmark datasets, Office-31, ImageCLEF-DA, and VISDA-2017 for the object recognition task, and UCF101-10 and HMDB51-10 for the video action recognition task, show our newly proposed approaches achieve the state-of-the-art performance, which clearly demonstrates the effectiveness of our proposed approaches for unsupervised domain adaptation.
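The easy-to-hard selection of pseudo-labeled target samples can be sketched with a confidence threshold that relaxes over training rounds. This is a generic self-paced selection scheme offered as an illustration, not the SPCAN procedure: the abstract does not say how "easy" is measured or how the pace is scheduled, so the sketch assumes classifier confidence as the easiness measure and a linear threshold schedule with hypothetical `start` and `end` values.

```python
def select_pseudo_labeled(probabilities, round_index, total_rounds,
                          start=0.95, end=0.6):
    """Return (sample_index, pseudo_label) pairs confident enough for this round.

    probabilities: per-sample class probability lists from the current classifier.
    The confidence threshold falls linearly from `start` to `end`, so early
    rounds admit only easy (high-confidence) samples and later rounds admit
    harder ones.
    """
    fraction = round_index / max(total_rounds - 1, 1)
    threshold = start + (end - start) * fraction
    selected = []
    for index, probs in enumerate(probabilities):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((index, label))
    return selected
```

In each round the selected samples, with their predicted labels treated as ground truth, would be added to the training set before the classifier is re-trained and the target predictions are refreshed.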

[46] AirV2X: Unified Air-Ground Vehicle-to-Everything Collaboration

Xiangbo Gao,Yuheng Wu,Xuewen Luo,Keshu Wu,Xinghao Chen,Yuping Wang,Chenxi Liu,Yang Zhou,Zhengzhong Tu

Main category: cs.CV

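TL;DR: AirV2X proposes a unified air-ground V2X collaboration framework that uses drones as a flexible alternative or complement to fixed road-side units (RSUs), addressing the high deployment cost and coverage gaps of traditional V2X systems.

Details Motivation: Traditional V2X systems suffer from high cost and coverage gaps in rural and suburban areas, and the mobility and low cost of drones offer a new solution.

Contribution: Presents AirV2X-Perception, a large-scale dataset that supports the development and standardized evaluation of Vehicle-to-Drone (V2D) algorithms and fills a gap in aerial-assisted autonomous driving.

Method: Uses the dynamic positioning and bird's-eye view of drones to collect 6.73 hours of data across scenes (urban, suburban, rural) and conditions (weather, lighting).

Result: The dataset is open-sourced on GitHub to support research and applications in drone-assisted autonomous driving.

Insight: The flexibility of drones opens new possibilities for V2X systems, with clear advantages in cost and coverage.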

Abstract: While multi-vehicular collaborative driving demonstrates clear advantages over single-vehicle autonomy, traditional infrastructure-based V2X systems remain constrained by substantial deployment costs and the creation of “uncovered danger zones” in rural and suburban areas. We present AirV2X-Perception, a large-scale dataset that leverages Unmanned Aerial Vehicles (UAVs) as a flexible alternative or complement to fixed Road-Side Units (RSUs). Drones offer unique advantages over ground-based perception: complementary bird’s-eye-views that reduce occlusions, dynamic positioning capabilities that enable hovering, patrolling, and escorting navigation rules, and significantly lower deployment costs compared to fixed infrastructure. Our dataset comprises 6.73 hours of drone-assisted driving scenarios across urban, suburban, and rural environments with varied weather and lighting conditions. The AirV2X-Perception dataset facilitates the development and standardized evaluation of Vehicle-to-Drone (V2D) algorithms, addressing a critical gap in the rapidly expanding field of aerial-assisted autonomous driving systems. The dataset and development kits are open-sourced at https://github.com/taco-group/AirV2X-Perception.

[47] Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding

Runwei Guan,Ningwei Ouyang,Tianhao Xu,Shaofeng Liang,Wei Dai,Yafeng Sun,Shang Gao,Songning Lai,Shanliang Yao,Xuming Hu,Ryan Wen Liu,Yutao Yue,Hui Xiong

Main category: cs.CV

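TL;DR: The paper presents WaterCaption, the first image captioning dataset dedicated to waterway environments, and Da Yu, an edge-deployable multimodal large language model that uses a Nano Transformer Adaptor (NTA) for efficient long-text generation.

Details Motivation: The complexity of waterway environments keeps existing perception models from achieving global semantic understanding, which limits large-scale monitoring and structured log generation. Building on vision-language models (VLMs), the paper uses image captioning to improve waterway surveillance and scene understanding.

Contribution: 1. WaterCaption, the first captioning dataset for waterway environments, with 20.2k image-text pairs and a vocabulary size of 1.8 million. 2. Da Yu, an edge-deployable multimodal large language model with a Nano Transformer Adaptor (NTA) that efficiently models global and local visual features to generate long-text descriptions.

Method: 1. Builds WaterCaption, focusing on fine-grained, multi-region long-text descriptions. 2. Proposes Da Yu, which uses NTA in the vision-to-language projection to balance computational efficiency with feature modeling capacity and improve long-text generation.

Result: Da Yu outperforms state-of-the-art models on WaterCaption and other captioning benchmarks while remaining efficient.

Insight: Shows the potential of fine-grained captioning for understanding waterway environments and offers a new approach to deploying vision-language models on edge devices, especially in complex open-water scenes.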

Abstract: Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models primarily focus on instance-level object perception paradigms (e.g., detection, segmentation). However, due to the complexity of waterway environments, current perception datasets and models fail to achieve global semantic understanding of waterways, limiting large-scale monitoring and structured log generation. With the advancement of vision-language models (VLMs), we leverage image captioning to introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region long-text descriptions, providing a new research direction for visual geo-understanding and spatial scene cognition. Specifically, it includes 20.2k image-text pairs with a vocabulary size of 1.8 million. Additionally, we propose Da Yu, an edge-deployable multi-modal large language model for USVs, which introduces a novel vision-to-language projector called Nano Transformer Adaptor (NTA). NTA effectively balances computational efficiency with the capacity for both global and fine-grained local modeling of visual features, thereby significantly enhancing the model’s ability to generate long-form textual outputs. Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks.

[48] HoliGS: Holistic Gaussian Splatting for Embodied View Synthesis

Xiaoyuan Wang,Yizhou Zhao,Botao Ye,Xiaojun Shan,Weijie Lyu,Lu Qi,Kelvin C. K. Chan,Yinxiao Li,Ming-Hsuan Yang

Main category: cs.CV

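TL;DR: HoliGS is a novel deformable Gaussian splatting framework for efficient view synthesis from long monocular RGB videos, with substantially lower training and rendering time.

Details Motivation: Existing 4D Gaussian splatting and dynamic NeRF methods incur heavy training overhead on long captures; HoliGS aims to provide an efficient, scalable alternative.

Contribution: Proposes a hierarchical Gaussian splatting deformation framework that decomposes a scene into a static background and dynamic objects and models non-rigid deformation with an invertible neural flow, markedly improving view synthesis quality.

Method: Uses invertible Gaussian splatting deformation networks that combine global rigid transformations, skeleton-driven articulation, and neural flow to model dynamic objects, supporting free-viewpoint rendering.

Result: Achieves superior reconstruction quality on challenging datasets while significantly reducing training and rendering time.

Insight: Through its hierarchical warping strategy, HoliGS offers an efficient, scalable approach to view synthesis in real-world scenes, especially in multi-view interactive scenarios.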

Abstract: We propose HoliGS, a novel deformable Gaussian splatting framework that addresses embodied view synthesis (EVS) from long monocular RGB videos. Unlike prior 4D Gaussian splatting and dynamic NeRF pipelines, which struggle with training overhead in minute-long captures, our method leverages invertible Gaussian Splatting deformation networks to reconstruct large-scale, dynamic environments accurately. Specifically, we decompose each scene into a static background plus time-varying objects, each represented by learned Gaussian primitives undergoing global rigid transformations, skeleton-driven articulation, and subtle non-rigid deformations via an invertible neural flow. This hierarchical warping strategy enables robust free-viewpoint novel-view rendering from various embodied camera trajectories by attaching Gaussians to a complete canonical foreground shape (e.g., egocentric or third-person follow), which may involve substantial viewpoint changes and interactions between multiple actors. Our experiments demonstrate that HoliGS achieves superior reconstruction quality on challenging datasets while significantly reducing both training and rendering time compared to state-of-the-art monocular deformable NeRFs. These results highlight a practical and scalable solution for EVS in real-world scenarios. The source code will be released.

[49] Open-Vocabulary Camouflaged Object Segmentation with Cascaded Vision Language Models

Kai Zhao,Wubang Yuan,Zheng Wang,Guanyi Li,Xiaoqiang Zhu,Deng-ping Fan,Dan Zeng

Main category: cs.CV

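TL;DR: This paper proposes a new cascaded vision-language model (VLM) framework for open-vocabulary camouflaged object segmentation (OVCOS). By combining a VLM with SAM, it addresses the domain gap and imprecise boundaries of earlier methods.

Details Motivation: OVCOS requires segmenting and classifying camouflaged objects from arbitrary categories, which is hard because of visual ambiguity and unseen categories. Traditional two-stage methods suffer from a domain gap and rely on generic segmentation models that work poorly on camouflaged objects.

Contribution: 1. A VLM-guided cascaded framework that feeds VLM features to SAM as explicit prompts, improving localization of camouflaged objects. 2. A soft spatial prior that avoids the domain gap caused by hard cropping and improves classification accuracy.

Method: 1. SAM performs segmentation, with VLM features as prompts that direct attention to camouflaged regions. 2. The alpha channel retains full image context while providing spatial guidance, enabling more precise classification.

Result: The method shows clear advantages on OVCOS and conventional camouflaged object segmentation benchmarks, confirming the value of VLM semantics for both segmentation and classification.

Insight: Combining a VLM with SAM effectively addresses the domain gap and blurred boundaries in camouflaged object segmentation, and the soft spatial prior offers a new idea for multimodal tasks.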

Abstract: Open-Vocabulary Camouflaged Object Segmentation (OVCOS) seeks to segment and classify camouflaged objects from arbitrary categories, presenting unique challenges due to visual ambiguity and unseen categories. Recent approaches typically adopt a two-stage paradigm: first segmenting objects, then classifying the segmented regions using Vision Language Models (VLMs). However, these methods (1) suffer from a domain gap caused by the mismatch between VLMs’ full-image training and cropped-region inference, and (2) depend on generic segmentation models optimized for well-delineated objects, making them less effective for camouflaged objects. Without explicit guidance, generic segmentation models often overlook subtle boundaries, leading to imprecise segmentation. In this paper, we introduce a novel VLM-guided cascaded framework to address these issues in OVCOS. For segmentation, we leverage the Segment Anything Model (SAM), guided by the VLM. Our framework uses VLM-derived features as explicit prompts to SAM, effectively directing attention to camouflaged regions and significantly improving localization accuracy. For classification, we avoid the domain gap introduced by hard cropping. Instead, we treat the segmentation output as a soft spatial prior via the alpha channel, which retains the full image context while providing precise spatial guidance, leading to more accurate and context-aware classification of camouflaged objects. The same VLM is shared across both segmentation and classification to ensure efficiency and semantic consistency. Extensive experiments on both OVCOS and conventional camouflaged object segmentation benchmarks demonstrate the clear superiority of our method, highlighting the effectiveness of leveraging rich VLM semantics for both segmentation and classification of camouflaged objects.
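
The contrast between hard cropping and a soft spatial prior can be illustrated with a minimal NumPy sketch. It assumes the segmentation output is a per-pixel probability mask in [0, 1]; the helper names `attach_alpha_prior` and `hard_crop` are hypothetical, and how the paper's VLM actually consumes the alpha channel is not specified in the abstract.

```python
import numpy as np

def attach_alpha_prior(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Stack a soft mask onto an RGB image as a fourth (alpha) channel.

    The full image is kept, so the classifier still sees the context
    around the object; the mask only says where to look.
    """
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("image must have shape (H, W, 3)")
    if mask.shape != image.shape[:2]:
        raise ValueError("mask must have shape (H, W)")
    alpha = np.clip(mask.astype(np.float32), 0.0, 1.0)
    return np.concatenate([image.astype(np.float32), alpha[..., None]], axis=-1)

def hard_crop(image: np.ndarray, mask: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Baseline for comparison: crop to the mask's bounding box, discarding context."""
    ys, xs = np.nonzero(mask >= threshold)
    if ys.size == 0:
        return image  # nothing segmented, fall back to the full image
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

# Toy data (assumed): an 8x8 image with a 3x3 segmented region.
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
mask = np.zeros((8, 8))
mask[2:5, 3:6] = 0.9

rgba = attach_alpha_prior(image, mask)  # shape (8, 8, 4), context preserved
crop = hard_crop(image, mask)           # shape (3, 3, 3), context lost
```

The RGBA tensor keeps every pixel of the original image, which is the property the abstract credits with avoiding the full-image versus cropped-region mismatch.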

[50] Airway Skill Assessment with Spatiotemporal Attention Mechanisms Using Human Gaze

Jean-Paul Ainam,Rahul,Lora Cavuoto,Matthew Hackett,Jack Norfleet,Suvranu De

Main category: cs.CV

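TL;DR: The paper proposes a machine-learning approach for assessing airway management skill, specifically endotracheal intubation (ETI), from human gaze data and video recordings. A gaze-guided attention mechanism and visual masks focus the model on key regions, improving classification accuracy and efficiency.

Details Motivation: Airway management skills are critical in emergency medicine, yet conventional assessment is subjective and often fails to reflect competency in real-world scenarios. The study aims to provide an objective, efficient assessment tool built on human gaze data and machine learning.

Contribution: 1. The first method to use human gaze data for assessing ETI skill; 2. a model combining an attention mechanism with visual masks that improves classification performance; 3. a scalable automated tool for clinical skill assessment.

Method: 1. Visual masks generated from gaze data guide the model toward task-relevant regions; 2. an autoencoder extracts video features; 3. an attention module and a classifier produce the classification score.

Result: The method improves prediction accuracy, sensitivity, and trustworthiness over traditional methods; the authors highlight its suitability for high-stress settings such as military environments.

Insight: Human gaze data can effectively guide attention mechanisms and improve model performance on complex tasks. The approach could extend to other clinical skill assessments.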

Abstract: Airway management skills are critical in emergency medicine and are typically assessed through subjective evaluation, often failing to gauge competency in real-world scenarios. This paper proposes a machine learning-based approach for assessing airway skills, specifically endotracheal intubation (ETI), using human gaze data and video recordings. The proposed system leverages an attention mechanism guided by the human gaze to enhance the recognition of successful and unsuccessful ETI procedures. Visual masks were created from gaze points to guide the model in focusing on task-relevant areas, reducing irrelevant features. An autoencoder network extracts features from the videos, while an attention module generates attention from the visual masks, and a classifier outputs a classification score. This method, the first to use human gaze for ETI, demonstrates improved accuracy and efficiency over traditional methods. The integration of human gaze data not only enhances model performance but also offers a robust, objective assessment tool for clinical skills, particularly in high-stress environments such as military settings. The results show improvements in prediction accuracy, sensitivity, and trustworthiness, highlighting the potential for this approach to improve clinical training and patient outcomes in emergency medicine.
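
As an illustration of turning gaze points into a visual mask, here is a minimal NumPy sketch. The Gaussian footprint, the `sigma` value, and the function names are assumptions made for this example; the abstract states only that masks were created from gaze points.

```python
import numpy as np

def gaze_mask(gaze_points, height, width, sigma=10.0):
    """Soft mask in [0, 1]: the maximum over Gaussians centred on (x, y) gaze points."""
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=np.float32)
    for gx, gy in gaze_points:
        bump = np.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2.0 * sigma ** 2))
        mask = np.maximum(mask, bump.astype(np.float32))
    return mask

def apply_mask(frame, mask):
    """Down-weight pixels far from any gaze point (broadcast over channels)."""
    return frame * mask[..., None]

# Toy frame and two hypothetical gaze fixations given as (x, y) pixel coordinates.
frame = np.ones((64, 64, 3), dtype=np.float32)
mask = gaze_mask([(16, 16), (48, 40)], height=64, width=64, sigma=5.0)
masked = apply_mask(frame, mask)
```

Taking the maximum rather than the sum keeps the mask bounded by 1 even when fixations overlap, which is one simple design choice among several.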

[51] Capturing Fine-Grained Alignments Improves 3D Affordance Detection

Junsei Tokumitsu,Yuiga Wada

Main category: cs.CV

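TL;DR: The paper proposes LM-AD, a new method for affordance detection in 3D point clouds. It introduces the Affordance Query Module (AQM) to capture fine-grained alignment between point clouds and text, outperforming existing methods.

Details Motivation: Existing 3D affordance detection methods rely on simple cosine similarity between point cloud and text embeddings, which cannot capture fine-grained alignment and limits performance.

Contribution: 1. LM-AD, a method that improves 3D affordance detection. 2. The Affordance Query Module (AQM), which uses a pretrained language model to align point clouds and text at a fine-grained level.

Method: 1. A pretrained language model produces the text embeddings. 2. The AQM captures fine-grained alignment between point clouds and text through an attention mechanism.

Result: On the 3D AffordanceNet dataset, LM-AD outperforms existing methods in both accuracy and mean IoU.

Insight: Fine-grained alignment is key to better 3D affordance detection, and pretrained language models are an effective tool for achieving it.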

Abstract: In this work, we address the challenge of affordance detection in 3D point clouds, a task that requires effectively capturing fine-grained alignments between point clouds and text. Existing methods often struggle to model such alignments, resulting in limited performance on standard benchmarks. A key limitation of these approaches is their reliance on simple cosine similarity between point cloud and text embeddings, which lacks the expressiveness needed for fine-grained reasoning. To address this limitation, we propose LM-AD, a novel method for affordance detection in 3D point clouds. Moreover, we introduce the Affordance Query Module (AQM), which efficiently captures fine-grained alignment between point clouds and text by leveraging a pretrained language model. We demonstrated that our method outperformed existing approaches in terms of accuracy and mean Intersection over Union on the 3D AffordanceNet dataset.
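
For context, the cosine-similarity baseline that the abstract criticizes, together with the mean IoU metric it reports, can be sketched as follows. This is not LM-AD or the AQM; the feature values, the one-label-per-point simplification, and the function names are assumptions for illustration only.

```python
import numpy as np

def cosine_scores(point_feats, text_embeds, eps=1e-8):
    """Cosine similarity between N point features and K text embeddings, shape (N, K)."""
    p = point_feats / (np.linalg.norm(point_feats, axis=1, keepdims=True) + eps)
    t = text_embeds / (np.linalg.norm(text_embeds, axis=1, keepdims=True) + eps)
    return p @ t.T

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

# Toy 2-D features for four points and two affordance labels (assumed values).
point_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 2.0], [0.1, 1.0]])
text_embeds = np.array([[1.0, 0.0], [0.0, 1.0]])

# Each point takes the label whose text embedding is most similar.
pred = cosine_scores(point_feats, text_embeds).argmax(axis=1)
target = np.array([0, 0, 1, 0])
miou = mean_iou(pred, target, num_classes=2)
```

Each point is scored against each label independently through a single dot product, which is the limited expressiveness the abstract points to.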

[52] Progressive Modality Cooperation for Multi-Modality Domain Adaptation

Weichen Zhang,Dong Xu,Jing Zhang,Wanli Ouyang

Main category: cs.CV

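TL;DR: The paper proposes Progressive Modality Cooperation (PMC), a new framework for multi-modality domain adaptation (MMDA) and multi-modality domain adaptation using privileged information (MMDA-PI), exploiting multiple modalities to improve knowledge transfer.

Details Motivation: In multi-modality domain adaptation, a key challenge is how to exploit multi-modality data effectively and handle modalities that are missing in the target domain.

Contribution: 1. The PMC framework, which improves the use of multi-modality data through progressive modality cooperation; 2. PMC-PI for the missing-modality case, which adds a multi-modality data generation (MMG) network to generate the missing modalities.

Method: 1. PMC selects reliable pseudo-labeled target samples through two modules; 2. PMC-PI uses the MMG network to generate the modalities missing in the target domain, using adversarial learning and conditioning on weighted pseudo semantics to handle domain distribution mismatch and preserve semantics.

Result: Experiments on three image datasets and eight video datasets confirm the effectiveness of PMC and PMC-PI on cross-domain visual recognition tasks.

Insight: Progressive cooperation among modalities and a generation network can substantially improve domain adaptation, especially when modalities are missing.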

Abstract: In this work, we propose a new generic multi-modality domain adaptation framework called Progressive Modality Cooperation (PMC) to transfer the knowledge learned from the source domain to the target domain by exploiting multiple modality clues (e.g., RGB and depth) under the multi-modality domain adaptation (MMDA) and the more general multi-modality domain adaptation using privileged information (MMDA-PI) settings. Under the MMDA setting, the samples in both domains have all the modalities. In two newly proposed modules of our PMC, the multiple modalities cooperate to select reliable pseudo-labeled target samples, capturing the modality-specific information and modality-integrated information, respectively. Under the MMDA-PI setting, some modalities are missing in the target domain. Hence, to better exploit the multi-modality data in the source domain, we further propose the PMC with privileged information (PMC-PI) method by introducing a new multi-modality data generation (MMG) network. MMG generates the missing modalities in the target domain based on the source domain data by considering both domain distribution mismatch and semantics preservation, which are respectively achieved by using adversarial learning and conditioning on weighted pseudo semantics. Extensive experiments on three image datasets and eight video datasets for various multi-modality cross-domain visual recognition tasks under both MMDA and MMDA-PI settings clearly demonstrate the effectiveness of our proposed PMC framework.

[53] Continual Retinal Vision-Language Pre-training upon Incremental Imaging Modalities

Yuang Yao,Ruiqi Wu,Yi Zhou,Tao Zhou

Main category: cs.CV

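TL;DR: The paper proposes RetCoP, a continual vision-language pre-training framework that incrementally integrates image and text features from different fundus imaging modalities into a single unified pre-trained model, overcoming the limits of single-modality models.

Details Motivation: Traditional fundus image analysis models focus on single-modal tasks and ignore the complementarity between modalities, which limits their versatility. Most existing retinal foundation models remain modality-specific, so a method is needed that continually integrates multi-modality data as it arrives in dynamic environments.

Contribution: Proposes RetCoP, the first continual vision-language pre-training framework in the fundus domain, which builds a unified model by incrementally integrating image and text features, together with a rehearsal strategy and an off-diagonal information distillation method to mitigate catastrophic forgetting.

Method: RetCoP replays representative image-text pairs through a rehearsal strategy and uses off-diagonal information distillation to explicitly preserve the alignment between image and text representations.

Result: Experiments show RetCoP outperforms all compared methods, achieving the best generalization and lowest forgetting rate.

Insight: Incrementally integrating multi-modality data is a key challenge; RetCoP's results point to the importance of continual learning and representation alignment in dynamic environments.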

Abstract: Traditional fundus image analysis models focus on single-modal tasks, ignoring fundus modality complementarity, which limits their versatility. Recently, retinal foundation models have emerged, but most still remain modality-specific. Integrating multiple fundus imaging modalities into a single foundation model is valuable. However, in dynamic environments, data from different modalities often arrive incrementally, necessitating continual pre-training. To address this, we propose RetCoP, the first continual vision-language pre-training framework in the fundus domain, which incrementally integrates image and text features from different imaging modalities into a single unified foundation model. To mitigate catastrophic forgetting in continual pre-training, we introduce a rehearsal strategy utilizing representative image-text pairs and an off-diagonal information distillation approach. The former allows the model to revisit knowledge from previous stages, while the latter explicitly preserves the alignment between image and text representations. Experiments show that RetCoP outperforms all the compared methods, achieving the best generalization and lowest forgetting rate. The code can be found at https://github.com/Yuang-Yao/RetCoP.

[54] Memory-Augmented Incomplete Multimodal Survival Prediction via Cross-Slide and Gene-Attentive Hypergraph Learning

Mingcheng Qu,Guang Yang,Donglin Di,Yue Gao,Tonghua Su,Yang Song,Lei Fan

Main category: cs.CV

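TL;DR: The paper proposes a memory-augmented multimodal survival prediction framework based on hypergraph learning. It integrates multi-slide information and pathology-genomic interactions, addresses modality imbalance, and uses a memory mechanism to compensate for incomplete modalities.

Details Motivation: Existing methods mainly integrate FFPE slides with genomic data and neglect other preservation slides such as fresh frozen (FF) slides. High-resolution pathology data also tends to dominate cross-modality fusion, causing modality imbalance, and the requirement for complete modalities limits clinical applicability.

Contribution: A hypergraph learning approach that integrates multiple slides and cross-modality interactions, plus a memory mechanism that dynamically compensates for incomplete modalities, improving survival prediction.

Method: Hypergraph learning integrates multi-slide (multi-WSI) information and pathology-genomic interactions; a memory mechanism stores previously learned paired pathology-genomic features to compensate for missing modalities.

Result: On five TCGA datasets, the model exceeds advanced methods by over 2.3% in C-Index; under incomplete modalities it surpasses pathology-only models (by 3.3%) and gene-only models (by 7.9%).

Insight: Combining a memory mechanism with hypergraph learning addresses modality imbalance and incomplete data, improving the robustness and clinical applicability of multimodal survival prediction.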

Abstract: Multimodal pathology-genomic analysis is critical for cancer survival prediction. However, existing approaches predominantly integrate formalin-fixed paraffin-embedded (FFPE) slides with genomic data, while neglecting the availability of other preservation slides, such as Fresh Frozen (FF) slides. Moreover, as the high-resolution spatial nature of pathology data tends to dominate the cross-modality fusion process, it hinders effective multimodal fusion and leads to modality imbalance challenges between pathology and genomics. These methods also typically require complete data modalities, limiting their clinical applicability with incomplete modalities, such as missing either pathology or genomic data. In this paper, we propose a multimodal survival prediction framework that leverages hypergraph learning to effectively integrate multi-WSI information and cross-modality interactions between pathology slides and genomics data while addressing modality imbalance. In addition, we introduce a memory mechanism that stores previously learned paired pathology-genomic features and dynamically compensates for incomplete modalities. Experiments on five TCGA datasets demonstrate that our model outperforms advanced methods by over 2.3% in C-Index. Under incomplete modality scenarios, our approach surpasses pathology-only (3.3%) and gene-only models (7.9%). Code: https://github.com/MCPathology/M2Surv
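
The C-Index reported above measures how often a model ranks patients' risks in the same order as their survival times. A minimal pure-Python sketch of Harrell's concordance index follows; the toy cohort is invented for illustration, and the sketch ignores ties in survival time, which production implementations handle more carefully.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index.

    times  : observed time for each subject
    events : 1 if the event (death) was observed, 0 if censored
    risks  : predicted risk score (higher means earlier expected event)

    A pair (i, j) is comparable when subject i had an observed event
    before subject j's observed time. It is concordant when the model
    assigned i the higher risk; ties in risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort (assumed values): the third subject is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risks)
```

In this toy cohort five pairs are comparable and four are ordered correctly, so the index is 0.8; a value of 0.5 corresponds to random ranking.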

[55] Comparative Performance of Finetuned ImageNet Pre-trained Models for Electronic Component Classification

Yidi Shao,Longfei Zhou,Fangshuo Tang,Xinyi Shi,Dalang Chen,Shengtao Xia

Main category: cs.CV

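TL;DR: This paper compares twelve ImageNet pre-trained models on electronic component classification. MobileNet-V2 performs best (99.95%) and EfficientNet-B0 worst (92.26%), supporting the practical use of pre-trained models in electronics manufacturing.

Details Motivation: Electronic component classification is crucial in manufacturing, significantly reducing labor costs and promoting technological development. Pre-trained models, especially those trained on ImageNet, are highly effective for image classification and give good results even with limited data.

Contribution: A systematic performance comparison of twelve ImageNet pre-trained models on electronic component classification, providing a reference for practical applications.

Method: Twelve different ImageNet pre-trained models are fine-tuned and evaluated on the electronic component classification task.

Result: MobileNet-V2 achieves the highest accuracy at 99.95% and EfficientNet-B0 the lowest at 92.26%. All models deliver respectable accuracy, demonstrating the effectiveness of pre-trained models.

Insight: ImageNet pre-trained models reach high accuracy even on a specialized task such as electronic component classification, confirming their generalization ability and practicality.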

Abstract: Electronic component classification and detection are crucial in manufacturing industries, significantly reducing labor costs and promoting technological and industrial development. Pre-trained models, especially those trained on ImageNet, are highly effective in image classification, allowing researchers to achieve excellent results even with limited data. This paper compares the performance of twelve ImageNet pre-trained models in classifying electronic components. Our findings show that all models tested delivered respectable accuracies. MobileNet-V2 recorded the highest at 99.95%, while EfficientNet-B0 had the lowest at 92.26%. These results underscore the substantial benefits of using ImageNet pre-trained models in image classification tasks and confirm the practical applicability of these methods in the electronics manufacturing sector.

[56] Trajectory Prediction in Dynamic Object Tracking: A Critical Study

Zhongping Dong,Liming Chen,Mohand Tahar Kechadi

Main category: cs.CV

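TL;DR: The paper analyzes in detail recent advances, applications, and challenges in dynamic object tracking (DOT) and trajectory prediction (TP), covering a range of technical approaches and their effectiveness and limitations in real-world scenarios.

Details Motivation: DOT and TP are valuable in many domains (autonomous driving, surveillance, healthcare, industrial automation), but existing methods still face challenges in generalization, computational efficiency, and data dependency.

Contribution: A systematic evaluation of feature-based, segmentation-based, estimation-based, and learning-based methods, identifying their strengths and weaknesses and proposing future research directions such as multimodal data integration, semantic information fusion, and ethical frameworks.

Method: Through literature review and analysis, the paper summarizes the different technical paths for DOT and TP and the scenarios they suit.

Result: The study highlights the limitations of existing methods and room for improvement, particularly in generalization and computational efficiency.

Insight: Future work should focus on developing context-aware systems, combine multimodal data with semantic information, and attend to ethics and privacy protection.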

Abstract: This study provides a detailed analysis of current advancements in dynamic object tracking (DOT) and trajectory prediction (TP) methodologies, including their applications and challenges. It covers various approaches, such as feature-based, segmentation-based, estimation-based, and learning-based methods, evaluating their effectiveness, deployment, and limitations in real-world scenarios. The study highlights the significant impact of these technologies in automotive and autonomous vehicles, surveillance and security, healthcare, and industrial automation, contributing to safety and efficiency. Despite the progress, challenges remain, including the need for improved generalization, computational efficiency, and reduced data dependency, as well as ethical considerations. The study suggests future research directions to address these challenges, emphasizing the importance of multimodal data integration, semantic information fusion, and developing context-aware systems, along with ethical and privacy-preserving frameworks.

[57] Image Segmentation using Chan-Vese Active Contours

Pranav Shenoy K. P

Main category: cs.CV

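TL;DR: The paper derives and implements an image segmentation method based on the Chan-Vese active contour model and demonstrates its strong performance on noisy images and images with weak boundaries.

Details Motivation: To overcome the limitations of traditional gradient-based segmentation on noisy and weak-boundary images, the paper studies an active contour model driven by regional intensity differences.

Contribution: 1) A complete mathematical derivation of the Chan-Vese model; 2) a numerically stable level-set implementation; 3) experimental validation on complex segmentation tasks.

Method: 1) Derive the level set formulation from the Mumford-Shah variational framework; 2) implement it with finite differences and an upwind entropy scheme for numerical stability; 3) add curvature-based regularization.

Result: Experiments on medical and synthetic images show accurate segmentation, robustness to noise, and better performance than classical edge-based methods.

Insight: Because the Chan-Vese model evolves contours from regional intensity differences rather than gradients, it is well suited to complex segmentation tasks.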

Abstract: This paper presents a comprehensive derivation and implementation of the Chan-Vese active contour model for image segmentation. The model, derived from the Mumford-Shah variational framework, evolves contours based on regional intensity differences rather than image gradients, making it highly effective for segmenting noisy images or images with weak boundaries. We provide a rigorous mathematical derivation of the level set formulation, including detailed treatment of each energy term using the divergence theorem and curve evolution theory. The resulting algorithm is implemented in Python using finite difference methods with special care to numerical stability, including an upwind entropy scheme and curvature-based regularization. Experimental results on medical and synthetic images demonstrate accurate segmentation, robustness to noise, and superior performance compared to classical edge-based methods. This study confirms the suitability of the Chan-Vese model for complex segmentation tasks and highlights its potential for use in real-world imaging applications.
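
A simplified version of the level-set update can be sketched in a few lines of NumPy. This is not the paper's implementation: it uses central differences from `np.gradient` instead of the upwind entropy scheme mentioned in the abstract, and the parameter values (`mu`, `lam1`, `lam2`, `dt`, `eps`) and the circular initialization are assumptions chosen for a toy image.

```python
import numpy as np

def curvature(phi, tiny=1e-8):
    """Divergence of the unit normal, div(grad(phi) / |grad(phi)|)."""
    gy, gx = np.gradient(phi)              # derivatives along rows (y) and columns (x)
    norm = np.sqrt(gx ** 2 + gy ** 2) + tiny
    nyy, _ = np.gradient(gy / norm)        # d(n_y)/dy
    _, nxx = np.gradient(gx / norm)        # d(n_x)/dx
    return nxx + nyy

def chan_vese(image, iterations=300, mu=0.2, lam1=1.0, lam2=1.0, dt=0.5, eps=1.0):
    """Evolve a level set phi (phi > 0 inside) and return the final binary mask."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Initial contour: a circle at the image centre.
    phi = min(h, w) / 4.0 - np.sqrt((ys - h / 2.0) ** 2 + (xs - w / 2.0) ** 2)
    for _ in range(iterations):
        inside = phi > 0
        c1 = image[inside].mean() if inside.any() else 0.0        # mean inside
        c2 = image[~inside].mean() if (~inside).any() else 0.0    # mean outside
        delta = (eps / np.pi) / (eps ** 2 + phi ** 2)             # smoothed Dirac delta
        force = (mu * curvature(phi)
                 - lam1 * (image - c1) ** 2
                 + lam2 * (image - c2) ** 2)
        phi = phi + dt * delta * force
    return phi > 0

# Toy image (assumed): a bright rectangle on a dark background.
image = np.zeros((32, 32))
image[8:24, 10:22] = 1.0
truth = image > 0.5
segmentation = chan_vese(image)
```

The data terms depend only on the region means `c1` and `c2`, not on image gradients, which is why this family of models tolerates weak or noisy edges.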

[58] Training-Free Motion Customization for Distilled Video Generators with Adaptive Test-Time Distillation

Jintao Rong,Xin Xie,Xinyi Yu,Linlin Ou,Xinyu Zhang,Chunhua Shen,Dong Gong

Main category: cs.CV

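TL;DR: This paper proposes MotionEcho, a training-free motion customization method that uses adaptive test-time distillation to improve the motion fidelity and generation quality of distilled video generators.

Details Motivation: Distilled video generation models struggle with reference-video-guided motion customization in training-free settings; existing methods fail to generalize because of the accelerated generative process and large denoising steps in distilled models.

Contribution: The MotionEcho framework, which uses diffusion teacher forcing to guide the student model at test time through endpoint prediction and interpolation, and allocates computation dynamically.

Method: A high-quality but slow teacher model guides the inference of a fast student model through endpoint prediction and interpolation; computation is allocated dynamically across timesteps according to guidance needs.

Result: Experiments across various distilled video generation models and benchmark datasets show significant improvements in motion fidelity and generation quality while preserving efficiency.

Insight: Adaptive test-time distillation makes it possible to exploit a teacher model for high-quality motion customization without extra training.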

Abstract: Distilled video generation models offer fast and efficient synthesis but struggle with motion customization when guided by reference videos, especially under training-free settings. Existing training-free methods, originally designed for standard diffusion models, fail to generalize due to the accelerated generative process and large denoising steps in distilled models. To address this, we propose MotionEcho, a novel training-free test-time distillation framework that enables motion customization by leveraging diffusion teacher forcing. Our approach uses high-quality, slow teacher models to guide the inference of fast student models through endpoint prediction and interpolation. To maintain efficiency, we dynamically allocate computation across timesteps according to guidance needs. Extensive experiments across various distilled video generation models and benchmark datasets demonstrate that our method significantly improves motion fidelity and generation quality while preserving high efficiency. Project page: https://euminds.github.io/motionecho/

[59] Online camera-pose-free stereo endoscopic tissue deformation recovery with tissue-invariant vision-biomechanics consistency

Jiahe Chen,Naoki Tomii,Ichiro Sakuma,Etsuko Kobayashi

Main category: cs.CV

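TL;DR: The paper proposes an online, camera-pose-free method for recovering tissue deformation from stereo endoscopy. Using tissue-invariant vision-biomechanics consistency, it handles camera motion, occlusion, and large deformation, and achieves inter-frame alignment without estimating camera pose.

Details Motivation: Prior work struggles with camera motion, occlusion, large deformation, and the lack of tissue-specific biomechanical priors, and relies on offline processing. This paper aims for an online method that needs no camera pose estimation and stably recovers tissue geometry and deformation.

Contribution: 1. A new representation of tissue geometry and deformation (a 3D point and derivative map; a 3D displacement and local deformation map); 2. a camera-centric formulation that avoids camera pose estimation; 3. a canonical map that enables online optimization.

Method: Inter-frame alignment is achieved by optimizing inter-frame deformation, with 6 parameters describing the rigid motion of each surface point and 3 parameters its local deformation. A canonical map is used to optimize tissue geometry and deformation online. With depth and optical flow as inputs, the method remains stable even when tissue is partially occluded or moves out of the field of view.

Result: 3D reconstruction accuracy reaches 0.37 ± 0.27 mm in non-occluded areas and 0.39 ± 0.21 mm in occluded areas. The method can also estimate surface strain distribution for mechanical analysis.

Insight: The method shows robustness in complex surgical scenes and offers a new tool for surgical navigation and autonomous soft tissue manipulation.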

Abstract: Tissue deformation recovery based on stereo endoscopic images is crucial for tool-tissue interaction analysis and benefits surgical navigation and autonomous soft tissue manipulation. Previous research suffers from problems arising from camera motion, occlusion, large tissue deformation, lack of tissue-specific biomechanical priors, and reliance on offline processing. Unlike previous studies where the tissue geometry and deformation are represented by 3D points and displacements, the proposed method models tissue geometry as the 3D point and derivative map and tissue deformation as the 3D displacement and local deformation map. For a single surface point, 6 parameters are used to describe its rigid motion and 3 parameters for its local deformation. The method is formulated under the camera-centric setting, where all motions are regarded as the scene motion with respect to the camera. Inter-frame alignment is realized by optimizing the inter-frame deformation, making it unnecessary to estimate camera pose. The concept of the canonical map is introduced to optimize tissue geometry and deformation in an online approach. Quantitative and qualitative experiments were conducted using in vivo and ex vivo laparoscopic datasets. With the inputs of depth and optical flow, the method stably models tissue geometry and deformation even when the tissue is partially occluded or moving outside the field of view. Results show that the 3D reconstruction accuracy in the non-occluded and occluded areas reaches 0.37 ± 0.27 mm and 0.39 ± 0.21 mm in terms of surface distance, respectively. The method can also estimate surface strain distribution during various manipulations as an extra modality for mechanical-based analysis.

[60] Emergence of Text Readability in Vision Language Models

Jaeyoo Park,Sanghyuk Chun,Wonjae Kim,Sangdoo Yun,Bohyung Han

Main category: cs.CV

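TL;DR: The paper studies how the ability to recognize text within images (text readability) emerges during training of vision-language models (VLMs), and finds that it contrasts with the gradual development of semantic understanding.

Details Motivation: To examine how VLMs develop the ability to recognize text in images during training and how this differs from semantic understanding.

Contribution: Shows that text readability emerges abruptly during VLM training, and suggests that contrastive learning initially prioritizes general semantic understanding, with text processing developing later.

Method: Analyzes the training process of VLMs, tracking how text readability and semantic understanding develop.

Result: Text readability appears abruptly after substantial training, while the ability to match images with rendered text develops even more slowly, indicating a need for deeper semantic integration.

Insight: The findings suggest that tailored training strategies are needed to accelerate text comprehension in VLMs, pointing to directions for optimizing multimodal learning.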

Abstract: We investigate how the ability to recognize textual content within images emerges during the training of Vision-Language Models (VLMs). Our analysis reveals a critical phenomenon: the ability to read textual information in a given image (text readability) emerges abruptly after substantial training iterations, in contrast to semantic content understanding which develops gradually from the early stages of training. This delayed emergence may reflect how contrastive learning tends to initially prioritize general semantic understanding, with text-specific symbolic processing developing later. Interestingly, the ability to match images with rendered text develops even more slowly, indicating a deeper need for semantic integration. These findings highlight the need for tailored training strategies to accelerate robust text comprehension in VLMs, laying the groundwork for future research on optimizing multimodal learning.

[61] Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System

Lixuan He,Haoyu Dong,Zhenxing Chen,Yangcheng Yu,Jie Feng,Yong Li

Main category: cs.CV

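TL;DR: Mem4Nav is a hierarchical spatial-cognition long-short memory system designed to improve vision-and-language navigation (VLN) in complex urban environments. It combines a sparse octree with a semantic topology graph for efficient long- and short-term memory management, improving task completion and path planning.

Details Motivation: Existing VLN methods face two challenges in urban scenes: modular pipelines lack unified memory, and end-to-end approaches are constrained by fixed context windows and implicit spatial reasoning. Mem4Nav addresses both with a hierarchical memory system.

Contribution: Proposes Mem4Nav, a hierarchical memory system that can augment any VLN backbone, combining a sparse octree and a semantic topology graph for efficient long- and short-term memory management.

Method: Mem4Nav indexes fine-grained voxels with a sparse octree, connects high-level landmarks with a semantic topology graph, and embeds memory tokens via a reversible Transformer. Long-term memory compresses historical observations; short-term memory caches recent multimodal entries for real-time obstacle avoidance and planning. Dynamic context pruning and lossless reconstruction of past embeddings are the key mechanisms.

Result: On Touchdown and Map2Seq, Mem4Nav raises Task Completion by 7-13 percentage points, reduces SPD, and improves nDTW by more than 10 percentage points. Ablations confirm the importance of the hierarchical map and dual memory modules.

Insight: Mem4Nav's results indicate that hierarchical memory and explicit spatial reasoning are essential for urban navigation, offering a direction for future VLN system design.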

Abstract: Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end-to-end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce Mem4Nav, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone. Mem4Nav fuses a sparse octree for fine-grained voxel indexing with a semantic topology graph for high-level landmark connectivity, storing both in trainable memory tokens embedded via a reversible Transformer. Long-term memory (LTM) compresses and retains historical observations at both octree and graph nodes, while short-term memory (STM) caches recent multimodal entries in relative coordinates for real-time obstacle avoidance and local planning. At each step, STM retrieval sharply prunes dynamic context, and, when deeper history is needed, LTM tokens are decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and Map2Seq across three backbones (modular, state-of-the-art VLN with prompt-based LLM, and state-of-the-art VLN with strided-attention MLLM), Mem4Nav yields 7-13 pp gains in Task Completion, sufficient SPD reduction, and >10 pp nDTW improvement. Ablations confirm the indispensability of both the hierarchical map and dual memory modules. Our code is open-sourced at https://github.com/tsinghua-fib-lab/Mem4Nav.

[62] AMF-MedIT: An Efficient Align-Modulation-Fusion Framework for Medical Image-Tabular Data

Congjing Yu,Jing Ye,Yang Liu,Xiaodong Zhang,Zhiyong Zhang

Main category: cs.CV

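TL;DR: AMF-MedIT is an efficient framework for fusing medical image and tabular data. With its Adaptive Modulation and Fusion module and a new tabular encoder, FT-Mamba, it addresses dimension discrepancies and noise in image-tabular fusion and performs well under data-scarce conditions.

Details Motivation: Multimodal analysis of medical images and tabular data matters for clinical decisions, but differences in feature dimensions and noise in tabular data leave existing methods lacking in fusion quality and data efficiency. AMF-MedIT is proposed to address these problems.

Contribution: 1. The AMF module, which dynamically adjusts modality contributions and harmonizes dimension discrepancies; 2. the FT-Mamba tabular encoder, which handles noisy data efficiently; 3. the first interpretability analysis of how tabular encoders supervise the imaging modality during contrastive pretraining.

Method: 1. The AMF module incorporates prior knowledge through modulation objectives and a modality confidence ratio; 2. feature masks and density and leakage losses realize the modulation objectives; 3. FT-Mamba uses a selective mechanism to handle noisy tabular data; 4. interpretability studies are conducted.

Result: Experiments show AMF-MedIT achieves a superior balance between multimodal performance and data efficiency and adapts well to incomplete tabular data. FT-Mamba excels at extracting features and guiding image attention patterns.

Insight: 1. Combining modulation with prior knowledge is key to cross-modal fusion; 2. a selective mechanism handles noisy medical tabular data effectively; 3. interpretability analysis reveals the tabular modality's potential to supervise the imaging modality.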

Abstract: Multimodal medical analysis combining image and tabular data has gained increasing attention. However, effective fusion remains challenging due to cross-modal discrepancies in feature dimensions and modality contributions, as well as the noise from high-dimensional tabular inputs. To address these problems, we present AMF-MedIT, an efficient Align-Modulation-Fusion framework for medical image and tabular data integration, particularly under data-scarce conditions. To harmonize dimension discrepancies and dynamically adjust modality contributions, we propose the Adaptive Modulation and Fusion (AMF) module, a novel modulation-based fusion paradigm with a streamlined architecture. We first derive the modulation objectives and introduce a modality confidence ratio, enabling the incorporation of prior knowledge into the fusion process. Then, the feature masks, density and leakage losses are proposed to achieve the modulation objectives. Additionally, we introduce FT-Mamba, a powerful tabular encoder leveraging a selective mechanism to handle noisy medical tabular data efficiently. Furthermore, interpretability studies are conducted to explore how different tabular encoders supervise the imaging modality during contrastive pretraining for the first time. Extensive experiments demonstrate that AMF-MedIT achieves a superior balance between multimodal performance and data efficiency while showing strong adaptability to incomplete tabular data. Interpretability analysis also highlights FT-Mamba’s capabilities in extracting distinct tabular features and guiding the image encoder toward more accurate and flexible attention patterns.

[63] Sampling Matters in Explanations: Towards Trustworthy Attribution Analysis Building Block in Visual Models through Maximizing Explanation Certainty

Róisín Luo,James McDermott,Colm O’Riordan

Main category: cs.CV

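TL;DR: Through theoretical analysis and experiments, the paper shows that how well the sample distribution in gradient integration aligns with the natural image distribution sets a lower bound on explanation certainty, and proposes a semi-optimal sampling approach that suppresses input features, improving explanations of visual models.

Details Motivation: In existing image attribution analysis, gradient integration builds feature maps from noisy samples, whose distribution is poorly aligned with that of natural images, leading to low explanation certainty. The extra information added by noise can also saturate neural networks and degrade explanations.

Contribution: Establishes a theoretical bound tied to sample distribution alignment and, by suppressing features rather than adding noise, derives a semi-optimal sampling strategy that improves the trustworthiness of attribution analysis.

Method: Samples are generated by suppressing input features so that the sample distribution is closer to that of natural images, avoiding noise-induced saturation; gradient integration over these samples yields more trustworthy explanations.

Result: Experiments on ImageNet show the method outperforms existing baselines across all tested models and yields more satisfactory explanations.

Insight: The trustworthiness of attribution analysis is directly tied to sample distribution alignment; suppressing features rather than adding noise is a more effective sampling strategy that avoids saturation and improves explanation quality.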

Abstract: Image attribution analysis seeks to highlight the feature representations learned by visual models such that the highlighted feature maps reflect the pixel-wise importance of inputs. Gradient integration is a building block in attribution analysis: it integrates the gradients from multiple derived samples to highlight the semantic features relevant to inferences. This building block is often combined with other information from visual models, such as activation or attention maps, to form the final explanations. Yet our theoretical analysis demonstrates that the extent to which the sample distribution in gradient integration aligns with the natural image distribution gives a lower bound on explanation certainty. Prior works add noise to images to produce samples, and the resulting noise distributions can lead to low explanation certainty. Counter-intuitively, our experiment shows that extra information can saturate neural networks. Building trustworthy attribution analysis therefore requires settling the sample distribution misalignment problem. Instead of adding extra information to input images, we present a semi-optimal sampling approach that suppresses features from inputs. The sample distribution obtained by suppressing features is approximately identical to the distribution of natural images. Our extensive quantitative evaluation on the large-scale ImageNet dataset affirms that our approach is effective and yields more satisfactory explanations than state-of-the-art baselines across all experimental models.
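
The difference between the two sampling styles can be made concrete with a toy sketch. Everything here is an assumption for illustration: the model is a single `tanh` unit with an analytic gradient, "suppression" is implemented as random zeroing of input features, and the function names are hypothetical. The paper's actual semi-optimal sampler is not described in the abstract.

```python
import numpy as np

def model_grad(x, w):
    """Gradient of the toy model f(x) = tanh(w . x) with respect to x."""
    return (1.0 - np.tanh(w @ x) ** 2) * w

def integrate_gradients(x, w, sampler, n_samples=64, seed=0):
    """Average the model gradient over samples derived from the input x."""
    rng = np.random.default_rng(seed)
    grads = [model_grad(sampler(x, rng), w) for _ in range(n_samples)]
    return np.mean(grads, axis=0)

def add_noise(x, rng, sigma=0.5):
    """Noise-based sampling: adds information that was not in the input."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def suppress_features(x, rng, keep_prob=0.7):
    """Suppression-based sampling: only removes (zeroes) existing features."""
    keep = rng.random(x.shape) < keep_prob
    return x * keep

# Toy input and weights (assumed values).
x = np.array([1.0, -2.0, 0.5, 0.0])
w = np.array([0.8, -0.3, 0.0, 1.2])

attr_noise = integrate_gradients(x, w, add_noise)
attr_suppress = integrate_gradients(x, w, suppress_features)
```

Suppressed samples never leave the range spanned by the original input values, whereas noisy samples can, which is one way to picture the distribution misalignment the abstract describes.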

[64] Deblurring in the Wild: A Real-World Dataset from Smartphone High-Speed Videos

Mahdi Mohd Hossain Noki,Syed Mumtahin Mahmud,Prothito Shovon Majumder,Abdul Mohaimen Al Radi,Md. Haider Ali,Md. Mosaddek Khan

Main category: cs.CV

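TL;DR: The paper introduces a real-world image deblurring dataset built from smartphone slow-motion videos. It simulates long-exposure blur, contains over 42,000 high-resolution blur-sharp image pairs, and is about 10 times larger than widely used datasets.

Details Motivation: Existing deblurring datasets are small and limited in scene variety and do not reflect the complexity and diversity of real-world blur, so a richer and more challenging dataset is needed.

Contribution: Builds the largest real-world deblurring dataset to date, covering diverse indoor and outdoor scenes and varied motion patterns, and provides a new benchmark for evaluating deblurring models.

Method: Long-exposure blur is simulated from smartphone slow-motion video (240 frames per second): multiple frames are averaged to produce the blurry image, and the temporally centered frame serves as the sharp reference.

Result: Several SOTA deblurring models are benchmarked and show significant performance degradation, indicating that the dataset's complexity and diversity challenge existing models.

Insight: Real-world blur is more challenging; future deblurring models need stronger generalization and better adaptation to complex scenes.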

Abstract: We introduce the largest real-world image deblurring dataset constructed from smartphone slow-motion videos. Using 240 frames captured over one second, we simulate realistic long-exposure blur by averaging frames to produce blurry images, while using the temporally centered frame as the sharp reference. Our dataset contains over 42,000 high-resolution blur-sharp image pairs, making it approximately 10 times larger than widely used datasets, with 8 times as many distinct scenes, including indoor and outdoor environments, with varying object and camera motions. We benchmark multiple state-of-the-art (SOTA) deblurring models on our dataset and observe significant performance degradation, highlighting the complexity and diversity of our benchmark. Our dataset serves as a challenging new benchmark to facilitate the development of robust and generalizable deblurring models.
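
The pair-construction recipe in the abstract (average frames for the blurry image, keep the centre frame as the sharp reference) can be sketched as follows. The function name, the direct averaging of 8-bit pixel values, and the choice of index `T // 2` as the centre frame are assumptions; the abstract does not say whether averaging happens in linear or gamma-encoded intensity.

```python
import numpy as np

def synthesize_blur_pair(frames):
    """Build a (blurry, sharp) pair from a high-speed clip.

    frames : array of shape (T, H, W, C), dtype uint8
    blurry : per-pixel mean over the T frames, simulating a long exposure
    sharp  : the temporally centred frame (index T // 2, an assumed convention)
    """
    frames = np.asarray(frames)
    if frames.ndim != 4 or frames.shape[0] == 0:
        raise ValueError("frames must have shape (T, H, W, C) with T > 0")
    mean = frames.astype(np.float64).mean(axis=0)
    blurry = np.clip(np.rint(mean), 0, 255).astype(np.uint8)
    sharp = frames[frames.shape[0] // 2]
    return blurry, sharp

# Toy clip (assumed): a bright vertical bar moving one pixel per frame.
clip = np.zeros((5, 4, 8, 1), dtype=np.uint8)
for t in range(5):
    clip[t, :, t + 1, :] = 255

blurry, sharp = synthesize_blur_pair(clip)
```

In the toy clip the bar is smeared across five columns at one fifth of its brightness in `blurry`, while `sharp` shows it at a single column.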

[65] Stylized Structural Patterns for Improved Neural Network Pre-training

Farnood Salehi,Vandit Sharma,Amirhossein Askari Farsangi,Tunç Ozan Aydın

Main category: cs.CV

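TL;DR: The paper proposes an improved synthetic data generation method that uses a neural fractal formulation and reverse stylization to improve pre-training on synthetic data and narrow the distributional gap to real data.

Details Motivation: Modern computer vision models depend on large sets of real images, but collecting real data raises privacy and legal concerns. Models trained on existing synthetic data underperform, so a more effective way of generating synthetic data is needed.

Contribution: 1. An improved neural fractal formulation for generating synthetic data; 2. reverse stylization, which transfers visual features from real images onto synthetic data; 3. experiments showing the synthetic data outperforms existing synthetic datasets across several tasks.

Method: 1. Generate base synthetic data with the improved neural fractal formulation; 2. use reverse stylization to transfer visual features from a small set of real images onto the synthetic data. The distributional gap is measured with KID.

Result: 1. FID drops by 11% for the EDM2 diffusion model; 2. autoencoder reconstruction error falls by 20%; 3. a ViT-S classifier gains over 10% accuracy on ImageNet-100.

Insight: Reverse stylization effectively narrows the gap between synthetic and real data, offering a practical solution when real data is scarce.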

Abstract: Modern deep learning models in computer vision require large datasets of real images, which are difficult to curate and pose privacy and legal concerns, limiting their commercial use. Recent works suggest synthetic data as an alternative, yet models trained with it often underperform. This paper proposes a two-step approach to bridge this gap. First, we propose an improved neural fractal formulation through which we introduce a new class of synthetic data. Second, we propose reverse stylization, a technique that transfers visual features from a small, license-free set of real images onto synthetic datasets, enhancing their effectiveness. We analyze the domain gap between our synthetic datasets and real images using Kernel Inception Distance (KID) and show that our method achieves a significantly lower distributional gap compared to existing synthetic datasets. Furthermore, our experiments across different tasks demonstrate the practical impact of this reduced gap. We show that pretraining the EDM2 diffusion model on our synthetic dataset leads to an 11% reduction in FID during image generation, compared to models trained on existing synthetic datasets, and a 20% decrease in autoencoder reconstruction error, indicating improved performance in data representation. Furthermore, a ViT-S model trained for classification on this synthetic data achieves over a 10% improvement in ImageNet-100 accuracy. Our work opens up exciting possibilities for training practical models when sufficiently large real training sets are not available.

[66] Surgery-R1: Advancing Surgical-VQLA with Reasoning Multimodal Large Language Model via Reinforcement Learning

Pengfei Hao,Shuaibo Li,Hongqiu Wang,Zhizhuo Kou,Junhang Zhang,Guang Yang,Lei Zhu

Main category: cs.CV

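TL;DR: The paper proposes Surgery-R1, a reasoning multimodal large language model (MLLM) for surgical Visual Question Localized-Answering (Surgical-VQLA). Combining supervised and reinforcement fine-tuning improves the model's reasoning ability and interpretability in surgical scenes.

Details Motivation: Existing Surgical-VQLA models lack deep reasoning and interpretability, which limits their reliability in clinical applications.

Contribution: 1. The Surgery-R1-54k dataset, with paired Visual-QA, Grounding-QA, and Chain-of-Thought (CoT) data. 2. Surgery-R1, the first reasoning MLLM for Surgical-VQLA, trained with a two-stage fine-tuning mechanism (SFT and RFT). 3. A Multimodal Coherence reward mechanism that mitigates the positional illusions that can arise in surgical scenes.

Method: 1. Build the Surgery-R1-54k dataset. 2. Apply two-stage fine-tuning, supervised fine-tuning (SFT) followed by reinforcement fine-tuning (RFT), to give the MLLM complex reasoning abilities. 3. Add the Multimodal Coherence reward to the rule-based reward system used in RFT.

Result: Experiments show that Surgery-R1 outperforms existing state-of-the-art (SOTA) models and widely used MLLMs on the Surgical-VQLA task, validating its reasoning capabilities and the effectiveness of the approach.

Insight: Reinforcement learning with a Multimodal Coherence reward can markedly improve reasoning and interpretability in complex surgical scenes, giving a more reliable basis for clinical use.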

Abstract: In recent years, significant progress has been made in the field of surgical scene understanding, particularly in the task of Visual Question Localized-Answering in robotic surgery (Surgical-VQLA). However, existing Surgical-VQLA models lack deep reasoning capabilities and interpretability in surgical scenes, which limits their reliability and potential for development in clinical applications. To address this issue, inspired by the development of Reasoning Multimodal Large Language Models (MLLMs), we first build the Surgery-R1-54k dataset, including paired data for Visual-QA, Grounding-QA, and Chain-of-Thought (CoT). Then, we propose the first Reasoning MLLM for Surgical-VQLA (Surgery-R1). In Surgery-R1, we design a two-stage fine-tuning mechanism that equips the base MLLM with complex reasoning abilities through supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Furthermore, for an efficient and high-quality rule-based reward system in our RFT, we design a Multimodal Coherence reward mechanism to mitigate positional illusions that may arise in surgical scenarios. Experimental results demonstrate that Surgery-R1 outperforms existing state-of-the-art (SOTA) models and widely used MLLMs on the Surgical-VQLA task, while also validating its reasoning capabilities and the effectiveness of our approach. The code and dataset will be made available at https://github.com/FiFi-HAO467/Surgery-R1.

[67] HMSViT: A Hierarchical Masked Self-Supervised Vision Transformer for Corneal Nerve Segmentation and Diabetic Neuropathy Diagnosis

Xin Zhang,Liangxiu Han,Yue Shi,Yanlin Zheng,Alam Uazman,Maryam Ferdousi,Rayaz Malik

Main category: cs.CV

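TL;DR: HMSViT is a novel hierarchical masked self-supervised vision transformer for corneal nerve segmentation and diabetic neuropathy diagnosis. With efficient multi-scale feature extraction and reduced reliance on labelled data, it outperforms existing methods in accuracy at lower computational cost.

Details Motivation: Early diagnosis of diabetic peripheral neuropathy (DPN) is essential, but existing automated methods suffer from inefficient feature extraction, reliance on handcrafted priors, and limited data. HMSViT aims to address these problems and deliver efficient, robust diagnosis.

Contribution: 1. The Hierarchical Masked Self-Supervised Vision Transformer (HMSViT); 2. Pooling-based hierarchical and dual attention mechanisms; 3. A block-masked self-supervised learning framework that reduces reliance on labelled data; 4. Experiments showing better segmentation and diagnosis than existing methods.

Method: 1. Pooling-based hierarchical and dual attention mechanisms extract multi-scale features efficiently; 2. Block-masked self-supervised learning improves feature robustness; 3. A multi-scale decoder fuses hierarchical features for segmentation and classification.

Result: On clinical CCM datasets, HMSViT reaches 61.34% mIoU for nerve segmentation and 70.40% diagnostic accuracy, outperforming models such as the Swin Transformer and HiViT while using fewer parameters.

Insight: 1. Combining a hierarchical design with attention improves performance substantially; 2. Self-supervised learning is essential for data-scarce tasks; 3. HMSViT has potential for real-world clinical deployment.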

Abstract: Diabetic Peripheral Neuropathy (DPN) affects nearly half of diabetes patients, requiring early detection. Corneal Confocal Microscopy (CCM) enables non-invasive diagnosis, but automated methods suffer from inefficient feature extraction, reliance on handcrafted priors, and data limitations. We propose HMSViT, a novel Hierarchical Masked Self-Supervised Vision Transformer designed for corneal nerve segmentation and DPN diagnosis. Unlike existing methods, HMSViT employs pooling-based hierarchical and dual attention mechanisms with absolute positional encoding, enabling efficient multi-scale feature extraction by capturing fine-grained local details in early layers and integrating global context in deeper layers, all at a lower computational cost. A block-masked self-supervised learning framework designed for HMSViT reduces reliance on labelled data and enhances feature robustness, while a multi-scale decoder fuses hierarchical features for segmentation and classification. Experiments on clinical CCM datasets show that HMSViT achieves state-of-the-art performance, with 61.34% mIoU for nerve segmentation and 70.40% diagnostic accuracy, outperforming leading hierarchical models like the Swin Transformer and HiViT by margins of up to 6.39% in segmentation accuracy while using fewer parameters. Detailed ablation studies further reveal that integrating block-masked SSL with hierarchical multi-scale feature extraction substantially enhances performance compared to conventional supervised training. Overall, these comprehensive experiments confirm that HMSViT delivers excellent, robust, and clinically viable results, demonstrating its potential for scalable deployment in real-world diagnostic applications.

[68] SceneCrafter: Controllable Multi-View Driving Scene Editing

Zehao Zhu,Yuliang Zou,Chiyu Max Jiang,Bo Sun,Vincent Casser,Xiukun Huang,Jiahao Wang,Zhenpei Yang,Ruiqi Gao,Leonidas Guibas,Mingxing Tan,Dragomir Anguelov

Main category: cs.CV

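TL;DR: SceneCrafter is a controllable multi-view driving scene editing model. By addressing 3D consistency, learning of empty-street priors, and paired image generation, it achieves high-quality editing of autonomous driving scenes.

Details Motivation: Realistic driving simulation is crucial for developing autonomous driving systems, but purely synthetic scenes are not grounded in reality and struggle to inspire confidence. Editing models can build on real driving data, but they face challenges in cross-camera 3D consistency, learning empty-street priors, and obtaining paired images.

Contribution: SceneCrafter, which supports 3D-consistent editing of scenes captured by multiple cameras, together with a framework for generating paired data and an alpha-blending method based on empty-street priors, enabling highly realistic and controllable edits.

Method: Built on multi-view diffusion models. A Prompt-to-Prompt-based framework generates paired data for global edits and an alpha-blending framework handles local edits; empty-street priors are learned through masked training and multi-view repaint.

Result: SceneCrafter outperforms existing baselines in realism, controllability, 3D consistency, and editing quality.

Insight: Combining generative models with editing techniques makes it possible to produce high-quality simulated scenes from real data, providing a more reliable test environment for autonomous driving development.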

Abstract: Simulation is crucial for developing and evaluating autonomous vehicle (AV) systems. Recent literature builds on a new generation of generative models to synthesize highly realistic images for full-stack simulation. However, purely synthetically generated scenes are not grounded in reality and have difficulty inspiring confidence in the relevance of their outcomes. Editing models, on the other hand, leverage source scenes from real driving logs, and enable the simulation of different traffic layouts, behaviors, and operating conditions such as weather and time of day. While image editing is an established topic in computer vision, it presents fresh sets of challenges in driving simulation: (1) the need for cross-camera 3D consistency, (2) learning "empty street" priors from driving data with foreground occlusions, and (3) obtaining paired image tuples of varied editing conditions while preserving consistent layout and geometry. To address these challenges, we propose SceneCrafter, a versatile editor for realistic 3D-consistent manipulation of driving scenes captured from multiple cameras. We build on recent advancements in multi-view diffusion models, using a fully controllable framework that scales seamlessly to multi-modality conditions like weather, time of day, agent boxes and high-definition maps. To generate paired data for supervising the editing model, we propose a novel framework on top of Prompt-to-Prompt to generate geometrically consistent synthetic paired data with global edits. We also introduce an alpha-blending framework to synthesize data with local edits, leveraging a model trained on empty street priors through novel masked training and multi-view repaint paradigm. SceneCrafter demonstrates powerful editing capabilities and achieves state-of-the-art realism, controllability, 3D consistency, and scene editing quality compared to existing baselines.
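The abstract mentions alpha blending for synthesizing local edits but gives no formula. As a generic illustration only, the standard per-pixel compositing rule is `out = alpha * foreground + (1 - alpha) * background`. The sketch below applies it to small grayscale grids; the function name `alpha_blend` and the list-of-lists image format are assumptions, and this is not SceneCrafter's pipeline.

```python
def alpha_blend(foreground, background, alpha):
    """Per-pixel compositing: alpha = 1 keeps the foreground, alpha = 0 the background."""
    return [
        [a * f + (1.0 - a) * b for f, b, a in zip(f_row, b_row, a_row)]
        for f_row, b_row, a_row in zip(foreground, background, alpha)
    ]


# Hypothetical example: paste an inserted object (value 10) onto an
# "empty street" render (value 20) with a soft mask.
street = [[20.0, 20.0, 20.0]]
inserted = [[10.0, 10.0, 10.0]]
mask = [[0.0, 0.5, 1.0]]
blended = alpha_blend(inserted, street, mask)  # [[20.0, 15.0, 10.0]]
```

With a soft mask, edge pixels mix both sources, which avoids hard seams around the edited region.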

[69] Visual hallucination detection in large vision-language models via evidential conflict

Tao Huang,Zhekun Liu,Rui Wang,Yang Zhang,Liping Jing

Main category: cs.CV

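TL;DR: The paper proposes a Dempster-Shafer theory (DST)-based method for detecting visual hallucination in large vision-language models (LVLMs) and develops the PRE-HAL dataset to systematically evaluate perception and reasoning capabilities.

Details Motivation: Despite their strong multimodal capabilities, LVLMs often produce textual outputs that are inconsistent with the visual input (visual hallucination), which poses substantial risks in safety-critical AI applications. Existing benchmarks mainly assess perception and overlook hallucinations arising from advanced reasoning.

Contribution: 1. The PRE-HAL dataset, which supports systematic evaluation of LVLM perception and reasoning; 2. The first DST-based visual hallucination detection method (to the best of the authors' knowledge), which uses uncertainty estimation to capture the degree of conflict in high-level features.

Method: Simple mass functions from DST reduce the computational complexity of evidence combination on power sets, and uncertainty estimation captures conflict in high-level features at inference time. Experiments cover LLaVA-v1.5, mPLUG-Owl2, and mPLUG-Owl3.

Result: On PRE-HAL, the method outperforms five baseline uncertainty metrics, with average AUROC improvements of 4%, 10%, and 7% across the three LVLMs.

Insight: Visual hallucination stems not only from perception failures but also from advanced reasoning. DST is well suited to capturing conflict and uncertainty, offering a new direction for LVLM reliability evaluation.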

Abstract: Despite the remarkable multimodal capabilities of Large Vision-Language Models (LVLMs), discrepancies often occur between visual inputs and textual outputs–a phenomenon we term visual hallucination. This critical reliability gap poses substantial risks in safety-critical Artificial Intelligence (AI) applications, necessitating a comprehensive evaluation benchmark and effective detection methods. Firstly, we observe that existing visual-centric hallucination benchmarks mainly assess LVLMs from a perception perspective, overlooking hallucinations arising from advanced reasoning capabilities. We develop the Perception-Reasoning Evaluation Hallucination (PRE-HAL) dataset, which enables the systematic evaluation of both perception and reasoning capabilities of LVLMs across multiple visual semantics, such as instances, scenes, and relations. Comprehensive evaluation with this new benchmark exposed more visual vulnerabilities, particularly in the more challenging task of relation reasoning. To address this issue, we propose, to the best of our knowledge, the first Dempster-Shafer theory (DST)-based visual hallucination detection method for LVLMs through uncertainty estimation. This method aims to efficiently capture the degree of conflict in high-level features at the model inference phase. Specifically, our approach employs simple mass functions to mitigate the computational complexity of evidence combination on power sets. We conduct an extensive evaluation of state-of-the-art LVLMs, LLaVA-v1.5, mPLUG-Owl2 and mPLUG-Owl3, with the new PRE-HAL benchmark. Experimental results indicate that our method outperforms five baseline uncertainty metrics, achieving average AUROC improvements of 4%, 10%, and 7% across three LVLMs. Our code is available at https://github.com/HT86159/Evidential-Conflict.
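To make the notion of evidential conflict concrete, here is a minimal sketch of Dempster's rule of combination applied to two simple mass functions (each puts mass on one focal set and the remainder on the whole frame). The two-element frame, the labels, and the function name `dempster_combine` are illustrative assumptions; how the paper derives mass functions from LVLM features is not shown here.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions given as {frozenset: mass}; return (fused, conflict)."""
    fused = {}
    conflict = 0.0
    for set_a, mass_a in m1.items():
        for set_b, mass_b in m2.items():
            common = set_a & set_b
            if common:
                fused[common] = fused.get(common, 0.0) + mass_a * mass_b
            else:
                # Mass assigned to disjoint hypotheses measures the conflict K.
                conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    # Dempster's rule renormalises the remaining mass by 1 - K.
    return {s: v / (1.0 - conflict) for s, v in fused.items()}, conflict


# Hypothetical evidence: one source supports "faithful", the other "hallucinated".
frame = frozenset({"faithful", "hallucinated"})
m_visual = {frozenset({"faithful"}): 0.8, frame: 0.2}
m_text = {frozenset({"hallucinated"}): 0.6, frame: 0.4}
fused_mass, conflict_k = dempster_combine(m_visual, m_text)
```

Here the conflict is the product of the two opposing masses (0.8 × 0.6). A high conflict value is the kind of signal an uncertainty-based detector could threshold.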

[70] ReMAR-DS: Recalibrated Feature Learning for Metal Artifact Reduction and CT Domain Transformation

Mubashara Rehman,Niki Martinel,Michele Avanzo,Riccardo Spizzo,Christian Micheloni

Main category: cs.CV

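TL;DR: ReMAR-DS is a deep learning framework that uses feature recalibration to reduce metal artifacts and perform kVCT-to-MVCT domain transformation, improving radiotherapy planning.

Details Motivation: Metal artifacts in kVCT imaging degrade image quality and affect clinical decisions. Traditional methods fail to reduce artifacts effectively while preserving anatomical structures.

Contribution: An encoder-decoder architecture with feature recalibration that focuses on artifact regions and key features, producing high-quality MVCT-like reconstructions.

Method: An encoder-decoder architecture in which recalibrated features from the encoder block steer the model toward relevant spatial regions and channel-wise features.

Result: Produces high-quality MVCT-like reconstructions, validated through qualitative and quantitative evaluations, and reduces the need for MVCT scans in radiotherapy planning.

Insight: Feature recalibration helps the model focus on key regions and channels, improving reconstruction quality and giving clinicians a more reliable basis for decisions.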

Abstract: Artifacts in kilo-Voltage CT (kVCT) imaging degrade image quality, impacting clinical decisions. We propose a deep learning framework for metal artifact reduction (MAR) and domain transformation from kVCT to Mega-Voltage CT (MVCT). The proposed framework, ReMAR-DS, utilizes an encoder-decoder architecture with enhanced feature recalibration, effectively reducing artifacts while preserving anatomical structures. This ensures that only relevant information is utilized in the reconstruction process. By infusing recalibrated features from the encoder block, the model focuses on relevant spatial regions (e.g., areas with artifacts) and highlights key features across channels (e.g., anatomical structures), leading to improved reconstruction of artifact-corrupted regions. Unlike traditional MAR methods, our approach bridges the gap between high-resolution kVCT and artifact-resistant MVCT, enhancing radiotherapy planning. It produces high-quality MVCT-like reconstructions, validated through qualitative and quantitative evaluations. Clinically, this enables oncologists to rely on kVCT alone, reducing repeated high-dose MVCT scans and lowering radiation exposure for cancer patients.

[71] Identifying Physically Realizable Triggers for Backdoored Face Recognition Networks

Ankita Raj,Ambar Pal,Chetan Arora

Main category: cs.CV

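TL;DR: The paper proposes a new method for detecting whether a face recognition (FR) network contains a natural, physically realizable backdoor trigger, and for identifying that trigger.

Details Motivation: Backdoor attacks embed hidden functionality that makes a deep neural network behave anomalously when a specific input trigger is present, posing a serious threat to FR systems in high-security applications.

Contribution: A technique that can detect and identify physically realizable triggers in FR networks.

Method: A novel technique that (1) detects whether an FR network is compromised with a natural trigger and (2) identifies the trigger given a compromised network.

Result: The method identifies the trigger (e.g., green sunglasses or a red hat) with a top-5 accuracy of 74%, compared with 56% for a naive brute-force baseline.

Insight: The study highlights the threat that physically realizable triggers pose to FR systems and offers an effective defense.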

Abstract: Backdoor attacks embed a hidden functionality into deep neural networks, causing the network to display anomalous behavior when activated by a predetermined pattern in the input (the trigger), while behaving well otherwise on public test data. Recent works have shown that backdoored face recognition (FR) systems can respond to natural-looking triggers like a particular pair of sunglasses. Such attacks pose a serious threat to the applicability of FR systems in high-security applications. We propose a novel technique to (1) detect whether an FR network is compromised with a natural, physically realizable trigger, and (2) identify such triggers given a compromised network. We demonstrate the effectiveness of our methods with a compromised FR network, where we are able to identify the trigger (e.g., green sunglasses or red hat) with a top-5 accuracy of 74%, whereas a naive brute force baseline achieves 56% accuracy.

[72] General Methods Make Great Domain-specific Foundation Models: A Case-study on Fetal Ultrasound

Jakob Ambsdorf,Asbjørn Munk,Sebastian Llambias,Anders Nymark Christensen,Kamil Mikolaj,Randall Balestriero,Martin Tolsgaard,Aasa Feragen,Mads Nielsen

Main category: cs.CV

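TL;DR: The paper asks whether a custom foundation model should be trained for a specific medical domain (here, fetal ultrasound) or whether transfer learning from a generalist model suffices. Experiments on a large fetal ultrasound dataset show that custom pretraining is worthwhile and needs no methodological novelty: well-established computer vision methods achieve state-of-the-art performance.

Details Motivation: With access to large-scale, unlabeled medical data, researchers must decide between pretraining a custom foundation model and transfer learning from a generalist model, and whether novel methods are needed.

Contribution: The study shows that (1) custom pretraining outperforms generalist models on fetal ultrasound, even with smaller models and less data; (2) well-established computer vision methods such as DINOv2 can be applied to a medical domain with no hyperparameter tuning and little methodological adaptation.

Method: Pretrain with DINOv2 on a large regional fetal ultrasound dataset (2M images) and compare against models pretrained on natural images, models pretrained on ultrasound images, and supervised baselines.

Result: State-of-the-art results on three fetal ultrasound datasets, covering classification, segmentation, and few-shot tasks.

Insight: When developing domain-specific foundation models, there is no need to chase methodological innovation; established techniques suffice, particularly under limited computational resources.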

Abstract: With access to large-scale, unlabeled medical datasets, researchers are confronted with two questions: Should they attempt to pretrain a custom foundation model on this medical data, or use transfer-learning from an existing generalist model? And, if a custom model is pretrained, are novel methods required? In this paper we explore these questions by conducting a case-study, in which we train a foundation model on a large regional fetal ultrasound dataset of 2M images. By selecting the well-established DINOv2 method for pretraining, we achieve state-of-the-art results on three fetal ultrasound datasets, covering data from different countries, classification, segmentation, and few-shot tasks. We compare against a series of models pretrained on natural images, ultrasound images, and supervised baselines. Our results demonstrate two key insights: (i) Pretraining on custom data is worth it, even if smaller models are trained on less data, as scaling in natural image pretraining does not translate to ultrasound performance. (ii) Well-tuned methods from computer vision are making it feasible to train custom foundation models for a given medical domain, requiring no hyperparameter tuning and little methodological adaptation. Given these findings, we argue that a bias towards methodological innovation should be avoided when developing domain specific foundation models under common computational resource constraints.

[73] Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation

Yuanhe Tian,Lei Mao,Yan Song

Main category: cs.CV

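TL;DR: Proposes a large language model (LLM)-based CT report generation method that uses recurrent visual feature extraction and stereo attentions to model the strong correlation among consecutive CT slices and generate more accurate reports.

Details Motivation: Existing methods do not explicitly model the transformations among CT slices, nor do they effectively integrate multi-level image features, particularly those containing specific organ lesions. This work exploits the strong correlation among CT slices to improve report accuracy.

Contribution: 1. A recurrent visual feature extraction method that uses a vision Transformer to process the CT volume slice by slice. 2. Stereo attentions that select important visual information from multiple perspectives and align it with textual features. 3. State-of-the-art results on the M3D-Cap dataset.

Method: A vision Transformer recurrently processes each CT slice; stereo attentions then attend over the encoded slices from different perspectives, selecting key visual information to instruct the LLM during report generation.

Result: Outperforms strong baseline models on M3D-Cap and achieves state-of-the-art results.

Insight: Explicitly modeling the transformations among CT slices and multi-level features, and using attention to align visual and textual information, improves the quality of generated CT reports.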

Abstract: Generating reports for computed tomography (CT) images is a challenging task that, while similar to existing work on medical image report generation, has its own characteristics, such as spatial encoding of multiple images and alignment between image volume and texts. Existing solutions typically use general 2D or 3D image processing techniques to extract features from a CT volume, where they first compress the volume and then divide the compressed CT slices into patches for visual encoding. These approaches do not explicitly account for the transformations among CT slices, nor do they effectively integrate multi-level image features, particularly those containing specific organ lesions, to instruct CT report generation (CTRG). Considering the strong correlation among consecutive slices in CT scans, in this paper we propose a large language model (LLM) based CTRG method with recurrent visual feature extraction and stereo attentions for hierarchical feature modeling. Specifically, we use a vision Transformer to recurrently process each slice in a CT volume, and employ a set of attentions over the encoded slices from different perspectives to selectively obtain important visual information and align it with textual features, so as to better instruct an LLM for CTRG. Experimental results and further analysis on the benchmark M3D-Cap dataset show that our method outperforms strong baseline models and achieves state-of-the-art results, demonstrating its validity and effectiveness.

[74] MambaOutRS: A Hybrid CNN-Fourier Architecture for Remote Sensing Image Classification

Minjong Cheon,Changbae Mun

Main category: cs.CV

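TL;DR: MambaOutRS is a hybrid architecture combining CNNs with Fourier-domain filtering for remote sensing image classification. Stacked Gated CNN blocks and a Fourier Filter Gate module capture global context, and the model significantly outperforms existing methods.

Details Motivation: State space models (SSMs) such as Mamba perform well on vision tasks, but adapting them to 2D visual data often requires complex modifications that may reduce efficiency. The paper proposes a more efficient alternative.

Contribution: MambaOutRS, a hybrid architecture combining Gated CNN blocks with a Fourier Filter Gate (FFG), showing that convolution plus frequency-domain operations can efficiently replace complex SSMs.

Method: Stacked Gated CNN blocks extract local features, an FFG module captures global information in the frequency domain, and the network follows a four-stage hierarchical design.

Result: State-of-the-art performance on several remote sensing datasets; MambaOutRS-t (24.0M parameters) reaches F1-scores of 98.41% on UC Merced and 95.99% on AID.

Insight: Combining convolution with frequency-domain operations can effectively replace complex SSMs, offering an efficient paradigm for vision tasks where computational efficiency matters.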

Abstract: Recent advances in deep learning for vision tasks have seen the rise of State Space Models (SSMs) like Mamba, celebrated for their linear scalability. However, their adaptation to 2D visual data often necessitates complex modifications that may diminish efficiency. In this paper, we introduce MambaOutRS, a novel hybrid convolutional architecture for remote sensing image classification that re-evaluates the necessity of recurrent SSMs. MambaOutRS builds upon stacked Gated CNN blocks for local feature extraction and introduces a novel Fourier Filter Gate (FFG) module that operates in the frequency domain to capture global contextual information efficiently. Our architecture employs a four-stage hierarchical design and was extensively evaluated on challenging remote sensing datasets: UC Merced, AID, NWPU-RESISC45, and EuroSAT. MambaOutRS consistently achieved state-of-the-art (SOTA) performance across these benchmarks. Notably, our MambaOutRS-t variant (24.0M parameters) attained the highest F1-scores of 98.41% on UC Merced and 95.99% on AID, significantly outperforming existing baselines, including larger transformer models and Mamba-based architectures, despite using considerably fewer parameters. An ablation study conclusively demonstrates the critical role of the Fourier Filter Gate in enhancing the model’s ability to capture global spatial patterns, leading to robust and accurate classification. These results strongly suggest that the complexities of recurrent SSMs can be effectively superseded by a judicious combination of gated convolutions for spatial mixing and frequency-based gates for spectral global context. Thus, MambaOutRS provides a compelling and efficient paradigm for developing high-performance deep learning models in remote sensing and other vision domains, particularly where computational efficiency is paramount.
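The abstract does not specify the internals of the Fourier Filter Gate, so the following is only one plausible reading, sketched in 1-D with a naive DFT from the standard library: transform the signal, scale each frequency by a weight (which would be learned in a real model), transform back, and use the result as a sigmoid gate on the input. The names `dft` and `fourier_filter_gate` and the gating form are assumptions, not the authors' module.

```python
import cmath
import math


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (the inverse divides by n)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out


def fourier_filter_gate(x, freq_weights):
    """Gate x with a sigmoid of its frequency-filtered version (illustrative only)."""
    spectrum = dft(x)
    # Scaling one frequency bin changes every output position: a global operation.
    filtered = dft([s * w for s, w in zip(spectrum, freq_weights)], inverse=True)
    context = [value.real for value in filtered]
    gate = [1.0 / (1.0 + math.exp(-c)) for c in context]
    return [xi * g for xi, g in zip(x, gate)]
```

Keeping only the zero-frequency bin (`freq_weights = [1, 0, 0, ...]`) makes the context equal to the mean of `x` at every position, which shows how a single frequency-domain multiplication mixes information across the whole input.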

[75] SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images

Gencer Sumbul,Chang Xu,Emanuele Dalsasso,Devis Tuia

Main category: cs.CV

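TL;DR: SMARTIES is a generic multi-sensor auto-encoder that projects data from different remote sensing sensors into a shared spectrum-aware space, enabling flexible cross-sensor processing.

Details Motivation: Existing deep learning models are often designed for a single sensor or a fixed combination and cannot flexibly handle other sensor inputs, which limits scalability and generalization for multi-sensor remote sensing data.

Contribution: SMARTIES, a generic foundation model that uses cross-sensor data reconstruction and projection into a spectrum-aware space to process multi-sensor data flexibly and extract features efficiently.

Method: A single, unified Transformer is trained to reconstruct masked multi-sensor data with cross-sensor token mixup, yielding sensor-agnostic representations.

Result: On both single- and multi-modal tasks, SMARTIES outperforms models that rely on sensor-specific pretraining.

Insight: The spectrum-aware space and cross-sensor token mixup are key to better generalization and multi-sensor adaptability.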

Abstract: From optical sensors to microwave radars, leveraging the complementary strengths of remote sensing (RS) sensors is crucial for achieving dense spatio-temporal monitoring of our planet. In contrast, recent deep learning models, whether task-specific or foundational, are often specific to single sensors or to fixed combinations: adapting such models to different sensory inputs requires both architectural changes and re-training, limiting scalability and generalization across multiple RS sensors. On the contrary, a single model able to modulate its feature representations to accept diverse sensors as input would pave the way to agile and flexible multi-sensor RS data processing. To address this, we introduce SMARTIES, a generic and versatile foundation model lifting sensor-specific/dependent efforts and enabling scalability and generalization to diverse RS sensors: SMARTIES projects data from heterogeneous sensors into a shared spectrum-aware space, enabling the use of arbitrary combinations of bands both for training and inference. To obtain sensor-agnostic representations, we train a single, unified transformer model reconstructing masked multi-sensor data with cross-sensor token mixup. On both single- and multi-modal tasks across diverse sensors, SMARTIES outperforms previous models that rely on sensor-specific pretraining. Our code and pretrained models are available at https://gsumbul.github.io/SMARTIES.

[76] Vision Transformer-Based Time-Series Image Reconstruction for Cloud-Filling Applications

Lujun Li,Yiqun Wang,Radu State

Main category: cs.CV

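TL;DR: Proposes a Vision Transformer (ViT)-based time-series image reconstruction framework that fills cloud-covered regions of multispectral imagery (MSI) using complementary information from synthetic aperture radar (SAR).

Details Motivation: Cloud cover causes missing MSI data, which hampers early-season crop mapping. SAR is unaffected by clouds but lacks spectral detail, so a method is needed that combines the strengths of both to reconstruct complete MSI data.

Contribution: A novel Time-series ViT framework that uses the attention mechanism to combine time-series MSI with complementary SAR information and fill cloud-covered regions of MSI.

Method: A ViT-based time-series image reconstruction framework that exploits the temporal coherence of MSI and the complementary information from SAR through the attention mechanism to reconstruct cloud-covered regions.

Result: The Time-series ViT framework significantly outperforms baselines that use non-time-series MSI and SAR, or time-series MSI without SAR, for MSI reconstruction in cloud-covered regions.

Insight: Combining time-series MSI with SAR data through ViT attention effectively addresses cloud cover and improves the quality and accuracy of MSI reconstruction.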

Abstract: Cloud cover in multispectral imagery (MSI) poses significant challenges for early-season crop mapping, as it leads to missing or corrupted spectral information. Synthetic aperture radar (SAR) data, which is not affected by cloud interference, offers a complementary solution, but lacks sufficient spectral detail for precise crop mapping. To address this, we propose a novel framework, Time-series MSI Image Reconstruction using Vision Transformer (ViT), to reconstruct MSI data in cloud-covered regions by leveraging the temporal coherence of MSI and the complementary information from SAR through the attention mechanism. Comprehensive experiments, using rigorous reconstruction evaluation metrics, demonstrate that the Time-series ViT framework significantly outperforms baselines that use non-time-series MSI and SAR or time-series MSI without SAR, effectively enhancing MSI image reconstruction in cloud-covered regions.

[77] Implementing blind navigation through multi-modal sensing and gait guidance

Feifan Yan,Tianle Zeng,Meixi He

Main category: cs.CV

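TL;DR: The paper presents a wearable blind-guiding device based on gait analysis and multimodal sensing, and shows experimentally that it outperforms a standard guide cane.

Details Motivation: More than 220 million people worldwide have impaired vision. Traditional aids such as guide canes and guide dogs have shortcomings, so smarter assistive navigation is needed.

Contribution: A wearable blind-guiding device that combines a gait-based guiding system with multimodal environmental sensing to help visually impaired users navigate more effectively.

Method: Gait phase analysis provides walking guidance and multimodal sensing acquires environmental information; both are integrated into a wearable device.

Result: In indoor and outdoor experiments, the device outperforms a standard guide cane.

Insight: Combining gait guidance with multimodal sensing offers a new technical route for navigation aids and shows the potential of intelligent assistive devices.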

Abstract: By 2023, the global population of individuals with impaired vision had surpassed 220 million. People with impaired vision have difficulty finding paths and avoiding obstacles, and must rely on auxiliary tools. Although traditional aids such as guide canes and guide dogs exist, they still have shortcomings. In this paper, we present a wearable blind-guiding device that performs navigation guidance through our proposed Gait-based Guiding System. The device integrates gait phase analysis for walking guidance and, for environmental perception, uses multimodal sensing to acquire diverse environmental information. We conducted both indoor and outdoor experiments and compared the device with a standard guide cane. The results show that our device performs better in blind guidance.

[78] Self-Supervised Multimodal NeRF for Autonomous Driving

Gaurav Sharma,Ravi Kothari,Josef Schmid

Main category: cs.CV

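TL;DR: The paper proposes NVSF, a self-supervised multimodal framework based on Neural Radiance Fields (NeRF) that models static and dynamic autonomous driving scenes without 3D labels, with efficient training and faster convergence.

Details Motivation: Multimodal data in autonomous driving (such as LiDAR and camera) needs efficient modeling, but existing methods typically depend on 3D labels, which limits their applicability. The paper proposes a self-supervised framework to remove this dependence.

Contribution: 1. A self-supervised multimodal NeRF framework (NVSF); 2. Heuristic-based image pixel sampling to improve training efficiency; 3. A Double Gradient based mask to preserve the local features of LiDAR points.

Method: 1. Jointly learn an implicit neural representation of the space- and time-varying scene for LiDAR and camera; 2. Use heuristic sampling to focus on information-rich pixels; 3. Apply a Double Gradient based mask to preserve local features of LiDAR points.

Result: On KITTI-360, NVSF outperforms the baseline models in both the LiDAR and camera domains.

Insight: Self-supervision substantially reduces reliance on 3D labels, and joint multimodal learning helps improve modeling accuracy for driving scenes.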

Abstract: In this paper, we propose a Neural Radiance Fields (NeRF) based framework, referred to as Novel View Synthesis Framework (NVSF). It jointly learns an implicit neural representation of the space- and time-varying scene for both LiDAR and camera. We test it on a real-world autonomous driving scenario containing both static and dynamic scenes. Compared to existing multimodal dynamic NeRFs, our framework is self-supervised, thus eliminating the need for 3D labels. For efficient training and faster convergence, we introduce heuristic-based image pixel sampling to focus on pixels with rich information. To preserve the local features of LiDAR points, a Double Gradient based mask is employed. Extensive experiments on the KITTI-360 dataset show that, compared to the baseline models, our framework achieves the best performance in both the LiDAR and camera domains. Code of the model is available at https://github.com/gaurav00700/Selfsupervised-NVSF

[79] VideoPCDNet: Video Parsing and Prediction with Phase Correlation Networks

Noel José Rodrigues Vicente,Enrique Lehner,Angel Villar-Corrales,Jan Nogga,Sven Behnke

Main category: cs.CV

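TL;DR: The paper presents VideoPCDNet, an unsupervised framework for object-centric video decomposition and prediction that uses frequency-domain phase correlation to parse videos and predict future frames.

Details Motivation: Understanding and predicting video content is essential for planning and reasoning in dynamic environments, but unsupervised learning of object representations and dynamics remains challenging.

Contribution: 1. The unsupervised VideoPCDNet framework; 2. Object parsing and motion modeling built on frequency-domain phase correlation; 3. Better results than object-centric baselines on several synthetic datasets.

Method: Frequency-domain phase correlation recursively parses videos into object components, and lightweight learned modules model object motion.

Result: Outperforms baseline models on unsupervised tracking and prediction while learning interpretable object and motion representations.

Insight: Frequency-domain techniques offer an efficient, interpretable approach to unsupervised object decomposition and prediction.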

Abstract: Understanding and predicting video content is essential for planning and reasoning in dynamic environments. Despite advancements, unsupervised learning of object representations and dynamics remains challenging. We present VideoPCDNet, an unsupervised framework for object-centric video decomposition and prediction. Our model uses frequency-domain phase correlation techniques to recursively parse videos into object components, which are represented as transformed versions of learned object prototypes, enabling accurate and interpretable tracking. By explicitly modeling object motion through a combination of frequency domain operations and lightweight learned modules, VideoPCDNet enables accurate unsupervised object tracking and prediction of future video frames. In our experiments, we demonstrate that VideoPCDNet outperforms multiple object-centric baseline models for unsupervised tracking and prediction on several synthetic datasets, while learning interpretable object and motion representations.
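Classical phase correlation, the frequency-domain building block named in this abstract, can be sketched as follows. The example is 1-D with a naive standard-library DFT for self-containment (video frames would use a 2-D FFT), and the function names are illustrative. It shows the textbook technique, not VideoPCDNet's learned modules.

```python
import cmath


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (the inverse divides by n)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out


def phase_correlation_shift(reference, shifted):
    """Estimate the circular shift s such that shifted[t] == reference[(t - s) % n]."""
    spec_ref, spec_shift = dft(reference), dft(shifted)
    cross_power = []
    for a, b in zip(spec_ref, spec_shift):
        product = b * a.conjugate()
        magnitude = abs(product)
        # Keep only the phase; skip bins with (near-)zero energy.
        cross_power.append(product / magnitude if magnitude > 1e-12 else 0j)
    correlation = [value.real for value in dft(cross_power, inverse=True)]
    # The inverse transform of a pure phase ramp peaks at the shift.
    return max(range(len(correlation)), key=correlation.__getitem__)


# Hypothetical "object" profile moved three positions to the right.
prototype = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
moved = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0]
estimated_shift = phase_correlation_shift(prototype, moved)
```

Because only the phase of the cross-power spectrum is kept, the peak location depends on displacement rather than on brightness, which is what makes the technique attractive for tracking.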

[80] PEVLM: Parallel Encoding for Vision-Language Models

Letian Kang,Shixian Luo,Yiqiang Li,Xiaoyang Yu,Shenxuan Zhou,Yong Wu

Main category: cs.CV

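TL;DR: PEVLM is a parallel encoding strategy that improves the prefill efficiency of vision-language models (VLMs), addressing the high computational cost of standard attention in long video understanding.

Details Motivation: VLMs perform well on video-language tasks, but their use for long video understanding is constrained by the quadratic complexity of standard attention.

Contribution: PEVLM, which partitions the input into blocks with a shared sink to reduce attention complexity while maintaining high accuracy, without requiring model finetuning.

Method: Partition the input into block-wise segments, preserve full-attention positional embeddings, and align attention weights to mimic the full-attention distribution.

Result: On LongVideoBench, PEVLM achieves up to 8.37% higher accuracy than existing inference-efficient methods, up to 7.47x speedup in attention computation, and a 40% reduction in end-to-end latency.

Insight: PEVLM suits low-latency, long-context video understanding and is promising for real-world applications such as autonomous driving.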

Abstract: Vision-Language Models (VLMs) have demonstrated strong performance in video-language tasks, yet their application to long video understanding remains constrained by the quadratic complexity of standard attention mechanisms. In this paper, we propose PEVLM, a parallel encoding strategy specifically designed to improve the prefill efficiency of VLMs without requiring model finetuning. PEVLM partitions the input into block-wise segments with a shared sink, preserves full-attention positional embeddings, and aligns attention weights to mimic full-attention distributions. This design reduces attention computation from $O((T \times N)^2)$ to $O(T \times N)$ while maintaining high accuracy. Extensive experiments on the LongVideoBench benchmark show that PEVLM achieves up to 8.37% accuracy improvement over existing inference-efficient methods and delivers up to 7.47x speedup in attention computation and 40% reduction in end-to-end latency. Under strict latency constraints, PEVLM significantly outperforms baselines, raising accuracy from 23.26% to 61.03%. These results highlight PEVLM’s effectiveness for low-latency, long-context video understanding, making it well-suited for real-world applications such as autonomous driving.
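The complexity claim can be made tangible by counting query-key pairs. The sketch below assumes that T is the number of frames and N the number of tokens per frame (the abstract does not define them), and that each block attends only to itself plus a small shared sink. The parameters `block_tokens` and `sink_tokens` are hypothetical and the count ignores causal masking, so treat it as back-of-the-envelope arithmetic rather than PEVLM's actual cost model.

```python
def full_attention_pairs(frames, tokens_per_frame):
    """Query-key pairs when every token attends to every token."""
    total = frames * tokens_per_frame
    return total * total


def blockwise_attention_pairs(frames, tokens_per_frame, block_tokens, sink_tokens):
    """Query-key pairs when each block attends to itself plus a shared sink."""
    total = frames * tokens_per_frame
    full_blocks, remainder = divmod(total, block_tokens)
    pairs = full_blocks * block_tokens * (block_tokens + sink_tokens)
    # A trailing partial block attends to its own tokens and the sink.
    pairs += remainder * (remainder + sink_tokens)
    return pairs


# Doubling the number of frames quadruples the full-attention count
# but only doubles the block-wise count when the block size is fixed.
small_full = full_attention_pairs(4, 8)                # 1024
large_full = full_attention_pairs(8, 8)                # 4096
small_block = blockwise_attention_pairs(4, 8, 8, 2)    # 320
large_block = blockwise_attention_pairs(8, 8, 8, 2)    # 640
```

For a fixed block size, the block-wise count grows linearly with sequence length, which is the scaling behaviour the abstract describes.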

[81] Video Compression for Spatiotemporal Earth System Data

Oscar J. Pellicer-Valero,Cesar Aybar,Gustau Camps Valls

Main category: cs.CV

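TL;DR: The paper presents xarrayvideo, a Python library that compresses multichannel spatiotemporal Earth system data with video compression techniques, achieving compression ratios of up to 250x while maintaining high fidelity and remaining suitable for deep learning tasks.

Details Motivation: Earth observation datasets are growing rapidly, straining traditional storage and processing. The paper applies video compression to ease this bottleneck and reduce storage and transfer costs.

Contribution: The xarrayvideo library, a demonstration of its compression performance on four real-world datasets, and the redistribution of two of those datasets in compressed form, giving the Earth science community a practical tool.

Method: Encode multichannel spatiotemporal data as videos and compress them with standard video codecs through ffmpeg, exploiting spatial, temporal, and spectral redundancy.

Result: Compression ratios of up to 250x; PSNRs of 40.60 to 55.86 dB at 0.1 bpppb and 54.28 to 65.91 dB at 1 bpppb across the four datasets; no performance loss when the compressed data is used in downstream deep learning tasks.

Insight: Video compression applies well to Earth system data, sharply reducing storage requirements without hurting downstream deep learning performance.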

Abstract: Large-scale Earth system datasets, from high-resolution remote sensing imagery to spatiotemporal climate model outputs, exhibit characteristics analogous to those of standard videos. Their inherent spatial, temporal, and spectral redundancies can thus be readily exploited by established video compression techniques. Here, we present xarrayvideo, a Python library for compressing multichannel spatiotemporal datasets by encoding them as videos. Our approach achieves compression ratios of up to 250x while maintaining high fidelity by leveraging standard, well-optimized video codecs through ffmpeg. We demonstrate the library’s effectiveness on four real-world multichannel spatiotemporal datasets: DynamicEarthNet (very high resolution Planet images), DeepExtremeCubes (high resolution Sentinel-2 images), ERA5 (weather reanalysis data), and the SimpleS2 dataset (high resolution multichannel Sentinel-2 images), achieving Peak Signal-to-Noise Ratios (PSNRs) of 55.86, 40.60, 46.58, and 43.23 dB at 0.1 bits per pixel per band (bpppb) and 65.91, 54.28, 62.90, and 55.04 dB at 1 bpppb. We are redistributing two of these datasets, DeepExtremeCubes (2.3 Tb) and DynamicEarthNet (525 Gb), in the machine-learning-ready and cloud-ready TACO format through HuggingFace at significantly reduced sizes (270 Gb and 8.5 Gb, respectively) without compromising quality (PSNR 55.77-56.65 and 60.15). No performance loss is observed when the compressed versions of these datasets are used in their respective deep learning-based downstream tasks (next step reflectance prediction and landcover segmentation). In conclusion, xarrayvideo presents an efficient solution for handling the rapidly growing size of Earth observation datasets, making advanced compression techniques accessible and practical to the Earth science community. The library is available for use at https://github.com/IPL-UV/xarrayvideo
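The two quantities reported in this abstract, PSNR and size reduction, are simple to compute. The sketch below uses the standard definition PSNR = 10 log10(MAX² / MSE) on flat lists of values and does not call the xarrayvideo API. The helper names are illustrative, and the ratio example assumes 1 Tb = 1000 Gb in the abstract's own units.

```python
import math


def psnr(original, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized value lists."""
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def compression_ratio(original_size, compressed_size):
    """How many times smaller the compressed data is."""
    return original_size / compressed_size


# Sizes quoted in the abstract, both expressed in Gb.
deep_extreme_cubes = compression_ratio(2300.0, 270.0)   # about 8.5x
dynamic_earth_net = compression_ratio(525.0, 8.5)       # about 61.8x
```

An error of 0.1 on every sample of a signal scaled to [0, 1] gives a PSNR of 20 dB, so the 40 to 65 dB values quoted above correspond to much smaller reconstruction errors.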

[82] ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

Long Xing,Qidong Huang,Xiaoyi Dong,Pan Zhang,Yuhang Zang,Yuhang Cao,Jinsong Li,Shuangrui Ding,Weiming Zhang,Nenghai Yu,Jiaqi Wang,Feng Wu,Dahua Lin

Main category: cs.CV

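TL;DR: ScaleCap is an inference-time scalable image captioning strategy that uses dual-modality debiasing to counter the inherent biases of vision-language models and produce more comprehensive, detailed captions.

Details Motivation: Existing large vision-language models (LVLMs) carry inherent multimodal and linguistic biases, which lead to unevenly detailed captions or hallucinated descriptions of non-existent objects; a scalable way to improve caption quality is needed.

Contribution: 1. Proposes ScaleCap, which refines captions through a dual-modality debiasing strategy built on heuristic question answering and contrastive sentence rating; 2. Validates the resulting gains on 11 benchmarks; 3. Demonstrates the richness and fidelity of the generated captions.

Method: 1. Heuristic question answering: generates content-specific questions about the image and answers them, progressively enriching the caption; 2. Contrastive sentence rating: applies sentence-level offline contrastive decoding to remove hallucinations caused by linguistic bias.

Result: Captions generated by ScaleCap are more accurate, balanced, and informative, and perform well when used in place of images for VQA and for image reconstruction. Pretraining LVLMs on 450K images annotated with ScaleCap yields further gains.

Insight: By refining captions as the inference budget grows, ScaleCap mitigates the inherent biases of vision-language models and offers a new route to high-quality image captioning.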

Abstract: This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimodal bias resulting in imbalanced descriptive granularity, offering detailed accounts of some elements while merely skimming over others; linguistic bias leading to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs sentence-level offline contrastive decoding to effectively identify and eliminate hallucinations caused by linguistic biases. With increased inference cost, more heuristic questions are raised by ScaleCap to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Extensive modality alignment experiments demonstrate the effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks. Furthermore, ScaleCap showcases superb richness and fidelity of generated captions with two additional tasks: replacing images with captions in VQA task, and reconstructing images from captions to assess semantic coverage. Code is available at https://github.com/Cooperx521/ScaleCap.

[83] SAM2-SGP: Enhancing SAM2 for Medical Image Segmentation via Support-Set Guided Prompting

Yang Xing,Jiong Wu,Yuheng Bu,Kuang Gong

Main category: cs.CV

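TL;DR: The paper proposes SAM2-SGP, a framework that uses support-set guided prompting to remove SAM2's dependence on manual prompts and to address domain shift in medical image segmentation.

Details Motivation: Although SAM2 performs well at zero-shot image segmentation, it still requires manually provided prompts for medical image segmentation and suffers from domain shift, both of which limit its performance.

Contribution: 1. Proposes SAM2-SGP, a framework that needs no manual prompts; 2. Introduces Pseudo-mask Generation (PMG) and Pseudo-mask Attention (PMA) modules; 3. Adopts a low-rank adaptation (LoRA) strategy to mitigate domain shift.

Method: 1. Generates pseudo-masks from image-mask pairs in a support set (PMG module); 2. Uses the PMA module to generate bounding boxes automatically and enhance localized feature extraction; 3. Adapts the model to the medical imaging domain with LoRA.

Result: Significantly outperforms state-of-the-art models (such as nnUNet and SwinUNet) and foundation models (such as SAM2 and MedSAM2) across multiple medical imaging modalities, including X-ray, CT, and MRI.

Insight: Automated prompt generation combined with domain adaptation gives SAM2-SGP an efficient medical image segmentation pipeline that needs no human intervention.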

Abstract: Although new vision foundation models such as Segment Anything Model 2 (SAM2) have significantly enhanced zero-shot image segmentation capabilities, reliance on human-provided prompts poses significant challenges in adapting SAM2 to medical image segmentation tasks. Moreover, SAM2’s performance in medical image segmentation was limited by the domain shift issue, since it was originally trained on natural images and videos. To address these challenges, we proposed SAM2 with support-set guided prompting (SAM2-SGP), a framework that eliminated the need for manual prompts. The proposed model leveraged the memory mechanism of SAM2 to generate pseudo-masks using image-mask pairs from a support set via a Pseudo-mask Generation (PMG) module. We further introduced a novel Pseudo-mask Attention (PMA) module, which used these pseudo-masks to automatically generate bounding boxes and enhance localized feature extraction by guiding attention to relevant areas. Furthermore, a low-rank adaptation (LoRA) strategy was adopted to mitigate the domain shift issue. The proposed framework was evaluated on both 2D and 3D datasets across multiple medical imaging modalities, including fundus photography, X-ray, computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), and ultrasound. The results demonstrated a significant performance improvement over state-of-the-art models, such as nnUNet and SwinUNet, as well as foundation models, such as SAM2 and MedSAM2, underscoring the effectiveness of the proposed approach. Our code is publicly available at https://github.com/astlian9/SAM_Support.
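One step in the pipeline above, turning a pseudo-mask into a bounding-box prompt, is easy to illustrate. This is a generic sketch, not the authors' PMA module: the function name `mask_to_bbox`, the 0.5 threshold, and the optional pixel margin are all assumptions made for illustration.

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray, threshold: float = 0.5, margin: int = 0):
    """Return (x_min, y_min, x_max, y_max) of the thresholded mask, or None if it is empty.

    `mask` is a 2D array of scores; `margin` pads the box, clipped to the image bounds.
    """
    binary = mask > threshold
    if not binary.any():
        return None  # nothing to prompt with
    rows = np.flatnonzero(binary.any(axis=1))  # row indices containing foreground
    cols = np.flatnonzero(binary.any(axis=0))  # column indices containing foreground
    height, width = binary.shape
    x_min = max(int(cols[0]) - margin, 0)
    y_min = max(int(rows[0]) - margin, 0)
    x_max = min(int(cols[-1]) + margin, width - 1)
    y_max = min(int(rows[-1]) + margin, height - 1)
    return x_min, y_min, x_max, y_max

# Toy pseudo-mask with a 3x4 foreground block.
toy_mask = np.zeros((8, 8))
toy_mask[2:5, 3:7] = 1.0
print(mask_to_bbox(toy_mask))
```

A box derived this way could stand in for the human-drawn box prompt that SAM2 would otherwise need, which is the dependency the framework is designed to remove.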

[84] Semantic Scene Graph for Ultrasound Image Explanation and Scanning Guidance

Xuesong Li,Dianye Huang,Yameng Zhang,Nassir Navab,Zhongliang Jiang

Main category: cs.CV

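TL;DR: The paper proposes a scene graph (SG) based approach to ultrasound image explanation and scanning guidance. The SG is generated with a transformer-based one-stage method and then refined by a large language model (LLM) to produce accessible explanations. The paper also explores using the SG to guide scanning toward missing anatomical structures.

Details Motivation: Ultrasound images vary widely in appearance, and the need of non-expert users (for example in point-of-care settings) for interpretation and scanning guidance has not yet been explored.

Contribution: 1) Introduces scene graphs for ultrasound images to explain image content and provide scanning guidance; 2) Proposes a transformer-based one-stage SG generation method; 3) Uses an LLM to refine the SG according to the user query; 4) Explores the SG's potential for guiding scans toward missing anatomical structures.

Method: 1) Generates the SG of an ultrasound image with a transformer-based one-stage method; 2) Refines the SG with an LLM according to the user query; 3) Uses the SG to guide scanning toward missing anatomical structures.

Result: The method was validated on neck images (carotid and thyroid) from five volunteers and shows potential for improving the interpretability and usability of ultrasound.

Insight: Combining scene graphs with LLMs gives non-expert users an intuitive means of ultrasound interpretation and scanning guidance, which may help broaden the use of ultrasound.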

Abstract: Understanding medical ultrasound imaging remains a long-standing challenge due to significant visual variability caused by differences in imaging and acquisition parameters. Recent advancements in large language models (LLMs) have been used to automatically generate terminology-rich summaries oriented to clinicians with sufficient physiological knowledge. Nevertheless, the increasing demand for improved ultrasound interpretability and basic scanning guidance among non-expert users, e.g., in point-of-care settings, has not yet been explored. In this study, we first introduce the scene graph (SG) for ultrasound images to explain image content to ordinary users and provide guidance for ultrasound scanning. The ultrasound SG is first computed using a transformer-based one-stage method, eliminating the need for explicit object detection. To generate a graspable image explanation for ordinary users, the user query is then used to further refine the abstract SG representation through LLMs. Additionally, the predicted SG is explored for its potential in guiding ultrasound scanning toward missing anatomies within the current imaging view, assisting ordinary users in achieving more standardized and complete anatomical exploration. The effectiveness of this SG-based image explanation and scanning guidance has been validated on images from the left and right neck regions, including the carotid and thyroid, across five volunteers. The results demonstrate the potential of the method to maximally democratize ultrasound by enhancing its interpretability and usability for ordinary users.

[85] UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation

Yue Zhou,Yuan Bi,Wenjuan Tong,Wei Wang,Nassir Navab,Zhongliang Jiang

Main category: cs.CV

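TL;DR: UltraAD is a vision-language model based approach that performs fine-grained anomaly classification in ultrasound images from few-shot examples while addressing domain gaps.

Details Motivation: Anomaly detection in medical images requires fine-grained classification, yet existing methods struggle to separate cases such as benign and malignant tumors; ultrasound images are also sensitive to devices and acquisition parameters, which creates significant domain gaps. UltraAD targets both problems.

Contribution: 1. Proposes UltraAD, a VLM-based method for anomaly localization and fine-grained classification from few-shot examples; 2. Designs an image-text fusion mechanism that improves localization; 3. Builds a memory bank to improve classification.

Method: 1. Fuses visual prototypes with learnable text embeddings to form image-informed prompts; 2. Integrates these prompts with patch-level tokens to refine local representations; 3. Builds a memory bank from few-shot samples with text descriptions and adapts image features to align with medical data.

Result: Outperforms existing methods on three breast ultrasound datasets in both lesion localization and fine-grained classification.

Insight: Aligning textual information with visual features is an effective route to fine-grained classification in the medical domain, and few-shot examples combined with a VLM can help bridge domain gaps.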

Abstract: Precise anomaly detection in medical images is critical for clinical decision-making. While recent unsupervised or semi-supervised anomaly detection methods trained on large-scale normal data show promising results, they lack fine-grained differentiation, such as benign vs. malignant tumors. Additionally, ultrasound (US) imaging is highly sensitive to devices and acquisition parameter variations, creating significant domain gaps in the resulting US images. To address these challenges, we propose UltraAD, a vision-language model (VLM)-based approach that leverages few-shot US examples for generalized anomaly localization and fine-grained classification. To enhance localization performance, the image-level token of query visual prototypes is first fused with learnable text embeddings. This image-informed prompt feature is then further integrated with patch-level tokens, refining local representations for improved accuracy. For fine-grained classification, a memory bank is constructed from few-shot image samples and corresponding text descriptions that capture anatomical and abnormality-specific features. During training, the stored text embeddings remain frozen, while image features are adapted to better align with medical data. UltraAD has been extensively evaluated on three breast US datasets, outperforming state-of-the-art methods in both lesion localization and fine-grained medical classification. The code will be released upon acceptance.

[86] Systematic Comparison of Projection Methods for Monocular 3D Human Pose Estimation on Fisheye Images

Stephanie Käs,Sven Peter,Henrik Thillmann,Anton Burenko,David Benjamin Adrian,Dennis Mack,Timm Linder,Bastian Leibe

Main category: cs.CV

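TL;DR: The paper systematically compares projection methods for 3D human pose estimation on monocular fisheye images, finds that models such as the double sphere model significantly improve accuracy, and proposes a heuristic that selects the projection model from the detection bounding box.

Details Motivation: Fisheye cameras give robots a wider field of view (FOV), but the curved distortion of fisheye optics makes human pose estimation harder. The effect of different projection methods on monocular fisheye images has not been systematically evaluated, particularly for poses that cover a wide FOV.

Contribution: 1. Systematically evaluates projection methods (pinhole, equidistant, double sphere, and cylindrical) on fisheye images. 2. Proposes a heuristic that selects the appropriate projection model from the detection bounding box. 3. Releases FISHnCHIPS, a new dataset of diverse fisheye images.

Method: Experimentally compares the pinhole, equidistant, and double sphere camera models and cylindrical projection on monocular fisheye images, and proposes a bounding-box-based method for choosing the projection model.

Result: In close-up scenarios pinhole projection is inadequate, while the double sphere model significantly improves 3D pose estimation accuracy. The best projection method depends on the FOV covered by the human pose.

Insight: For 3D pose estimation on fisheye images, the projection method should be chosen according to the FOV the pose covers; advanced models such as the double sphere model perform well in wide-FOV scenes.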

Abstract: Fisheye cameras offer robots the ability to capture human movements across a wider field of view (FOV) than standard pinhole cameras, making them particularly useful for applications in human-robot interaction and automotive contexts. However, accurately detecting human poses in fisheye images is challenging due to the curved distortions inherent to fisheye optics. While various methods for undistorting fisheye images have been proposed, their effectiveness and limitations for poses that cover a wide FOV have not been systematically evaluated in the context of absolute human pose estimation from monocular fisheye images. To address this gap, we evaluate the impact of pinhole, equidistant and double sphere camera models, as well as cylindrical projection methods, on 3D human pose estimation accuracy. We find that in close-up scenarios, pinhole projection is inadequate, and the optimal projection method varies with the FOV covered by the human pose. The usage of advanced fisheye models like the double sphere model significantly enhances 3D human pose estimation accuracy. We propose a heuristic for selecting the appropriate projection model based on the detection bounding box to enhance prediction quality. Additionally, we introduce and evaluate on our novel dataset FISHnCHIPS, which features 3D human skeleton annotations in fisheye images, including images from unconventional angles, such as extreme close-ups, ground-mounted cameras, and wide-FOV poses, available at: https://www.vision.rwth-aachen.de/fishnchips
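To make the difference between the compared camera models concrete, the sketch below projects a 3D point with each of them. It is an illustration under stated assumptions rather than the paper's evaluation code: the function names are hypothetical, the principal point is taken as the origin, a single focal length `f` is used, and the double sphere formula follows the commonly cited formulation with parameters `xi` and `alpha` (its validity check for extreme angles is omitted).

```python
import numpy as np

def project_pinhole(point, f):
    """Pinhole model, r = f * tan(theta); undefined at or beyond 90 degrees off-axis."""
    x, y, z = point
    if z <= 0:
        raise ValueError("pinhole projection needs z > 0")
    return np.array([f * x / z, f * y / z])

def project_equidistant(point, f):
    """Equidistant fisheye model, r = f * theta; stays finite at 90 degrees off-axis."""
    x, y, z = point
    r_xy = np.hypot(x, y)
    if r_xy == 0:
        return np.zeros(2)  # point on the optical axis
    theta = np.arctan2(r_xy, z)  # angle from the optical axis
    return f * theta * np.array([x, y]) / r_xy

def project_double_sphere(point, f, xi, alpha):
    """Double sphere model (assumed formulation); xi = alpha = 0 reduces to pinhole."""
    x, y, z = point
    d1 = np.sqrt(x * x + y * y + z * z)
    d2 = np.sqrt(x * x + y * y + (xi * d1 + z) ** 2)
    denom = alpha * d2 + (1.0 - alpha) * (xi * d1 + z)
    return np.array([f * x / denom, f * y / denom])

# A point 45 degrees off-axis: the pinhole image radius is already larger.
p = (1.0, 0.0, 1.0)
print(project_pinhole(p, 100.0), project_equidistant(p, 100.0))
```

The pinhole radius grows as `tan(theta)` and diverges toward 90 degrees, while the equidistant radius grows linearly in `theta`. This is one way to see why pinhole reprojection stretches wide-FOV, close-up poses.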

[87] CoCo4D: Comprehensive and Complex 4D Scene Generation

Junwei Zhou,Xueting Li,Lu Qi,Ming-Hsuan Yang

Main category: cs.CV

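TL;DR: CoCo4D is a framework for generating dynamic 4D scenes from text prompts. It splits generation into a dynamic foreground and an evolving background, both guided by a reference motion sequence, and uses progressive outpainting to synthesize multi-view consistent, immersive 4D scenes.

Details Motivation: Existing 4D synthesis methods are mostly limited to object-level generation or to dynamic scenes with few novel views, so they cannot produce multi-view consistent, immersive dynamic 4D scenes. A more comprehensive solution is needed.

Contribution: 1. Proposes CoCo4D, a framework that generates complex dynamic 4D scenes from text prompts; 2. Decomposes the 4D scene into separate generation tasks for the dynamic foreground and the evolving background; 3. Introduces a progressive outpainting scheme and parametric trajectory optimization to keep the scene consistent.

Method: 1. Generates an initial motion sequence with video diffusion models; 2. Synthesizes the dynamic foreground and the background, guided by the motion sequence, using progressive outpainting; 3. Optimizes a parametric trajectory for the foreground object so that it blends naturally with the background.

Result: Experiments show that CoCo4D performs comparably to or better than existing methods on 4D scene generation, supporting its effectiveness and efficiency.

Insight: Decomposing scene generation and guiding it with a motion sequence can markedly improve the quality and consistency of dynamic 4D scenes.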

Abstract: Existing 4D synthesis methods primarily focus on object-level generation or dynamic scene synthesis with limited novel views, restricting their ability to generate multi-view consistent and immersive dynamic 4D scenes. To address these constraints, we propose a framework (dubbed as CoCo4D) for generating detailed dynamic 4D scenes from text prompts, with the option to include images. Our method leverages the crucial observation that articulated motion typically characterizes foreground objects, whereas background alterations are less pronounced. Consequently, CoCo4D divides 4D scene synthesis into two responsibilities: modeling the dynamic foreground and creating the evolving background, both directed by a reference motion sequence. Given a text prompt and an optional reference image, CoCo4D first generates an initial motion sequence utilizing video diffusion models. This motion sequence then guides the synthesis of both the dynamic foreground object and the background using a novel progressive outpainting scheme. To ensure seamless integration of the moving foreground object within the dynamic background, CoCo4D optimizes a parametric trajectory for the foreground, resulting in realistic and coherent blending. Extensive experiments show that CoCo4D achieves comparable or superior performance in 4D scene generation compared to existing methods, demonstrating its effectiveness and efficiency. More results are presented on our website https://colezwhy.github.io/coco4d/.

[88] Bind-Your-Avatar: Multi-Talking-Character Video Generation with Dynamic 3D-mask-based Embedding Router

Yubo Huang,Weiqiang Wang,Sirui Zhao,Tong Xu,Lin Liu,Enhong Chen

Main category: cs.CV

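TL;DR: Bind-Your-Avatar is an MM-DiT-based model for generating conversation videos with multiple talking characters in the same scene; it addresses audio-to-character correspondence control and the lack of suitable data.

Details Motivation: Existing methods mainly target single-character talking head generation. Generating conversation videos with several characters in one scene faces two challenges: controlling which audio drives which character, and the scarcity of data.

Contribution: 1. Proposes an Embedding Router framework that binds characters to their speech; 2. Implements a 3D-mask embedding router; 3. Builds the first dataset for multi-talking-character video generation; 4. Establishes a benchmark for dual-talking-character video generation.

Method: Uses an MM-DiT model combined with the embedding router and 3D masks, with loss functions based on geometric priors and a mask refinement strategy to control character generation.

Result: Experiments show that the method outperforms existing techniques on multi-character video generation, with more accurate audio-to-character correspondence and smoother results.

Insight: Combining 3D masks with geometric priors provides fine-grained control and temporal smoothness for multi-character video generation, and the new dataset and benchmark support further research in this area.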

Abstract: Recent years have witnessed remarkable advances in audio-driven talking head generation. However, existing approaches predominantly focus on single-character scenarios. While some methods can create separate conversation videos between two individuals, the critical challenge of generating unified conversation videos with multiple physically co-present characters sharing the same spatial environment remains largely unaddressed. This setting presents two key challenges: audio-to-character correspondence control and the lack of suitable datasets featuring multi-character talking videos within the same scene. To address these challenges, we introduce Bind-Your-Avatar, an MM-DiT-based model specifically designed for multi-talking-character video generation in the same scene. Specifically, we propose (1) a novel framework incorporating a fine-grained Embedding Router that binds 'who' and 'speak what' together to address the audio-to-character correspondence control; (2) two methods for implementing a 3D-mask embedding router that enables frame-wise, fine-grained control of individual characters, with distinct loss functions based on observed geometric priors and a mask refinement strategy to enhance the accuracy and temporal smoothness of the predicted masks; (3) the first dataset, to the best of our knowledge, specifically constructed for multi-talking-character video generation, accompanied by an open-source data processing pipeline; and (4) a benchmark for dual-talking-character video generation, with extensive experiments demonstrating superior performance over multiple state-of-the-art methods.

[89] SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution

Liangbin Xie,Yu Li,Shian Du,Menghan Xia,Xintao Wang,Fanghua Yu,Ziyan Chen,Pengfei Wan,Jiantao Zhou,Chao Dong

Main category: cs.CV

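TL;DR: The paper presents SimpleGVR, a simple baseline for studying the key design principles of cascaded video super-resolution (VSR) models; decoupling generation into two stages enables efficient high-resolution video generation.

Details Motivation: As users expect higher-resolution video, relying on latent-space computation alone is no longer adequate. Decoupling the process into semantic content generation and detail synthesis produces high-quality output more efficiently.

Contribution: 1. Proposes two degradation strategies that generate training data better matched to the base model's output; 2. Provides key insights into VSR model behavior through systematic analysis; 3. Introduces an interleaving temporal unit and sparse local attention for efficient training and inference.

Method: Uses a two-stage cascade: a base model generates low-resolution semantic content, and a lightweight VSR model synthesizes detail. The design is refined through degradation strategies, timestep sampling analysis, and attention mechanisms.

Result: Experiments show that the method outperforms existing techniques, ablation studies confirm each design choice, and computational overhead is substantially reduced.

Insight: Cascaded VSR models should be designed to align with the base model's output; timestep sampling and noise augmentation strategies strongly affect performance.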

Abstract: Latent diffusion models have emerged as a leading paradigm for efficient video generation. However, as user expectations shift toward higher-resolution outputs, relying solely on latent computation becomes inadequate. A promising approach involves decoupling the process into two stages: semantic content generation and detail synthesis. The former employs a computationally intensive base model at lower resolutions, while the latter leverages a lightweight cascaded video super-resolution (VSR) model to achieve high-resolution output. In this work, we focus on studying key design principles for the latter, cascaded VSR models, which are currently underexplored. First, we propose two degradation strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies and (2) noise augmentation effects on low-resolution (LR) inputs. These findings directly inform our architectural and training innovations. Finally, we introduce an interleaving temporal unit and sparse local attention to achieve efficient training and inference, drastically reducing computational overhead. Extensive experiments demonstrate the superiority of our framework over existing methods, with ablation studies confirming the efficacy of each design choice. Our work establishes a simple yet effective baseline for cascaded video super-resolution generation, offering practical insights to guide future advancements in efficient cascaded synthesis systems.

[90] Improving Progressive Generation with Decomposable Flow Matching

Moayed Haji-Ali,Willi Menapace,Ivan Skorokhodov,Arpit Sahni,Sergey Tulyakov,Vicente Ordonez,Aliaksandr Siarohin

Main category: cs.CV

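TL;DR: The paper proposes Decomposable Flow Matching (DFM), a framework that applies Flow Matching independently at each level of a multi-scale representation, which simplifies progressive generation of visual media and improves results for both images and videos.

Details Motivation: Existing progressive generation methods usually rely on complex multi-stage architectures, which increase overall complexity and require custom diffusion formulations or samplers. DFM aims to be a simple and effective alternative.

Contribution: DFM needs no complex architecture or custom design; it achieves progressive generation by applying Flow Matching independently per level and improves visual quality.

Method: DFM applies Flow Matching independently at each level of a multi-scale representation (such as a Laplacian pyramid), simplifying progressive generation.

Result: On Imagenet-1k 512px, DFM improves FDD scores by 35.2% over the base architecture and by 26.4% over the best-performing baseline under the same training compute; it also converges faster when finetuning large models such as FLUX.

Insight: DFM shows that a simple architecture with minimal changes to existing training pipelines can substantially improve progressive generation.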

Abstract: Generating high-dimensional visual modalities is a computationally intensive task. A common solution is progressive generation, where the outputs are synthesized in a coarse-to-fine spectral autoregressive manner. While diffusion models benefit from the coarse-to-fine nature of denoising, explicit multi-stage architectures are rarely adopted. These architectures have increased the complexity of the overall approach, introducing the need for a custom diffusion formulation, decomposition-dependent stage transitions, ad hoc samplers, or a model cascade. Our contribution, Decomposable Flow Matching (DFM), is a simple and effective framework for the progressive generation of visual media. DFM applies Flow Matching independently at each level of a user-defined multi-scale representation (such as a Laplacian pyramid). As shown by our experiments, our approach improves visual quality for both images and videos, featuring superior results compared to prior multistage frameworks. On Imagenet-1k 512px, DFM achieves a 35.2% improvement in FDD scores over the base architecture and 26.4% over the best-performing baseline, under the same training compute. When applied to finetuning of large models, such as FLUX, DFM shows faster convergence speed to the training distribution. Crucially, all these advantages are achieved with a single model, architectural simplicity, and minimal modifications to existing training pipelines.
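The multi-scale representation mentioned above can be illustrated with a small Laplacian pyramid. This is a simplified sketch, not the DFM implementation: it uses 2x2 box averaging and nearest-neighbour upsampling in place of the usual Gaussian filtering, the function names are hypothetical, and the image sides are assumed divisible by 2 at every level.

```python
import numpy as np

def downsample2(x: np.ndarray) -> np.ndarray:
    """Halve each side by averaging non-overlapping 2x2 blocks."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(x: np.ndarray) -> np.ndarray:
    """Double each side by nearest-neighbour repetition."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def laplacian_pyramid(image: np.ndarray, levels: int) -> list:
    """Return [finest detail, ..., coarsest detail, low-resolution base]."""
    pyramid = []
    current = image.astype(np.float64)
    for _ in range(levels - 1):
        coarse = downsample2(current)
        pyramid.append(current - upsample2(coarse))  # detail lost by downsampling
        current = coarse
    pyramid.append(current)  # coarsest level holds the low-frequency content
    return pyramid

def reconstruct(pyramid: list) -> np.ndarray:
    """Invert the decomposition, going from coarse to fine."""
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        current = upsample2(current) + detail
    return current

image = np.arange(64, dtype=np.float64).reshape(8, 8)
levels = laplacian_pyramid(image, 3)
print([level.shape for level in levels])
```

Because each level stores exactly what the next coarser level cannot represent, the decomposition is invertible; a progressive generator can therefore treat each level as its own target and still recover the full-resolution output.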

[91] GenHSI: Controllable Generation of Human-Scene Interaction Videos

Zekun Li,Rui Zhou,Rahul Sajnani,Xiaoyan Cong,Daniel Ritchie,Srinath Sridhar

Main category: cs.CV

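TL;DR: GenHSI is a training-free method for controllable generation of long human-scene interaction (HSI) videos. Its three-stage pipeline (script writing, pre-visualization, animation) addresses the problems existing methods have with identity preservation and realistic interaction.

Details Motivation: Large-scale pre-trained video diffusion models generate diverse videos well, but when used for long, movie-like videos they still suffer from unrealistic human-scene interaction, poor identity preservation, and high training costs.

Contribution: Proposes GenHSI, the first training-free method that generates long videos with a consistent camera pose and rich human-scene interactions from a single scene image, without requiring accurately scanned scenes.

Method: A three-stage pipeline (script writing, pre-visualization, animation) breaks complex tasks into atomic tasks and uses 3D keyframes with off-the-shelf video diffusion models to generate consistent long videos.

Result: Experiments show that GenHSI preserves scene content and character identity and generates plausible human-scene interaction videos.

Insight: Borrowing the three-stage workflow of movie animation is an effective way to handle identity preservation and interaction realism in long video generation, and the training-free design lowers the cost of adoption.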

Abstract: Large-scale pre-trained video diffusion models have exhibited remarkable capabilities in diverse video generation. However, existing solutions face several challenges in using these models to generate long movie-like videos with rich human-object interactions, including unrealistic human-scene interaction, a lack of subject identity preservation, and the need for expensive training. We propose GenHSI, a training-free method for controllable generation of long human-scene interaction videos (HSI). Taking inspiration from movie animation, our key insight is to overcome the limitations of previous work by subdividing the long video generation task into three stages: (1) script writing, (2) pre-visualization, and (3) animation. Given an image of a scene, a user description, and multiple images of a person, we use these three stages to generate long videos that preserve human identity and provide rich human-scene interactions. Script writing converts complex human tasks into simple atomic tasks that are used in the pre-visualization stage to generate 3D keyframes (storyboards). These 3D keyframes are rendered and animated by off-the-shelf video diffusion models for consistent long video generation with rich contacts in a 3D-aware manner. A key advantage of our work is that we alleviate the need for scanned, accurate scenes and create 3D keyframes from single-view images. We are the first to generate a long video sequence with a consistent camera pose that contains arbitrary numbers of character actions without training. Experiments demonstrate that our method can generate long videos that effectively preserve scene content and character identity with plausible human-scene interaction from a single image scene. Visit our project homepage https://kunkun0w0.github.io/project/GenHSI/ for more information.

[92] A Comparative Study of NAFNet Baselines for Image Restoration

Vladislav Esaulov,M. Moein Esfahani

Main category: cs.CV

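TL;DR: The paper runs an ablation study of the core components of NAFNet (Nonlinear Activation Free Network) on image restoration and confirms the value of the SimpleGate activation, Simplified Channel Attention (SCA), and Layer Normalization (LayerNorm).

Details Motivation: The study aims to check whether the NAFNet design is justified by comparing variants of its components and identifying which ones matter most for restoration performance.

Contribution: Experimentally confirms the advantages of NAFNet's SimpleGate activation and simplified attention mechanism, and highlights the importance of LayerNorm for training stability.

Method: Restores corrupted CIFAR10 images and compares NAFNet variants that replace or remove core components, using PSNR and SSIM as metrics.

Result: SimpleGate and the simplified attention mechanism outperform conventional activations and attention, and LayerNorm improves training stability.

Insight: Simplified designs such as SimpleGate and SCA give better results than more conventional components, and LayerNorm is a key factor for stable training.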

Abstract: We study NAFNet (Nonlinear Activation Free Network), a simple and efficient deep learning baseline for image restoration. By using CIFAR10 images corrupted with noise and blur, we conduct an ablation study of NAFNet’s core components. Our baseline model implements SimpleGate activation, Simplified Channel Attention (SCA), and Layer Normalization. We compare this baseline to different variants that replace or remove components. Quantitative results (PSNR, SSIM) and examples illustrate how each modification affects restoration performance. Our findings support the NAFNet design: the SimpleGate and simplified attention mechanisms yield better results than conventional activations and attention, while LayerNorm proves to be important for stable training. We conclude with recommendations for model design and a discussion of potential improvements and future work.
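The two components under ablation are small enough to write out. The following is a NumPy sketch based on how the original NAFNet paper describes them (SimpleGate multiplies the two halves of the channel dimension; SCA rescales channels by a linear map of their global average); it is not the code used in this study, and the function names and the `(C, H, W)` layout are assumptions.

```python
import numpy as np

def simple_gate(x: np.ndarray) -> np.ndarray:
    """Split channels into two halves and multiply them element-wise.

    x has shape (C, H, W) with even C; the result has shape (C // 2, H, W).
    """
    first, second = np.split(x, 2, axis=0)
    return first * second

def simplified_channel_attention(x: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Rescale each channel by a linear function of the globally pooled features.

    weight has shape (C, C) and bias shape (C,); they play the role of a 1x1 convolution.
    """
    pooled = x.mean(axis=(1, 2))          # global average pool, shape (C,)
    scale = weight @ pooled + bias        # per-channel scale, no sigmoid or ReLU
    return x * scale[:, None, None]

features = np.arange(16, dtype=np.float64).reshape(4, 2, 2)
gated = simple_gate(features)
attended = simplified_channel_attention(features, np.eye(4), np.zeros(4))
print(gated.shape, attended.shape)
```

Neither function contains an explicit nonlinear activation such as ReLU or sigmoid; the only nonlinearity is the product of features, which is the sense in which the network is "activation free".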

[93] Unified Vision-Language-Action Model

Yuqi Wang,Xinghang Li,Wenxuan Wang,Junbo Zhang,Yingyan Li,Yuntao Chen,Xinlong Wang,Zhaoxiang Zhang

Main category: cs.CV

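TL;DR: UniVLA is a multimodal vision-language-action (VLA) model that represents vision, language, and action signals as discrete token sequences, enabling flexible multimodal task learning and state-of-the-art results on several simulation benchmarks.

Details Motivation: Most existing VLA models rely on the general comprehension of vision-language models to generate action signals and overlook the rich temporal and causal structure in visual observations; a unified model is needed to capture these dynamics.

Contribution: Proposes UniVLA, a multimodal model that autoregressively models vision, language, and action signals; through post-training it captures causal dynamics from videos and substantially improves performance on long-horizon tasks.

Method: UniVLA models vision, language, and action signals uniformly as discrete token sequences and incorporates world modeling during post-training to capture dynamics, supporting flexible multimodal task learning.

Result: UniVLA sets state-of-the-art results on benchmarks including CALVIN, LIBERO, and Simplenv-Bridge (for example, a 95.5% average success rate on LIBERO) and is shown to apply to real-world ALOHA manipulation and autonomous driving.

Insight: A unified architecture combined with world modeling both increases flexibility across multimodal tasks and captures the causal dynamics needed for policy learning on long-horizon tasks.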

Abstract: Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This formulation enables flexible multimodal task learning, particularly from large-scale video data. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning, especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and Simplenv-Bridge, significantly surpassing previous methods. For example, UniVLA achieves a 95.5% average success rate on the LIBERO benchmark, surpassing pi0-FAST’s 85.5%. We further demonstrate its broad applicability on real-world ALOHA manipulation and autonomous driving.

[94] AnimaX: Animating the Inanimate in 3D with Joint Video-Pose Diffusion Models

Zehuan Huang,Haoran Feng,Yangtian Sun,Yuanchen Guo,Yanpei Cao,Lu Sheng

Main category: cs.CV

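TL;DR: AnimaX is a 3D animation framework built on a joint video-pose diffusion model; it combines the motion priors of video models with the controllable structure of skeleton-based animation to generate 3D animation efficiently.

Details Motivation: Traditional motion synthesis methods are either restricted to fixed skeletal topologies or require costly optimization in high-dimensional deformation spaces. AnimaX combines video diffusion models with skeleton-based animation for more flexible and efficient 3D animation.

Contribution: 1. Proposes the AnimaX framework, which brings video motion priors to 3D animation; 2. Represents 3D motion as multi-view, multi-frame 2D pose maps; 3. Introduces shared positional encodings and modality-aware embeddings to keep video and pose sequences aligned in space and time.

Method: 1. Represents 3D motion as multi-view, multi-frame 2D pose maps; 2. Runs joint video-pose diffusion conditioned on template renderings and a textual motion prompt; 3. Triangulates the multi-view pose sequences into 3D joint positions.

Result: On the VBench benchmark, AnimaX achieves state-of-the-art results in generalization, motion fidelity, and efficiency.

Insight: Combining the motion priors of video diffusion models with the controllability of skeleton-based animation gives a scalable solution for category-agnostic 3D animation.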

Abstract: We present AnimaX, a feed-forward 3D animation framework that bridges the motion priors of video diffusion models with the controllable structure of skeleton-based animation. Traditional motion synthesis methods are either restricted to fixed skeletal topologies or require costly optimization in high-dimensional deformation spaces. In contrast, AnimaX effectively transfers video-based motion knowledge to the 3D domain, supporting diverse articulated meshes with arbitrary skeletons. Our method represents 3D motion as multi-view, multi-frame 2D pose maps, and enables joint video-pose diffusion conditioned on template renderings and a textual motion prompt. We introduce shared positional encodings and modality-aware embeddings to ensure spatial-temporal alignment between video and pose sequences, effectively transferring video priors to the motion generation task. The resulting multi-view pose sequences are triangulated into 3D joint positions and converted into mesh animation via inverse kinematics. Trained on a newly curated dataset of 160,000 rigged sequences, AnimaX achieves state-of-the-art results on VBench in generalization, motion fidelity, and efficiency, offering a scalable solution for category-agnostic 3D animation. Project page: https://anima-x.github.io/.
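The triangulation step, which lifts matching 2D joint positions from several views into one 3D position, can be sketched with the standard linear (DLT) method. This is a generic illustration, not AnimaX's code: the function names are hypothetical, the cameras are given as known 3x4 projection matrices, and the 2D observations are assumed noise-free.

```python
import numpy as np

def project(P: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Project a 3D point X with a 3x4 camera matrix P to 2D pixel coordinates."""
    h = P @ np.append(X, 1.0)
    return h[:2] / h[2]

def triangulate(projections, points_2d) -> np.ndarray:
    """Linear (DLT) triangulation of one 3D point from two or more views."""
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])  # constraint from the horizontal coordinate
        rows.append(v * P[2] - P[1])  # constraint from the vertical coordinate
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X_h = vt[-1]                      # null-space direction of A
    return X_h[:3] / X_h[3]           # back from homogeneous coordinates

# Two toy cameras one unit apart along x, both looking down +z.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
joint = np.array([0.5, 0.2, 4.0])
recovered = triangulate([P1, P2], [project(P1, joint), project(P2, joint)])
print(recovered)
```

With more than two views the same system simply gains two rows per view, and the SVD returns the least-squares solution, which is why multi-view pose maps help when individual 2D predictions are noisy.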

[95] Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation

Xingyang Li,Muyang Li,Tianle Cai,Haocheng Xi,Shuo Yang,Yujun Lin,Lvmin Zhang,Songlin Yang,Jinbo Hu,Kelly Peng,Maneesh Agrawala,Ion Stoica,Kurt Keutzer,Song Han

Main category: cs.CV

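TL;DR: The paper proposes Radial Attention, a sparse attention mechanism that models spatiotemporal energy decay to substantially reduce the computational cost of long video generation while maintaining video quality.

Details Motivation: Video diffusion models now generate high-quality video, but the added temporal dimension sharply raises computational cost and limits long video generation. The authors observe that attention scores decay as spatial and temporal distance between tokens increases, and use this to design a more efficient attention mechanism.

Contribution: 1. Proposes Radial Attention, a sparse attention mechanism with O(n log n) complexity that reduces compute by following the energy decay. 2. Shows that pre-trained video diffusion models can extend their generation length through LoRA-based fine-tuning.

Method: Radial Attention uses a static attention mask in which each token attends only to spatially nearby tokens, with the attention window shrinking as temporal distance grows, which lowers compute density.

Result: Video quality is maintained across Wan2.1-14B, HunyuanVideo, and Mochi 1 with up to a 1.9x speedup; generation length extends up to 4x, training cost drops by up to 4.4x compared to direct fine-tuning, and inference is up to 3.7x faster than dense attention.

Insight: Spatiotemporal energy decay is a natural guide for designing efficient attention, and sparse attention can cut compute substantially while preserving performance.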

Abstract: Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal distance between tokens increases, akin to the physical decay of signal or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $O(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard $O(n^2)$ dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention. With minimal tuning, it enables video generation up to 4$\times$ longer while reducing training costs by up to 4.4$\times$ compared to direct fine-tuning and accelerating inference by up to 3.7$\times$ compared to dense attention inference.
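The idea of a static mask whose spatial window shrinks with temporal distance can be shown with a toy construction. This is a simplified illustration and not the paper's exact mask: tokens are laid out frame by frame with a 1D spatial index, the window halves each time the temporal distance doubles, and the name `radial_mask` and its parameters are hypothetical. No complexity claim is made for this toy version.

```python
import numpy as np

def radial_mask(num_frames: int, tokens_per_frame: int, base_window: int) -> np.ndarray:
    """Boolean (n, n) mask, n = num_frames * tokens_per_frame; True means 'may attend'."""
    frame = np.repeat(np.arange(num_frames), tokens_per_frame)  # frame index per token
    pos = np.tile(np.arange(tokens_per_frame), num_frames)      # spatial index per token
    dt = np.abs(frame[:, None] - frame[None, :])                # temporal distance
    ds = np.abs(pos[:, None] - pos[None, :])                    # spatial distance
    # Band 0 is the same frame; band k >= 1 covers temporal distances in [2^(k-1), 2^k).
    band = np.where(dt == 0, 0, np.floor(np.log2(np.maximum(dt, 1))).astype(int) + 1)
    width = np.maximum(np.right_shift(base_window, band), 1)    # window halves per band
    return ds < width

mask = radial_mask(num_frames=4, tokens_per_frame=8, base_window=8)
print(mask.shape, int(mask[0].sum()), mask.mean())
```

In this toy setting a token attends to all 8 tokens of its own frame, 4 in the adjacent frame, and 2 in each of the farther frames, so the number of attended keys per frame decays with temporal distance instead of staying constant as in dense attention.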

cs.PL [Back]

[96] Mix-of-Language-Experts Architecture for Multilingual Programming

Yifan Zong,Yuntian Deng,Pengyu Nie

Main category: cs.PL

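TL;DR: The paper proposes MoLE (Mix-of-Language-Experts), an architecture that balances efficiency and specialization in multilingual programming by combining a shared LoRA module with language-specific LoRA modules for knowledge sharing and per-language specialization.

Details Motivation: Existing approaches to multilingual programming either sacrifice language-specific performance for cost efficiency or incur high computation and storage costs. MoLE aims to balance the two.

Contribution: Proposes the MoLE architecture, which combines shared and language-specific LoRA modules to balance parameter efficiency with specialization.

Method: Jointly optimizes a base model, a shared LoRA module, and language-specific LoRA modules; at inference, the model automatically routes to the LoRA module for the language of the code token being generated.

Result: Experiments show that MoLE is more parameter-efficient than separately trained language-specific LoRAs and more accurate than a single shared model finetuned on all languages.

Insight: MoLE offers an effective way to get both efficiency and specialization in multilingual programming and suggests a direction for future multilingual model design.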

Abstract: Large language models (LLMs) have demonstrated impressive capabilities in aiding developers with tasks like code comprehension, generation, and translation. Supporting multilingual programming – i.e., coding tasks across multiple programming languages – typically requires either (1) finetuning a single LLM across all programming languages, which is cost-efficient but sacrifices language-specific specialization and performance, or (2) finetuning separate LLMs for each programming language, which allows for specialization but is computationally expensive and storage-intensive due to the duplication of parameters. This paper introduces MoLE (Mix-of-Language-Experts), a novel architecture that balances efficiency and specialization for multilingual programming. MoLE is composed of a base model, a shared LoRA (low-rank adaptation) module, and a collection of language-specific LoRA modules. These modules are jointly optimized during the finetuning process, enabling effective knowledge sharing and specialization across programming languages. During inference, MoLE automatically routes to the language-specific LoRA module corresponding to the programming language of the code token being generated. Our experiments demonstrate that MoLE achieves greater parameter efficiency compared to training separate language-specific LoRAs, while outperforming a single shared LLM finetuned for all programming languages in terms of accuracy.
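The composition of a base model, a shared LoRA module, and language-specific LoRA modules can be sketched for a single linear layer. This is a minimal sketch under stated assumptions, not the MoLE implementation: the class names are hypothetical, the low-rank updates are simply summed, and the language is passed in explicitly instead of being inferred from the token being generated.

```python
import numpy as np

class LoRA:
    """Low-rank update delta(x) = B @ (A @ x); B starts at zero so the update starts at zero."""
    def __init__(self, d_in: int, d_out: int, rank: int, rng: np.random.Generator):
        self.A = rng.normal(scale=0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))

    def delta(self, x: np.ndarray) -> np.ndarray:
        return self.B @ (self.A @ x)

class MoLELinear:
    """Frozen base weight plus a shared adapter plus one adapter per programming language."""
    def __init__(self, weight: np.ndarray, languages, rank: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight
        self.shared = LoRA(d_in, d_out, rank, rng)
        self.experts = {lang: LoRA(d_in, d_out, rank, rng) for lang in languages}

    def forward(self, x: np.ndarray, language: str) -> np.ndarray:
        y = self.weight @ x + self.shared.delta(x)   # knowledge common to all languages
        expert = self.experts.get(language)
        if expert is not None:                       # route to the matching expert, if any
            y = y + expert.delta(x)
        return y

layer = MoLELinear(np.eye(3), ["python", "rust"], rank=2)
x = np.array([1.0, 2.0, 3.0])
print(layer.forward(x, "python"))
```

Only the small `A` and `B` matrices differ per language, so adding a language costs `rank * (d_in + d_out)` parameters per layer instead of a full copy of the model, which is the efficiency argument made in the abstract.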

cs.IR

[97] From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents

Weizhi Zhang,Yangning Li,Yuanchen Bei,Junyu Luo,Guancheng Wan,Liangwei Yang,Chenxuan Xie,Yuyao Yang,Wei-Chieh Huang,Chunyu Miao,Henry Peng Zou,Xiao Luo,Yusheng Zhao,Yankai Chen,Chunkit Chan,Peilin Zhou,Xinyang Zhang,Chenwei Zhang,Jingbo Shang,Ming Zhang,Yangqiu Song,Irwin King,Philip S. Yu

Main category: cs.IR

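TL;DR: The paper argues for a shift from traditional keyword search to "Agentic Deep Research", a new paradigm built on large language models (LLMs) that addresses complex information needs through a dynamic feedback loop of autonomous reasoning, iterative retrieval, and information synthesis.

Details Motivation: Traditional keyword search cannot meet complex, multi-step information needs, while reasoning-capable LLMs open a new direction for information retrieval.

Contribution: Proposes the Agentic Deep Research paradigm, which combines autonomous reasoning with dynamic feedback and is reported to significantly outperform existing approaches.

Method: Uses LLM reasoning to build a dynamic feedback loop that integrates iterative retrieval and information synthesis, and introduces a test-time scaling law that formalizes the impact of computational depth on reasoning and search.

Result: Benchmark results indicate that the approach significantly outperforms traditional search, and the authors argue that it is poised to become the dominant paradigm for future information seeking.

Insight: Reasoning-capable LLM systems can fundamentally change how information is retrieved, offering more efficient and dynamic solutions.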

Abstract: Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. These systems transcend conventional information search techniques by tightly integrating autonomous reasoning, iterative retrieval, and information synthesis into a dynamic feedback loop. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We also introduce a test-time scaling law to formalize the impact of computational depth on reasoning and search. Supported by benchmark results and the rise of open-source implementations, we demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking. All the related resources, including industry products, research papers, benchmark datasets, and open-source implementations, are collected for the community in https://github.com/DavidZWZ/Awesome-Deep-Research.

cs.ET

[98] Experimental Assessment of Neural 3D Reconstruction for Small UAV-based Applications

Genís Castillo Gómez-Raya,Álmos Veres-Vitályos,Filip Lemic,Pablo Royo,Mario Montagud,Sergi Fernández,Sergi Abadal,Xavier Costa-Pérez

Main category: cs.ET

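TL;DR: The paper integrates Neural 3D Reconstruction (N3DR) with small Unmanned Aerial Vehicle (UAV) systems for fine-grained 3D reconstruction of small static objects, significantly improving reconstruction quality.

Details Motivation: Miniaturized UAVs can be deployed indoors and in hard-to-reach areas, but their flight dynamics and power consumption limit autonomy and mission capability. The paper proposes N3DR as a way to address these limitations.

Contribution: Designs, implements, and evaluates an N3DR-based pipeline that uses Instant-ngp, Nerfacto, and Splatfacto to improve the quality of 3D reconstructions from small-UAV imagery.

Method: Applies Instant-ngp, Nerfacto, and Splatfacto, and compares them against a baseline Structure from Motion (SfM) algorithm using image and point-cloud metrics.

Result: The N3DR pipeline significantly outperforms the SfM baseline, demonstrating its potential for fine-grained 3D reconstruction.

Insight: N3DR could further advance miniaturized UAV systems in constrained environments, for example in high-precision 3D mapping and anomaly detection.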

Abstract: The increasing miniaturization of Unmanned Aerial Vehicles (UAVs) has expanded their deployment potential to indoor and hard-to-reach areas. However, this trend introduces distinct challenges, particularly in terms of flight dynamics and power consumption, which limit the UAVs’ autonomy and mission capabilities. This paper presents a novel approach to overcoming these limitations by integrating Neural 3D Reconstruction (N3DR) with small UAV systems for fine-grained 3-Dimensional (3D) digital reconstruction of small static objects. Specifically, we design, implement, and evaluate an N3DR-based pipeline that leverages advanced models, i.e., Instant-ngp, Nerfacto, and Splatfacto, to improve the quality of 3D reconstructions using images of the object captured by a fleet of small UAVs. We assess the performance of the considered models using various imagery and pointcloud metrics, comparing them against the baseline Structure from Motion (SfM) algorithm. The experimental results demonstrate that the N3DR-enhanced pipeline significantly improves reconstruction quality, making it feasible for small UAVs to support high-precision 3D mapping and anomaly detection in constrained environments. In more general terms, our results highlight the potential of N3DR in advancing the capabilities of miniaturized UAV systems.

eess.IV

[99] Assessing Risk of Stealing Proprietary Models for Medical Imaging Tasks

Ankita Raj,Harsh Swaika,Deepankar Varma,Chetan Arora

Main category: eess.IV

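TL;DR: The study shows that proprietary medical imaging models are exposed to model stealing (MS) attacks and proposes QueryWise, a two-step attack that clones model functionality efficiently under a limited query budget.

Details Motivation: Following the success of deep learning in medical imaging, proprietary models are being deployed in diagnostic workflows. These models may be vulnerable to model stealing attacks, yet this risk remains underexplored for medical imaging.

Contribution: 1. Reveals the vulnerability of medical imaging models to MS attacks under realistic conditions; 2. Proposes QueryWise, which strengthens the attack by using unlabeled data from a proxy distribution alongside publicly available datasets.

Method: QueryWise is a two-step model stealing approach that trains the thief model on unlabeled data from a proxy distribution without incurring additional queries.

Result: The attack is shown to be effective on two medical imaging tasks: Gallbladder Cancer and COVID-19 classification.

Insight: Even without the victim model's training data and with a limited query budget, an adversary can effectively steal model functionality using public data, which underscores the importance of model protection in medical imaging.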

Abstract: The success of deep learning in medical imaging applications has led several companies to deploy proprietary models in diagnostic workflows, offering monetized services. Even though model weights are hidden to protect the intellectual property of the service provider, these models are exposed to model stealing (MS) attacks, where adversaries can clone the model’s functionality by querying it with a proxy dataset and training a thief model on the acquired predictions. While extensively studied on general vision tasks, the susceptibility of medical imaging models to MS attacks remains inadequately explored. This paper investigates the vulnerability of black-box medical imaging models to MS attacks under realistic conditions where the adversary lacks access to the victim model’s training data and operates with limited query budgets. We demonstrate that adversaries can effectively execute MS attacks by using publicly available datasets. To further enhance MS capabilities with limited query budgets, we propose a two-step model stealing approach termed QueryWise. This method capitalizes on unlabeled data obtained from a proxy distribution to train the thief model without incurring additional queries. Evaluation on two medical imaging models for Gallbladder Cancer and COVID-19 classification substantiates the effectiveness of the proposed attack. The source code is available at https://github.com/rajankita/QueryWise.

[100] NIC-RobustBench: A Comprehensive Open-Source Toolkit for Neural Image Compression and Robustness Analysis

Georgii Bychkov,Khaled Abud,Egor Kovalev,Alexander Gushchin,Dmitriy Vatolin,Anastasia Antsiferova

Main category: eess.IV

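TL;DR: The paper presents NIC-RobustBench, the first open-source toolkit for evaluating the robustness of neural image compression (NIC), supporting a wide range of codecs and attack types.

Details Motivation: With the release of the JPEG AI standard, evaluating the robustness of NIC methods has become especially important, but prior research covered only a narrow range of codecs and attacks, so a comprehensive toolkit is needed.

Contribution: Proposes NIC-RobustBench, the first open-source framework for evaluating NIC robustness; it supports a wide range of NIC codecs and attacks, also compares Rate-Distortion (RD) performance, and is easily scalable.

Method: The open-source toolkit integrates multiple NIC codecs and attack methods into a unified evaluation platform and supports analysis of adversarial defense efficiency.

Result: NIC-RobustBench includes the largest number of codecs among known NIC libraries and provides comprehensive analysis tools for NIC robustness research.

Insight: Evaluating NIC robustness requires covering many codecs and attack types, and an open-source toolkit can help standardize and advance the field.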

Abstract: Adversarial robustness of neural networks is an increasingly important area of research, combining studies on computer vision models, large language models (LLMs), and others. With the release of JPEG AI – the first standard for end-to-end neural image compression (NIC) methods – the question of evaluating NIC robustness has become critically significant. However, previous research has been limited to a narrow range of codecs and attacks. To address this, we present NIC-RobustBench, the first open-source framework to evaluate NIC robustness and adversarial defenses’ efficiency, in addition to comparing Rate-Distortion (RD) performance. The framework includes the largest number of codecs among all known NIC libraries and is easily scalable. The paper demonstrates a comprehensive overview of the NIC-RobustBench framework and employs it to analyze NIC robustness. Our code is available online at https://github.com/msu-video-group/NIC-RobustBench.

[101] Xray2Xray: World Model from Chest X-rays with Volumetric Context

Zefan Yang,Xinrui Song,Xuanang Xu,Yongyi Shi,Ge Wang,Mannudeep K. Kalra,Pingkun Yan

Main category: eess.IV

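TL;DR: The paper proposes Xray2Xray, a World Model that learns 3D structural information from 2D X-rays by modeling the transition dynamics between projection angles, improving disease diagnosis and risk prediction.

Details Motivation: Structural superposition in 2D chest X-rays limits precise diagnosis and risk prediction, so 3D structural information needs to be extracted from the 2D images to improve performance.

Contribution: Proposes Xray2Xray, a novel World Model that learns latent representations of 3D structure from 2D X-rays and applies them to downstream tasks.

Method: Combines a vision model with a transition model to capture how X-ray projections change across angular positions, learning latent representations that encode 3D structure.

Result: Outperforms supervised and self-supervised pretraining methods on cardiovascular disease risk estimation, achieves competitive performance in classifying five pathologies, and can reconstruct volumetric context.

Insight: Demonstrates that learning 3D structural information from 2D medical images is feasible, offering a new direction for medical image analysis.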

Abstract: Chest X-rays (CXRs) are the most widely used medical imaging modality and play a pivotal role in diagnosing diseases. However, as 2D projection images, CXRs are limited by structural superposition, which constrains their effectiveness in precise disease diagnosis and risk prediction. To address the limitations of 2D CXRs, this study introduces Xray2Xray, a novel World Model that learns latent representations encoding 3D structural information from chest X-rays. Xray2Xray captures the latent representations of the chest volume by modeling the transition dynamics of X-ray projections across different angular positions with a vision model and a transition model. We employed the latent representations of Xray2Xray for downstream risk prediction and disease diagnosis tasks. Experimental results showed that Xray2Xray outperformed both supervised methods and self-supervised pretraining methods for cardiovascular disease risk estimation and achieved competitive performance in classifying five pathologies in CXRs. We also assessed the quality of Xray2Xray’s latent representations through synthesis tasks and demonstrated that the latent representations can be used to reconstruct volumetric context.

[102] Deformable Medical Image Registration with Effective Anatomical Structure Representation and Divide-and-Conquer Network

Xinke Ma,Yongsheng Pan,Qingjie Zeng,Mengkang Lu,Bolysbek Murat Yerzhanuly,Bazargul Matkerim,Yong Xia

Main category: eess.IV

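TL;DR: The paper proposes EASR-DCN, an ROI-based medical image registration method that represents ROIs effectively and aligns them independently with a divide-and-conquer network, significantly improving registration performance.

Details Motivation: Current unsupervised and weakly-supervised registration methods fall short in ROI representation and independent alignment, which limits registration performance.

Contribution: Proposes EASR-DCN, which represents ROIs with a Gaussian mixture model and aligns them independently through a divide-and-conquer network, without requiring labels.

Method: Uses a Gaussian mixture model for intensity analysis to represent ROIs, and a Divide-and-Conquer Network (DCN) that learns alignment features for each ROI through separate channels before integrating them into a displacement vector field.

Result: On three MRI datasets and one CT dataset, EASR-DCN improves the Dice score over VoxelMorph by 10.31% for brain MRI, 13.01% for cardiac MRI, and 5.75% for hippocampus MRI.

Insight: Effective representation and independent handling of ROIs are key to better registration; combining an unsupervised method with a divide-and-conquer strategy reduces reliance on labeled data.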

Abstract: Effective representation of Regions of Interest (ROI) and independent alignment of these ROIs can significantly enhance the performance of deformable medical image registration (DMIR). However, current learning-based DMIR methods have limitations. Unsupervised techniques disregard ROI representation and proceed directly with aligning pairs of images, while weakly-supervised methods heavily depend on label constraints to facilitate registration. To address these issues, we introduce a novel ROI-based registration approach named EASR-DCN. Our method represents medical images through effective ROIs and achieves independent alignment of these ROIs without requiring labels. Specifically, we first used a Gaussian mixture model for intensity analysis to represent images using multiple effective ROIs with distinct intensities. Furthermore, we propose a novel Divide-and-Conquer Network (DCN) to process these ROIs through separate channels to learn feature alignments for each ROI. The resultant correspondences are seamlessly integrated to generate a comprehensive displacement vector field. Extensive experiments were performed on three MRI and one CT datasets to showcase the superior accuracy and deformation reduction efficacy of our EASR-DCN. Compared to VoxelMorph, our EASR-DCN achieved improvements of 10.31% in the Dice score for brain MRI, 13.01% for cardiac MRI, and 5.75% for hippocampus MRI, highlighting its promising potential for clinical applications. The code for this work will be released upon acceptance of the paper.
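
The improvements above are reported in terms of the Dice score, which measures the overlap between two binary segmentation masks. A minimal sketch of the standard definition follows; the function name and the convention of returning 1.0 for two empty masks are choices made here, not details from the paper.

```python
import numpy as np

def dice_score(mask_a, mask_b) -> float:
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return float(2.0 * np.logical_and(a, b).sum() / denom)

warped = [1, 1, 0, 0]  # toy warped moving-image label
fixed = [1, 0, 0, 0]   # toy fixed-image label
print(dice_score(warped, fixed))  # 2 * 1 / (2 + 1)
```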

[103] Explicit Residual-Based Scalable Image Coding for Humans and Machines

Yui Tatsumi,Ziyue Zeng,Hiroshi Watanabe

Main category: eess.IV

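TL;DR: The paper proposes two explicit residual-based scalable image coding methods (FR-ICMH and PR-ICMH) that serve both human and machine vision, improving coding efficiency and interpretability.

Details Motivation: As images are increasingly consumed by both humans and image recognition models, scalable compression that serves both is needed. Existing methods rely too heavily on the learning capacity of neural networks and give too little consideration to architectural design.

Contribution: Proposes two explicit residual-based scalable coding methods (FR-ICMH and PR-ICMH) that improve coding efficiency and interpretability and are applicable to various machine vision tasks.

Method: Integrates an explicit residual compression mechanism, as commonly used in resolution-scalable coding methods such as JPEG2000, into the ICMH framework, yielding two complementary methods: FR-ICMH and PR-ICMH.

Result: Experiments show that PR-ICMH achieves up to 29.57% BD-rate savings over previous work.

Insight: The explicit residual mechanism improves compression performance and also offers a flexible trade-off between encoder complexity and compression performance.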

Abstract: Scalable image compression is a technique that progressively reconstructs multiple versions of an image for different requirements. In recent years, images have increasingly been consumed not only by humans but also by image recognition models. This shift has drawn growing attention to scalable image compression methods that serve both machine and human vision (ICMH). Many existing models employ neural network-based codecs, known as learned image compression, and have made significant strides in this field by carefully designing the loss functions. In some cases, however, models are overly reliant on their learning capacity, and their architectural design is not sufficiently considered. In this paper, we enhance the coding efficiency and interpretability of ICMH framework by integrating an explicit residual compression mechanism, which is commonly employed in resolution scalable coding methods such as JPEG2000. Specifically, we propose two complementary methods: Feature Residual-based Scalable Coding (FR-ICMH) and Pixel Residual-based Scalable Coding (PR-ICMH). These proposed methods are applicable to various machine vision tasks. Moreover, they provide flexibility to choose between encoder complexity and compression performance, making it adaptable to diverse application requirements. Experimental results demonstrate the effectiveness of our proposed methods, with PR-ICMH achieving up to 29.57% BD-rate savings over the previous work.
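
The idea of explicit pixel-residual scalable coding can be shown with a toy two-layer scheme. This is only a sketch of the general principle: uniform quantizers stand in for the learned codecs, and the step sizes and variable names are hypothetical rather than taken from the paper.

```python
import numpy as np

def quantize(x: np.ndarray, step: float) -> np.ndarray:
    """Uniform quantizer used as a stand-in for a lossy codec."""
    return np.round(x / step) * step

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 255.0, size=(16, 16))

# Base layer: a coarse reconstruction (in ICMH this would serve machine vision).
base = quantize(image, step=32.0)

# Enhancement layer: code the explicit residual between the input and the base.
residual = image - base
enhanced = base + quantize(residual, step=4.0)

print(np.abs(image - base).max(), np.abs(image - enhanced).max())
```

The decoder that only needs the base layer never touches the residual stream, while the human-viewing decoder adds it back, which is what makes the bitstream scalable.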

[104] Reconsidering Explicit Longitudinal Mammography Alignment for Enhanced Breast Cancer Risk Prediction

Solveig Thrun,Stine Hansen,Zijun Sun,Nele Blum,Suaiba A. Salahuddin,Kristoffer Wickstrøm,Elisabeth Wetzer,Robert Jenssen,Maik Stille,Michael Kampffmeyer

Main category: eess.IV

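TL;DR: The paper examines how explicit longitudinal alignment affects breast cancer risk prediction in mammography, compares alignment in input space and in representation space, and concludes that image-level alignment is superior to representation-level alignment.

Details Motivation: Longitudinal mammography data is important for breast cancer risk prediction, but how to spatially align exams from different time points to capture tissue changes remains underexplored.

Contribution: Provides insights into where explicit alignment should occur (input space vs. representation space) and whether alignment and risk prediction should be jointly optimized, and demonstrates the superiority of image-level alignment.

Method: Compares explicit alignment in input space and in representation space, and studies the joint optimization of alignment quality and risk prediction performance.

Result: Jointly optimizing alignment and risk prediction leads to a trade-off between alignment quality and predictive performance, while image-level alignment yields better deformation field quality and risk prediction accuracy than representation-level alignment.

Insight: Image-level alignment is better suited to longitudinal mammography data, giving higher-quality deformation fields and better risk prediction, and points to a direction for future work.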

Abstract: Regular mammography screening is essential for early breast cancer detection. Deep learning-based risk prediction methods have sparked interest to adjust screening intervals for high-risk groups. While early methods focused only on current mammograms, recent approaches leverage the temporal aspect of screenings to track breast tissue changes over time, requiring spatial alignment across different time points. Two main strategies for this have emerged: explicit feature alignment through deformable registration and implicit learned alignment using techniques like transformers, with the former providing more control. However, the optimal approach for explicit alignment in mammography remains underexplored. In this study, we provide insights into where explicit alignment should occur (input space vs. representation space) and if alignment and risk prediction should be jointly optimized. We demonstrate that jointly learning explicit alignment in representation space while optimizing risk estimation performance, as done in the current state-of-the-art approach, results in a trade-off between alignment quality and predictive performance and show that image-level alignment is superior to representation-level alignment, leading to better deformation field quality and enhanced risk prediction accuracy. The code is available at https://github.com/sot176/Longitudinal_Mammogram_Alignment.git.

[105] Filling of incomplete sinograms from sparse PET detector configurations using a residual U-Net

Klara Leffler,Luigi Tommaso Luppino,Samuel Kuttner,Karin Söderkvist,Jan Axelsson

Main category: eess.IV

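TL;DR: The paper proposes a method based on a modified Residual U-Net to restore the sinogram data missing under sparse PET detector configurations, with the aim of lowering the cost of long axial field-of-view PET scanners.

Details Motivation: Long axial field-of-view PET scanners require densely packed photodetectors, which is costly. Sparse detector configurations reduce cost at the expense of image quality. The paper aims to recover the missing data with deep learning.

Contribution: Proposes a Residual U-Net-based deep learning model that recovers the sinogram data missing under a sparse configuration and outperforms 2D interpolation.

Method: A modified Residual U-Net is trained on clinical PET data in which 50% of the detectors are removed in simulation in a chessboard pattern, and is used to restore the missing sinogram data.

Result: The model recovers the data with a mean absolute error below two events per pixel. Despite a loss of sharpness in fine image detail, it outperforms 2D interpolation.

Insight: Sparse detector configurations combined with deep learning are a viable way to reduce PET scanner cost, supporting the development of cost-effective total-body PET scanners.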

Abstract: Long axial field-of-view PET scanners offer increased field-of-view and sensitivity compared to traditional PET scanners. However, a significant cost is associated with the densely packed photodetectors required for the extended-coverage systems, limiting clinical utilisation. To mitigate the cost limitations, alternative sparse system configurations have been proposed, allowing an extended field-of-view PET design with detector costs similar to a standard PET system, albeit at the expense of image quality. In this work, we propose a deep sinogram restoration network to fill in the missing sinogram data. Our method utilises a modified Residual U-Net, trained on clinical PET scans from a GE Signa PET/MR, simulating the removal of 50% of the detectors in a chessboard pattern (retaining only 25% of all lines of response). The model successfully recovers missing counts, with a mean absolute error below two events per pixel, outperforming 2D interpolation in both sinogram and reconstructed image domain. Notably, the predicted sinograms exhibit a smoothing effect, leading to reconstructed images lacking sharpness in finer details. Despite these limitations, the model demonstrates a substantial capacity for compensating for the undersampling caused by the sparse detector configuration. This proof-of-concept study suggests that sparse detector configurations, combined with deep learning techniques, offer a viable alternative to conventional PET scanner designs. This approach supports the development of cost-effective, total body PET scanners, allowing a significant step forward in medical imaging technology.
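
The abstract's remark that removing 50% of the detectors retains only 25% of the lines of response (LORs) follows from the fact that an LOR needs both of its endpoint detectors. The sketch below checks this on a toy detector grid; the grid size is arbitrary, and counting every ordered detector pair (including geometrically invalid ones) is a simplifying assumption.

```python
import numpy as np

rings, crystals = 8, 16  # toy scanner geometry (illustrative values)
ring_idx, crystal_idx = np.indices((rings, crystals))

# Chessboard pattern: keep a detector when ring + crystal index is even.
kept = ((ring_idx + crystal_idx) % 2 == 0).ravel()

# An LOR survives only if both detectors at its ends are kept.
lor_kept = np.outer(kept, kept)

detector_fraction = kept.mean()
lor_fraction = lor_kept.mean()
print(detector_fraction, lor_fraction)  # 0.5 0.25
```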

[106] NeRF-based CBCT Reconstruction needs Normalization and Initialization

Zhuowei Xu,Han Li,Dai Sun,Zhicheng Li,Yujia Li,Qingpeng Kong,Zhiwei Cheng,Nassir Navab,S. Kevin Zhou

Main category: eess.IV

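TL;DR: The paper proposes a Normalized Hash Encoder and a Mapping Consistency Initialization strategy to address the local-global training mismatch in NeRF-based CBCT reconstruction, improving training stability and reconstruction quality.

Details Motivation: In NeRF-based CBCT reconstruction, the mismatch between the locally sparse training of the hash encoder and the globally dense training of the neural network misaligns features, which harms training stability and reconstruction quality.

Contribution: 1. Proposes a Normalized Hash Encoder to improve feature consistency; 2. Introduces a Mapping Consistency Initialization (MCI) strategy to improve early-training stability.

Method: 1. Mitigates the local-global optimization mismatch with the Normalized Hash Encoder; 2. Initializes the neural network using the global mapping property of a well-trained model.

Result: Validated on 128 CT cases from 4 datasets, with substantial improvements in training efficiency and reconstruction performance.

Insight: The local-global training mismatch is a key problem for NeRF-based CBCT reconstruction, and simple normalization and initialization strategies can substantially alleviate it.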

Abstract: Cone Beam Computed Tomography (CBCT) is widely used in medical imaging. However, the limited number and intensity of X-ray projections make reconstruction an ill-posed problem with severe artifacts. NeRF-based methods have achieved great success in this task. However, they suffer from a local-global training mismatch between their two key components: the hash encoder and the neural network. Specifically, in each training step, only a subset of the hash encoder’s parameters is used (local sparse), whereas all parameters in the neural network participate (global dense). Consequently, hash features generated in each step are highly misaligned, as they come from different subsets of the hash encoder. These misalignments from different training steps are then fed into the neural network, causing repeated inconsistent global updates in training, which leads to unstable training, slower convergence, and degraded reconstruction quality. Aiming to alleviate the impact of this local-global optimization mismatch, we introduce a Normalized Hash Encoder, which enhances feature consistency and mitigates the mismatch. Additionally, we propose a Mapping Consistency Initialization (MCI) strategy that initializes the neural network before training by leveraging the global mapping property from a well-trained model. The initialized neural network exhibits improved stability during early training, enabling faster convergence and enhanced reconstruction performance. Our method is simple yet effective, requiring only a few lines of code while substantially improving training efficiency on 128 CT cases collected from 4 different datasets, covering 7 distinct anatomical regions.

cs.RO

[107] Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects

Federico Tavella,Kathryn Mearns,Angelo Cangelosi

Main category: cs.RO

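TL;DR: The paper compares several vision-language models (VLMs) for robotic scene understanding, analyzes the differences between single-view and multi-view captioning and between recognizing real and 3D-printed objects, and offers practical insights for deploying foundation models in real robotic tasks.

Details Motivation: As robotic scene understanding relies more and more on VLMs, their performance on real and 3D-printed objects needs to be evaluated to establish their suitability and limitations in practical robotic tasks.

Contribution: Compares several VLMs in a robotic setting, analyzes the trade-offs between single-view and multi-view captioning, and shows that the models generalize poorly to novel object representations.

Method: A robotic arm equipped with an RGB camera collects multi-view images of tabletop scenes; BLIP and other VLMs generate scene descriptions, which are evaluated quantitatively for object identification accuracy, completeness, and naturalness.

Result: The VLMs perform well when recognizing common objects but fail to generalize to novel 3D-printed objects.

Insight: VLMs show promise for robotic tasks but need better generalization to novel objects before they can be relied on in real deployments.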

Abstract: Robotic scene understanding increasingly relies on vision-language models (VLMs) to generate natural language descriptions of the environment. In this work, we present a comparative study of captioning strategies for tabletop scenes captured by a robotic arm equipped with an RGB camera. The robot collects images of objects from multiple viewpoints, and we evaluate several models that generate scene descriptions. We compare the performance of various captioning models, such as BLIP and VLMs. Our experiments examine the trade-offs between single-view and multi-view captioning, and the difference between recognising real-world and 3D-printed objects. We quantitatively evaluate object identification accuracy, completeness, and naturalness of the generated captions. Results show that VLMs can be used in robotic settings where common objects need to be recognised, but fail to generalise to novel representations. Our findings provide practical insights into deploying foundation models for embodied agents in real-world settings.

[108] CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation

Hao Li,Shuai Yang,Yilun Chen,Yang Tian,Xiaoda Yang,Xinyi Chen,Hanqing Wang,Tai Wang,Feng Zhao,Dahua Lin,Jiangmiao Pang

Main category: cs.RO

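TL;DR: CronusVLA extends single-frame vision-language-action (VLA) models to multi-frame prediction through an efficient post-training stage, making better use of motion information and raising task success rates.

Details Motivation: Existing VLA models are constrained by a single-frame observation paradigm and cannot fully exploit the motion information in multi-frame historical observations, because the large backbone makes this computationally expensive. CronusVLA aims to enable multi-frame modeling efficiently.

Contribution: 1) Proposes the CronusVLA framework, which extends single-frame VLA models to the multi-frame paradigm; 2) introduces motion feature encoding and a cross-frame decoding mechanism; 3) proposes an action adaptation mechanism based on feature-action retrieval.

Method: Consists of three parts: single-frame pretraining, multi-frame motion feature encoding and aggregation, and cross-frame decoding. Caching past motion features reduces redundant computation.

Result: Achieves a 70.9% success rate on SimplerEnv, a 12.7% improvement over OpenVLA on LIBERO, and robust performance in real-world Franka robot experiments.

Insight: Efficient use of motion features from historical frames can substantially improve the accuracy and generalization of action prediction while keeping computational overhead low.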

Abstract: Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong generalization across manipulation tasks. However, they remain constrained by a single-frame observation paradigm and cannot fully benefit from the motion information offered by aggregated multi-frame historical observations, as the large vision-language backbone introduces substantial computational cost and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm through an efficient post-training stage. CronusVLA comprises three key components: (1) single-frame pretraining on large-scale embodied datasets with autoregressive action tokens prediction, which establishes an embodied vision-language foundation; (2) multi-frame encoding, adapting the prediction of vision-language backbones from discrete action tokens to motion features during post-training, and aggregating motion features from historical frames into a feature chunking; (3) cross-frame decoding, which maps the feature chunking to accurate actions via a shared decoder with cross-attention. By reducing redundant token computation and caching past motion features, CronusVLA achieves efficient inference. As an application of motion features, we further propose an action adaptation mechanism based on feature-action retrieval to improve model performance during finetuning. CronusVLA achieves state-of-the-art performance on SimplerEnv with 70.9% success rate, and 12.7% improvement over OpenVLA on LIBERO. Real-world Franka experiments also show the strong performance and robustness.
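
The efficiency argument in the abstract (cache past motion features instead of re-encoding every historical frame) can be sketched with a fixed-length buffer. The class name `MotionFeatureCache`, the stand-in encoder, and the feature size are hypothetical; the real system uses a vision-language backbone and a cross-attention decoder that are not reproduced here.

```python
from collections import deque
import numpy as np

class MotionFeatureCache:
    """Keeps the most recent motion features so each frame is encoded only once."""

    def __init__(self, history: int, encode):
        self.buffer = deque(maxlen=history)  # oldest feature is dropped automatically
        self.encode = encode                 # stand-in for the expensive backbone
        self.calls = 0

    def step(self, frame) -> np.ndarray:
        self.calls += 1                      # one backbone call per new frame
        self.buffer.append(self.encode(frame))
        return np.stack(self.buffer)         # feature chunk passed on to a decoder

# Toy encoder: maps a scalar "frame" to a 4-dimensional feature vector.
cache = MotionFeatureCache(history=3, encode=lambda f: np.full(4, f, dtype=float))
for t in range(5):
    chunk = cache.step(t)
print(chunk.shape, cache.calls)  # (3, 4) 5
```

With a history of `H` frames, re-encoding everything would cost `H` backbone calls per step; the cache keeps it at one.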

[109] Look to Locate: Vision-Based Multisensory Navigation with 3-D Digital Maps for GNSS-Challenged Environments

Ola Elmaghraby,Eslam Mounier,Paulo Ricardo Marques de Araujo,Aboelmagd Noureldin

Main category: cs.RO

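TL;DR: The paper proposes a low-cost, vision-based multi-sensor navigation system that combines monocular depth estimation, semantic filtering, and visual map registration (VMR) for vehicle positioning in GNSS-challenged environments, significantly improving positioning accuracy and robustness.

Details Motivation: Accurate and robust vehicle positioning remains a significant challenge in GNSS-denied environments such as indoor parking structures or dense urban canyons. The authors aim to address this with a low-cost vision system.

Contribution: Proposes a multi-sensor navigation system combining monocular depth estimation, semantic filtering, and visual map registration, which significantly improves positioning accuracy in GNSS-denied environments (sub-meter accuracy 92% of the time indoors).

Method: Combines monocular depth estimation, semantic filtering, and visual map registration (VMR) against 3D digital maps, with multi-sensor fusion for positioning.

Result: The system performs well both indoors and outdoors, reaching sub-meter accuracy 92% of the time indoors and more than 80% outdoors, with positioning accuracy improved by approximately 88% on average.

Insight: Low-cost monocular vision combined with 3D maps shows potential for scalable navigation in GNSS-denied environments and offers a practical solution for land vehicles.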

Abstract: In Global Navigation Satellite System (GNSS)-denied environments such as indoor parking structures or dense urban canyons, achieving accurate and robust vehicle positioning remains a significant challenge. This paper proposes a cost-effective, vision-based multi-sensor navigation system that integrates monocular depth estimation, semantic filtering, and visual map registration (VMR) with 3-D digital maps. Extensive testing in real-world indoor and outdoor driving scenarios demonstrates the effectiveness of the proposed system, achieving sub-meter accuracy of 92% indoors and more than 80% outdoors, with consistent horizontal positioning and heading average root mean-square errors of approximately 0.98 m and 1.25°, respectively. Compared to the baselines examined, the proposed solution significantly reduced drift and improved robustness under various conditions, achieving positioning accuracy improvements of approximately 88% on average. This work highlights the potential of cost-effective monocular vision systems combined with 3D maps for scalable, GNSS-independent navigation in land vehicles.
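
The two positioning metrics quoted above, horizontal root mean-square error and the share of epochs with sub-meter error, can be computed as in the following sketch. The function names and the toy trajectory are illustrative, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def horizontal_errors(estimated: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Per-epoch Euclidean error in the horizontal plane (metres)."""
    return np.linalg.norm(estimated - reference, axis=1)

def rmse(errors: np.ndarray) -> float:
    return float(np.sqrt(np.mean(errors ** 2)))

def sub_meter_rate(errors: np.ndarray) -> float:
    """Fraction of epochs whose horizontal error is below 1 m."""
    return float(np.mean(errors < 1.0))

# Toy east/north positions in metres (made-up numbers, not data from the paper).
reference = np.zeros((3, 2))
estimated = np.array([[0.0, 0.0], [3.0, 4.0], [0.5, 0.0]])

errors = horizontal_errors(estimated, reference)
print(errors, rmse(errors), sub_meter_rate(errors))
```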

cs.LG

[110] Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models

Zihan Wang,Rui Pan,Jiarui Yao,Robert Csordas,Linjie Li,Lu Yin,Jiajun Wu,Tong Zhang,Manling Li,Shiwei Liu

Main category: cs.LG

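TL;DR: The paper proposes Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential communication between experts within each layer, improving the model's representational capacity and compute efficiency.

Details Motivation: In traditional MoE models, experts operate independently in parallel, which limits interaction between them. CoE introduces a sequential communication mechanism so that experts can interact dynamically within each layer.

Contribution: 1. Proposes the CoE architecture, which strengthens MoE through sequential expert communication; 2. Designs a dynamic routing mechanism that lets tokens select different experts at each iteration; 3. Demonstrates improved performance and reduced memory usage.

Method: Each CoE layer processes tokens iteratively across a chain of experts. A dedicated router at each iteration step lets tokens re-evaluate and select different experts.

Result: On math reasoning tasks, CoE reduces validation loss from 1.20 to 1.12 compared to a standard MoE; using 2x iterations matches the performance of 3x expert selections (in width) while reducing memory usage by 17.6-42%.

Insight: CoE's benefits stem from its iterative residual structure and the stronger expert specialization enabled by iterative routing. It offers a new scaling axis, depth through expert iteration, which complements conventional width/depth scaling.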

Abstract: We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. Unlike traditional MoE models, where experts operate independently in parallel, CoE processes tokens iteratively across a chain of experts inside a layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each iteration step within a layer. This design allows tokens to re-evaluate and select different experts during each iteration, rather than being statically assigned. As a result, CoE introduces a flexible routing mechanism that increases the diversity of expert combinations and enriches the model’s representational capacity. CoE demonstrates improved performance under fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to 1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling axis: depth through expert iteration, which complements conventional width/depth scaling. For example, using 2x iterations matches the performance of 3x expert selections (in width), while reducing memory usage by 17.6-42% relative to other scaling strategies. Our analysis reveals that CoE’s benefits stem from its iterative residual structure and enhanced expert specialization empowered by iterative routing, which together unlock more expressive representations. Code is available at https://github.com/ZihanWang314/coe.
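
The per-iteration routing described in the abstract can be sketched for a single token. This is a minimal illustration, not the released implementation: the expert form (`tanh` of a linear map), the top-k softmax gating, and all sizes are assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k, n_iters = 8, 4, 2, 2  # illustrative sizes

experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
# One dedicated router per iteration step inside the layer.
routers = [rng.normal(size=(n_experts, d)) for _ in range(n_iters)]

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def coe_layer(x: np.ndarray):
    """Process one token through a chain of expert iterations with residuals."""
    h, chosen = x, []
    for t in range(n_iters):
        logits = routers[t] @ h               # routing depends on the current state
        top = np.argsort(logits)[-top_k:]     # experts selected at this iteration
        gates = softmax(logits[top])
        update = sum(g * np.tanh(experts[i] @ h) for g, i in zip(gates, top))
        h = h + update                        # iterative residual connection
        chosen.append(sorted(top.tolist()))
    return h, chosen

out, chosen = coe_layer(np.ones(d))
print(chosen)  # expert indices picked at each iteration
```

Because the second router sees the output of the first iteration, the token's expert choice can change between iterations, which is the difference from a standard MoE layer that routes once.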

[111] Thought Anchors: Which LLM Reasoning Steps Matter?

Paul C. Bogdan,Uzay Macar,Neel Nanda,Arthur Conmy

Main category: cs.LG

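TL;DR: The paper presents three complementary attribution methods for identifying the key steps ("thought anchors") in the reasoning of large language models (LLMs), provides a visualization tool, and shows the potential of sentence-level analysis for understanding reasoning models.

Details Motivation: Reasoning LLMs perform strongly, but their long-form chain-of-thought reasoning creates interpretability challenges. The authors argue that sentence-level analysis is a promising way to understand the reasoning process.

Contribution: 1. Presents three complementary attribution methods (black-box, white-box, and causal) for identifying thought anchors; 2. Finds that thought anchors, typically planning or backtracking sentences, disproportionately influence subsequent reasoning; 3. Provides an open-source tool for visual analysis.

Method: 1. Black-box: measures each sentence's counterfactual importance by comparing final answers across rollouts conditioned on different versions of the sentence; 2. White-box: aggregates attention patterns between pairs of sentences to identify "broadcasting" sentences; 3. Causal: measures logical connections between sentences by suppressing attention toward one sentence.

Result: Each method provides evidence that thought anchors exist and play an outsized role in the reasoning process. The consistency across the three methods supports the value of sentence-level analysis.

Insight: Sentence-level analysis can reveal the key steps of a reasoning model, and thought anchors offer a new perspective on model interpretability.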

Abstract: Reasoning large language models have recently achieved state-of-the-art performance in many fields. However, their long-form chain-of-thought reasoning creates interpretability challenges as each generated token depends on all previous ones, making the computation harder to decompose. We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We present three complementary attribution methods: (1) a black-box method measuring each sentence’s counterfactual importance by comparing final answers across 100 rollouts conditioned on the model generating that sentence or one with a different meaning; (2) a white-box method of aggregating attention patterns between pairs of sentences, which identified “broadcasting” sentences that receive disproportionate attention from all future sentences via “receiver” attention heads; (3) a causal attribution method measuring logical connections between sentences by suppressing attention toward one sentence and measuring the effect on each future sentence’s tokens. Each method provides evidence for the existence of thought anchors, reasoning steps that have outsized importance and that disproportionately influence the subsequent reasoning process. These thought anchors are typically planning or backtracking sentences. We provide an open-source tool (www.thought-anchors.com) for visualizing the outputs of our methods, and present a case study showing converging patterns across methods that map how a model performs multi-step reasoning. The consistency across methods demonstrates the potential of sentence-level analysis for a deeper understanding of reasoning models.
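
The black-box method compares final answers from rollouts that keep a sentence with rollouts in which it is replaced. One way to turn that comparison into a score is sketched below. The abstract does not specify the distance used, so the total variation distance between the two answer distributions is an assumption, as are the function names and the toy rollout data.

```python
from collections import Counter

def answer_distribution(answers):
    """Empirical distribution over final answers from a list of rollouts."""
    counts = Counter(answers)
    total = len(answers)
    return {answer: count / total for answer, count in counts.items()}

def counterfactual_importance(with_sentence, with_replacement) -> float:
    """Total variation distance between the two answer distributions (assumed metric)."""
    p = answer_distribution(with_sentence)
    q = answer_distribution(with_replacement)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)

# Toy data: 100 rollouts per condition (made-up answers, not results from the paper).
kept = ["42"] * 90 + ["41"] * 10      # sentence kept in the reasoning trace
swapped = ["42"] * 50 + ["41"] * 50   # sentence replaced by one with a different meaning
print(counterfactual_importance(kept, swapped))
```

A score near 0 means the sentence barely changes where the reasoning ends up; a score near 1 means the final answer depends almost entirely on it.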

[112] Scaling Speculative Decoding with Lookahead Reasoning

Yichao Fu,Rui Ge,Zelei Shao,Zhijie Deng,Hao Zhang

Main category: cs.LG

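TL;DR: The paper introduces Lookahead Reasoning to address the limited speedup of existing speculative decoding (SD) on long chain-of-thought reasoning, achieving more parallelism and faster inference.

Details Motivation: On long reasoning tasks, the speedup of existing SD is capped because the probability that an entire draft is correct falls exponentially with its length, so additional hardware compute cannot be fully used. A new method is needed to raise this algorithmic ceiling.

Contribution: Proposes Lookahead Reasoning, which adds a second, step-level layer of parallelism combined with semantic verification, raising the speedup ceiling of SD; its effectiveness is shown on multiple benchmarks.

Method: A lightweight draft model proposes several future reasoning steps; the target model expands each proposal in one batched pass, and a verifier keeps the semantically correct steps while the target regenerates any that fail.

Result: Across GSM8K, AIME, and other benchmarks, Lookahead Reasoning improves the speedup of SD from 1.4x to 2.1x while preserving answer quality.

Insight: Requiring semantic correctness rather than exact token matching allows more parallelism; combining a lightweight draft model with semantic verification overcomes the limits of token-level SD on multi-step reasoning.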

Abstract: Reasoning models excel by generating long chain-of-thoughts, but decoding the resulting thousands of tokens is slow. Token-level speculative decoding (SD) helps, but its benefit is capped, because the chance that an entire $\gamma$-token guess is correct falls exponentially as $\gamma$ grows. This means allocating more compute for longer token drafts faces an algorithmic ceiling – making the speedup modest and hardware-agnostic. We raise this ceiling with Lookahead Reasoning, which exploits a second, step-level layer of parallelism. Our key insight is that reasoning models generate step-by-step, and each step needs only to be semantically correct, not exact token matching. In Lookahead Reasoning, a lightweight draft model proposes several future steps; the target model expands each proposal in one batched pass, and a verifier keeps semantically correct steps while letting the target regenerate any that fail. Token-level SD still operates within each reasoning step, so the two layers of parallelism multiply. We show Lookahead Reasoning lifts the peak speedup of SD both theoretically and empirically. Across GSM8K, AIME, and other benchmarks, Lookahead Reasoning improves the speedup of SD from 1.4x to 2.1x while preserving answer quality, and its speedup scales better with additional GPU throughput. Our code is available at https://github.com/hao-ai-lab/LookaheadReasoning
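
The algorithmic ceiling mentioned in the abstract can be made concrete with a short calculation. The sketch assumes that each draft token is accepted independently with the same probability `alpha`, which is a common simplification in analyses of speculative decoding and not a claim taken from this paper; the function names are illustrative.

```python
def p_full_draft_accepted(alpha: float, gamma: int) -> float:
    """Probability that all gamma draft tokens are accepted (i.i.d. assumption)."""
    return alpha ** gamma

def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model pass under the same assumption."""
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

alpha = 0.8
for gamma in (2, 4, 8, 16, 64):
    print(gamma,
          round(p_full_draft_accepted(alpha, gamma), 4),
          round(expected_tokens_per_pass(alpha, gamma), 4))
```

However long the draft, the expected yield per pass stays below `1 / (1 - alpha)`, which is 5 tokens for `alpha = 0.8`. Spending more compute on longer token-level drafts therefore saturates, which is the motivation for adding a second, step-level layer of parallelism.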

[113] ConCM: Consistency-Driven Calibration and Matching for Few-Shot Class-Incremental Learning

QinZhe Wang,Zixuan Chen,Keke Huang,Xiu Su,Chunhua Yang,Chang Xu

Main category: cs.LG

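TL;DR: The paper proposes a Consistency-driven Calibration and Matching framework (ConCM) that addresses prototype deviation and structure fixity in few-shot class-incremental learning (FSCIL). It strengthens feature representations through dual consistency and achieves SOTA performance.

Details Motivation: Existing space-reservation methods in FSCIL suffer from prototype deviation and structure fixity, which limit the expressiveness of the embedding space. The paper mitigates the resulting knowledge conflict by optimizing feature-structure dual consistency.

Contribution: Proposes the ConCM framework with memory-aware prototype calibration and dynamic structure matching, which provide conceptual center consistency of features and cross-session structure consistency without requiring class-number priors.

Method: 1) A prototype calibration inspired by hippocampal associative memory enhances the semantic consistency of features; 2) dynamic structure matching aligns the calibrated features to a session-specific optimal manifold space, ensuring structure consistency.

Result: On mini-ImageNet and CUB200, ConCM surpasses the current best method by 3.20% and 3.68%, respectively, in harmonic accuracy of incremental sessions.

Insight: Theoretical analysis of geometric optimality and maximum matching supports the effectiveness of dual consistency for FSCIL and suggests a direction for future work.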

Abstract: Few-Shot Class-Incremental Learning (FSCIL) requires models to adapt to novel classes with limited supervision while preserving learned knowledge. Existing prospective learning-based space construction methods reserve space to accommodate novel classes. However, prototype deviation and structure fixity limit the expressiveness of the embedding space. In contrast to fixed space reservation, we explore the optimization of feature-structure dual consistency and propose a Consistency-driven Calibration and Matching Framework (ConCM) that systematically mitigates the knowledge conflict inherent in FSCIL. Specifically, inspired by hippocampal associative memory, we design a memory-aware prototype calibration that extracts generalized semantic attributes from base classes and reintegrates them into novel classes to enhance the conceptual center consistency of features. Further, we propose dynamic structure matching, which adaptively aligns the calibrated features to a session-specific optimal manifold space, ensuring cross-session structure consistency. Theoretical analysis shows that our method satisfies both geometric optimality and maximum matching, thereby overcoming the need for class-number priors. On large-scale FSCIL benchmarks including mini-ImageNet and CUB200, ConCM achieves state-of-the-art performance, surpassing the current optimal method by 3.20% and 3.68% in harmonic accuracy of incremental sessions.
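Harmonic accuracy, the metric cited above, is commonly computed in FSCIL work as the harmonic mean of base-class and novel-class accuracy. The sketch below assumes that common definition; the abstract does not spell out the exact formula, and the example numbers are made up.

```python
def harmonic_accuracy(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy.

    Assumes the definition commonly used in FSCIL evaluations; the value is
    low whenever either of the two accuracies is low.
    """
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)


if __name__ == "__main__":
    # Hypothetical accuracies for one incremental session.
    print(harmonic_accuracy(0.80, 0.40))
```

Unlike a plain average, this metric cannot be inflated by strong base-class accuracy alone, which is why it is often reported for incremental sessions.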

q-bio.NC [Back]

[114] Convergent and divergent connectivity patterns of the arcuate fasciculus in macaques and humans

Jiahao Huang,Ruifeng Li,Wenwen Yu,Anan Li,Xiangning Li,Mingchao Yan,Lei Xie,Qingrun Zeng,Xueyan Jia,Shuxin Wang,Ronghui Ju,Feng Chen,Qingming Luo,Hui Gong,Xiaoquan Yang,Yuanjing Feng,Zheng Wang

Main category: q-bio.NC

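TL;DR: The paper compares the connectivity patterns of the arcuate fasciculus (AF) in macaques and humans, revealing interspecies differences relevant to the evolution of language networks. Combining single-neuron tracing with whole-brain diffusion MRI, it finds that the human AF shows broader temporal integration and stronger frontoparietal connectivity.

Details Motivation: How AF connectivity in macaques differs from that in humans remains contentious. Resolving this helps explain the evolutionary basis of human language networks and the anatomy underlying related disorders such as aphasia and dyslexia.

Contribution: 1) Systematically compares macaque and human AF connectivity using cross-scale single-neuron tracing and high-field MRI; 2) quantifies interspecies connectivity differences (such as the broader temporal integration of the human AF), providing neuroanatomical evidence on language evolution; 3) offers a framework for studying the mechanisms of AF-related disorders.

Method: 1) Single-neuron tracing in macaques using viral-based genetic labeling and fluorescence micro-optical sectioning tomography; 2) whole-brain tractography from 11.7T diffusion MRI; 3) spectral embedding analysis of 7.0T MRI in humans for a cross-species connectomic comparison; 4) Kullback-Leibler analysis to quantify connectivity differences.

Result: The macaque AF originates in the temporal-parietal cortex, traverses the auditory cortex and parietal operculum, and projects into prefrontal regions. The human AF extends further into the middle temporal gyrus and shows stronger prefrontal and parietal operculum connectivity. These differences likely underpin the evolutionary specialization of human language networks.

Insight: 1) The broader temporal integration and strengthened frontoparietal linkages of the human AF may form a neural substrate for advanced language processing; 2) differences in AF connectivity patterns offer a new perspective on language-related disorders.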

Abstract: The organization and connectivity of the arcuate fasciculus (AF) in nonhuman primates remain contentious, especially concerning how its anatomy diverges from that of humans. Here, we combined cross-scale single-neuron tracing - using viral-based genetic labeling and fluorescence micro-optical sectioning tomography in macaques (n = 4; age 3 - 11 years) - with whole-brain tractography from 11.7T diffusion MRI. Complemented by spectral embedding analysis of 7.0T MRI in humans, we performed a comparative connectomic analysis of the AF across species. We demonstrate that the macaque AF originates in the temporal-parietal cortex, traverses the auditory cortex and parietal operculum, and projects into prefrontal regions. In contrast, the human AF exhibits greater expansion into the middle temporal gyrus and stronger prefrontal and parietal operculum connectivity - divergences quantified by Kullback-Leibler analysis that likely underpin the evolutionary specialization of human language networks. These interspecies differences - particularly the human AF’s broader temporal integration and strengthened frontoparietal linkages - suggest a connectivity-based substrate for the emergence of advanced language processing unique to humans. Furthermore, our findings offer a neuroanatomical framework for understanding AF-related disorders such as aphasia and dyslexia, where aberrant connectivity disrupts language function.
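As a rough illustration of how a Kullback-Leibler analysis can quantify a divergence between two connectivity profiles, the sketch below normalizes connection weights into probability distributions and computes KL(P || Q). The region weights are hypothetical placeholders, not data from the study, and the paper's actual pipeline may differ.

```python
import math
from typing import List, Sequence


def normalize(weights: Sequence[float]) -> List[float]:
    """Scale non-negative connection weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(P || Q) in nats; eps guards against division by zero in Q."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


if __name__ == "__main__":
    # Hypothetical connection strengths to four target regions.
    human = normalize([4.0, 3.0, 2.0, 1.0])
    macaque = normalize([1.0, 1.0, 4.0, 4.0])
    print(kl_divergence(human, macaque))
```

KL divergence is asymmetric, so KL(human || macaque) and KL(macaque || human) generally differ; which direction is reported is an analysis choice.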

eess.AS [Back]

[115] Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

Jun Wang,Xijuan Zeng,Chunyu Qiang,Ruilong Chen,Shiyao Wang,Le Wang,Wangjing Zhou,Pengfei Cai,Jiahui Zhao,Nan Li,Zihan Li,Yuzhe Liang,Xiaopeng Wang,Haorui Zheng,Ming Wen,Kang Yin,Yiran Wang,Nan Li,Feng Deng,Liang Dong,Chen Zhang,Di Zhang,Kun Gai

Main category: eess.AS

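TL;DR: Kling-Foley is a multimodal video-to-audio generation model. It uses a multimodal diffusion transformer together with visual semantic representation and audio-visual synchronization modules to improve synchronization and semantic alignment, and adds a universal latent audio codec and stereo rendering to improve audio quality and spatial presence.

Details Motivation: Existing video-to-audio generation methods still fall short in semantic alignment, audio-visual synchronization, and audio quality. Kling-Foley aims to address these issues through multimodal modeling and dedicated alignment modules.

Contribution: 1. Proposes multimodal diffusion transformers to model interactions among video, audio, and text; 2. introduces visual semantic representation and audio-visual synchronization modules to improve alignment; 3. proposes a universal latent audio codec and a stereo rendering method; 4. open-sources an industrial-level benchmark, Kling-Audio-Eval.

Method: Combines a multimodal diffusion transformer with a visual semantic representation module and an audio-visual synchronization module. Frame-level alignment of video conditions with latent audio, together with text conditions, enables generation of sound that matches the video. The model is trained with the flow matching objective.

Result: Experiments show that Kling-Foley achieves SOTA performance among public models in distribution matching, semantic alignment, temporal alignment, and audio quality.

Insight: Multimodal modeling and precise alignment modules are key to improving video-to-audio generation quality. The universal audio codec and stereo rendering provide flexibility across scenarios such as sound effects, speech, singing, and music.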

Abstract: We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. In Kling-Foley, we introduce multimodal diffusion transformers to model the interactions between video, audio, and text modalities, and combine it with a visual semantic representation module and an audio-visual synchronization module to enhance alignment capabilities. Specifically, these modules align video conditions with latent audio elements at the frame level, thereby improving semantic alignment and audio-visual synchronization. Together with text conditions, this integrated approach enables precise generation of video-matching sound effects. In addition, we propose a universal latent audio codec that can achieve high-quality modeling in various scenarios such as sound effects, speech, singing, and music. We employ a stereo rendering method that imbues synthesized audio with a spatial presence. At the same time, in order to make up for the incomplete types and annotations of the open-source benchmark, we also open-source an industrial-level benchmark Kling-Audio-Eval. Our experiments show that Kling-Foley trained with the flow matching objective achieves new audio-visual SOTA performance among public models in terms of distribution matching, semantic alignment, temporal alignment and audio quality.
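For readers unfamiliar with the flow matching objective mentioned above, one widely used form is sketched here. It assumes a linear interpolation path between a noise sample $x_0$ and a data sample $x_1$ (here, a latent audio representation), with $c$ denoting the video and text conditions; the abstract does not state which variant Kling-Foley uses, so the symbols and the path are assumptions.

$$
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\,\bigl\| v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\|_2^2
$$

Under this formulation the network $v_\theta$ regresses the constant velocity $x_1 - x_0$ along the path, and sampling integrates $v_\theta$ from $t = 0$ to $t = 1$.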

cs.GR [Back]

[116] SOF: Sorted Opacity Fields for Fast Unbounded Surface Reconstruction

Lukas Radl,Felix Windisch,Thomas Deixelberger,Jozef Hladky,Michael Steiner,Dieter Schmalstieg,Markus Steinberger

Main category: cs.GR

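TL;DR: SOF is a surface reconstruction method for unbounded scenes based on 3D Gaussian representations. It combines hierarchical resorting, a robust formulation of Gaussian depth, level-set regularization, and a parallelized Marching Tetrahedra algorithm to improve both reconstruction accuracy and efficiency.

Details Motivation: Existing reconstruction methods based on 3D Gaussians struggle to extract accurate surfaces, particularly in large-scale unbounded environments, because approximate depth estimates and global sorting heuristics introduce artifacts.

Contribution: 1. Proposes hierarchical resorting and a robust formulation of Gaussian depth; 2. introduces a level-set regularizer and losses that encourage geometrically consistent primitive shapes; 3. develops a parallelized Marching Tetrahedra algorithm.

Method: 1. Hierarchical resorting and the new depth formulation improve alignment with the level set; 2. a level-set regularizer on the opacity field and geometric losses improve mesh quality; 3. a parallelized Marching Tetrahedra algorithm speeds up meshing by up to an order of magnitude.

Result: SOF achieves higher reconstruction accuracy than existing methods while cutting total processing time by more than a factor of three.

Insight: Pairing efficient Gaussian-based rendering with equally efficient geometry extraction offers a practical route to accurate surface reconstruction of large-scale unbounded scenes.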

Abstract: Recent advances in 3D Gaussian representations have significantly improved the quality and efficiency of image-based scene reconstruction. Their explicit nature facilitates real-time rendering and fast optimization, yet extracting accurate surfaces - particularly in large-scale, unbounded environments - remains a difficult task. Many existing methods rely on approximate depth estimates and global sorting heuristics, which can introduce artifacts and limit the fidelity of the reconstructed mesh. In this paper, we present Sorted Opacity Fields (SOF), a method designed to recover detailed surfaces from 3D Gaussians with both speed and precision. Our approach improves upon prior work by introducing hierarchical resorting and a robust formulation of Gaussian depth, which better aligns with the level-set. To enhance mesh quality, we incorporate a level-set regularizer operating on the opacity field and introduce losses that encourage geometrically-consistent primitive shapes. In addition, we develop a parallelized Marching Tetrahedra algorithm tailored to our opacity formulation, reducing meshing time by up to an order of magnitude. As demonstrated by our quantitative evaluation, SOF achieves higher reconstruction accuracy while cutting total processing time by more than a factor of three. These results mark a step forward in turning efficient Gaussian-based rendering into equally efficient geometry extraction.

[117] Virtual Memory for 3D Gaussian Splatting

Jonathan Haberl,Philipp Fleck,Clemens Arth

Main category: cs.GR

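TL;DR: Proposes a method that uses virtual memory techniques to render 3D Gaussian Splatting scenes, dynamically streaming only the visible Gaussians to reduce memory usage and accelerate rendering.

Details Motivation: 3D Gaussian Splatting performs well for scene reconstruction, but memory and rendering speed become bottlenecks as scenes grow in size and complexity. The paper addresses this with virtual memory techniques.

Contribution: 1. Applies virtual memory and virtual texturing techniques to identify and dynamically stream visible Gaussians; 2. integrates level of detail (LOD) to further improve rendering speed; 3. evaluates the method on desktop and mobile devices.

Method: Building on virtual memory and virtual texturing, the method identifies visible Gaussians and streams them to the GPU just in time for rendering, and combines this with LOD.

Result: Memory usage is reduced and rendering is accelerated, especially for highly complex scenes. The method is evaluated on both desktop and mobile devices.

Insight: Virtual memory techniques are promising for 3D rendering and can alleviate the memory and performance problems of complex scenes.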

Abstract: 3D Gaussian Splatting represents a breakthrough in the field of novel view synthesis. It establishes Gaussians as core rendering primitives for highly accurate real-world environment reconstruction. Recent advances have drastically increased the size of scenes that can be created. In this work, we present a method for rendering large and complex 3D Gaussian Splatting scenes using virtual memory. By leveraging well-established virtual memory and virtual texturing techniques, our approach efficiently identifies visible Gaussians and dynamically streams them to the GPU just in time for real-time rendering. Selecting only the necessary Gaussians for both storage and rendering results in reduced memory usage and effectively accelerates rendering, especially for highly complex scenes. Furthermore, we demonstrate how level of detail can be integrated into our proposed method to further enhance rendering speed for large-scale scenes. With an optimized implementation, we highlight key practical considerations and thoroughly evaluate the proposed technique and its impact on desktop and mobile devices.

[118] Uncovering Conceptual Blindspots in Generative Image Models Using Sparse Autoencoders

Matyas Bohacek,Thomas Fel,Maneesh Agrawala,Ekdeep Singh Lubana

Main category: cs.GR

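TL;DR: The paper proposes a systematic method for identifying and quantifying "conceptual blindspots" in generative image models, meaning concepts that are present in the training data but absent or misrepresented in the model's generations. It uses sparse autoencoders (SAEs) to extract interpretable concept embeddings for quantitative analysis.

Details Motivation: Generative image models often fail on seemingly simple concepts (such as human hands or objects in groups of four), but it is unclear whether these failures are idiosyncratic anomalies or structural limitations. The paper aims to identify and quantify these conceptual blindspots systematically.

Contribution: 1. Proposes a systematic method for identifying and quantifying conceptual blindspots. 2. Trains an archetypal SAE (RA-SAE) with 32,000 concepts, the largest such SAE to date. 3. Finds specific "suppressed" and "exaggerated" blindspots in generative models.

Method: 1. Uses SAEs to extract interpretable concept embeddings. 2. Trains the RA-SAE on DINOv2 features to enable a quantitative comparison of concept prevalence between real and generated images. 3. Analyzes the conceptual blindspots of four popular generative models (Stable Diffusion 1.5/2.1, PixArt, and Kandinsky).

Result: 1. Finds concepts that are suppressed (e.g., bird feeders, DVD discs) or exaggerated (e.g., wood background texture, palm trees) in the models' generations. 2. Isolates memorization artifacts at the individual datapoint level, where models reproduce highly specific visual templates seen during training.

Insight: Conceptual blindspots reflect the gap between a generative model and the underlying data-generating process, and SAE-derived concept embeddings offer a new tool for model interpretability.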

Abstract: Despite their impressive performance, generative image models trained on large-scale datasets frequently fail to produce images with seemingly simple concepts – e.g., human hands or objects appearing in groups of four – that are reasonably expected to appear in the training data. These failure modes have largely been documented anecdotally, leaving open the question of whether they reflect idiosyncratic anomalies or more structural limitations of these models. To address this, we introduce a systematic approach for identifying and characterizing “conceptual blindspots” – concepts present in the training data but absent or misrepresented in a model’s generations. Our method leverages sparse autoencoders (SAEs) to extract interpretable concept embeddings, enabling a quantitative comparison of concept prevalence between real and generated images. We train an archetypal SAE (RA-SAE) on DINOv2 features with 32,000 concepts – the largest such SAE to date – enabling fine-grained analysis of conceptual disparities. Applied to four popular generative models (Stable Diffusion 1.5/2.1, PixArt, and Kandinsky), our approach reveals specific suppressed blindspots (e.g., bird feeders, DVD discs, and whitespaces on documents) and exaggerated blindspots (e.g., wood background texture and palm trees). At the individual datapoint level, we further isolate memorization artifacts – instances where models reproduce highly specific visual templates seen during training. Overall, we propose a theoretically grounded framework for systematically identifying conceptual blindspots in generative models by assessing their conceptual fidelity with respect to the underlying data-generating process.
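One simple way to picture the comparison of concept prevalence between real and generated images is a per-concept log ratio of firing frequencies. The sketch below is an illustrative stand-in, not the paper's actual metric: the function names, the firing threshold, and the toy activations are all assumptions.

```python
import math
from typing import List, Sequence


def concept_prevalence(
    activations: Sequence[Sequence[float]], threshold: float = 0.0
) -> List[float]:
    """Fraction of images in which each concept fires.

    activations[i][k] is the (hypothetical) SAE activation of concept k
    on image i; a concept "fires" when its activation exceeds threshold.
    """
    n_images = len(activations)
    n_concepts = len(activations[0])
    return [
        sum(1 for image in activations if image[k] > threshold) / n_images
        for k in range(n_concepts)
    ]


def log_prevalence_ratio(
    real: Sequence[float], generated: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """log(generated / real) per concept.

    Negative values suggest a suppressed concept, positive values an
    exaggerated one; eps avoids log(0) for concepts that never fire.
    """
    return [math.log((g + eps) / (r + eps)) for r, g in zip(real, generated)]


if __name__ == "__main__":
    # Toy activations for two concepts over four images each.
    real_acts = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
    gen_acts = [[0.0, 1.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0]]
    real_prev = concept_prevalence(real_acts)
    gen_prev = concept_prevalence(gen_acts)
    print(log_prevalence_ratio(real_prev, gen_prev))
```

In the toy data the first concept would be flagged as suppressed and the second as exaggerated.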

cs.AI [Back]

[119] A Comment On “The Illusion of Thinking”: Reframing the Reasoning Cliff as an Agentic Gap

Sheraz Khan,Subha Madhavan,Kannan Natarajan

Main category: cs.AI

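TL;DR: This commentary argues that the performance collapse of reasoning models observed by Shojaee et al. stems from constraints of the experimental design rather than from an intrinsic capability limit. With tool use enabled, models solve complex problems they could not handle in the original setup.

Details Motivation: To question whether the "reasoning cliff" reported by Shojaee et al. truly reflects an intrinsic limit of model reasoning rather than a limitation of the experimental design.

Contribution: Attributes the performance drop to an "agentic gap" rather than to a reasoning deficit, and supports this view with tool-use experiments.

Method: Compares model performance in text-only generation with performance in a tool-enabled setting to show how tool use affects the ability to solve complex problems.

Result: With tool support, models move beyond the original reasoning cliff, solve problems of much higher complexity, and exhibit a hierarchy of agentic reasoning.

Insight: The "thinking" ability of these models may be underestimated. The bottleneck is the lack of tools for execution rather than a deficit in core reasoning.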

Abstract: The recent work by Shojaee et al. (2025), titled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, presents a compelling empirical finding, a reasoning cliff, where the performance of Large Reasoning Models (LRMs) collapses beyond a specific complexity threshold, which the authors posit as an intrinsic scaling limitation of Chain-of-Thought (CoT) reasoning. This commentary, while acknowledging the study’s methodological rigor, contends that this conclusion is confounded by experimental artifacts. We argue that the observed failure is not evidence of a fundamental cognitive boundary, but rather a predictable outcome of system-level constraints in the static, text-only evaluation paradigm, including tool use restrictions, context window recall issues, the absence of crucial cognitive baselines, inadequate statistical reporting, and output generation limits. We reframe this performance collapse through the lens of an agentic gap, asserting that the models are not failing at reasoning, but at execution within a profoundly restrictive interface. We empirically substantiate this critique by demonstrating a striking reversal. A model, initially declaring a puzzle impossible when confined to text-only generation, now employs agentic tools to not only solve it but also master variations of complexity far beyond the reasoning cliff it previously failed to surmount. Additionally, our empirical analysis of tool-enabled models like o4-mini and GPT-4o reveals a hierarchy of agentic reasoning, from simple procedural execution to complex meta-cognitive self-correction, which has significant implications for how we define and measure machine intelligence. The illusion of thinking attributed to LRMs is less a reasoning deficit and more a consequence of an otherwise capable mind lacking the tools for action.
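The gap between writing out a solution as text and executing a procedure can be made concrete with Tower of Hanoi, one of the puzzles used by Shojaee et al. The abstract above does not name the puzzle it tested, so the choice here is an assumption made for illustration. A short program produces a valid solution for any number of disks, while the move list itself grows as 2^n - 1.

```python
from typing import Iterable, Iterator, Tuple

Move = Tuple[str, str]


def hanoi_moves(
    n: int, source: str = "A", target: str = "C", spare: str = "B"
) -> Iterator[Move]:
    """Yield the optimal sequence of (from_peg, to_peg) moves for n disks."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, source, spare, target)
    yield (source, target)
    yield from hanoi_moves(n - 1, spare, target, source)


def is_valid_solution(n: int, moves: Iterable[Move]) -> bool:
    """Replay the moves and check that all disks end up, in order, on peg C."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            return False  # tried to move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # tried to place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))


if __name__ == "__main__":
    for n in (3, 10, 15):
        print(n, sum(1 for _ in hanoi_moves(n)), is_valid_solution(n, hanoi_moves(n)))
```

Enumerating every move in plain text quickly runs into output-length limits as n grows, whereas emitting or running a procedure like this does not, which is the kind of interface constraint the commentary points to.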

[120] Bayesian Evolutionary Swarm Architecture: A Formal Epistemic System Grounded in Truth-Based Competition

Craig Steven Wright

Main category: cs.AI

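TL;DR: The paper proposes an AI system framework grounded in Bayesian inference and population dynamics, in which probabilistic agents evolve through structured competition and belief revision. Agent fitness is defined as a function of alignment with a fixed external oracle representing ground truth, and competition drives the evolution of knowledge.

Details Motivation: To design an AI system that evolves through competition and belief revision, with verifiable knowledge as the evolutionary target, so that the system is robust and convergent.

Contribution: 1) A formal agent-competition framework based on Bayesian inference and population dynamics; 2) hash-based cryptographic identity commitments and causal inference operators; 3) formal theorems on convergence, robustness, and evolutionary stability.

Method: 1) Agents update their posterior beliefs through Bayesian inference; 2) agent ratings are adjusted via pairwise truth-aligned utility comparisons; 3) hash-based cryptographic commitments ensure identity traceability; 4) causal inference uses do-calculus.

Result: The system establishes truth as an evolutionary attractor, with adversarial epistemic pressure driving the emergence of verifiable knowledge while the swarm remains computable and self-regulating.

Insight: Truth can emerge from competition and belief revision among agents, and the formal design of the system is intended to guarantee convergence and robustness.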

Abstract: We introduce a mathematically rigorous framework for an artificial intelligence system composed of probabilistic agents evolving through structured competition and belief revision. The architecture, grounded in Bayesian inference, measure theory, and population dynamics, defines agent fitness as a function of alignment with a fixed external oracle representing ground truth. Agents compete in a discrete-time environment, adjusting posterior beliefs through observed outcomes, with higher-rated agents reproducing and lower-rated agents undergoing extinction. Ratings are updated via pairwise truth-aligned utility comparisons, and belief updates preserve measurable consistency and stochastic convergence. We introduce hash-based cryptographic identity commitments to ensure traceability, alongside causal inference operators using do-calculus. Formal theorems on convergence, robustness, and evolutionary stability are provided. The system establishes truth as an evolutionary attractor, demonstrating that verifiable knowledge arises from adversarial epistemic pressure within a computable, self-regulating swarm.

[121] Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs

Liang Zeng,Yongcong Li,Yuzhen Xiao,Changshi Li,Chris Yuhao Liu,Rui Yan,Tianwen Wei,Jujie He,Xuchen Song,Yang Liu,Yahui Zhou

Main category: cs.AI

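TL;DR: The paper proposes an automated data-curation pipeline for scaling software engineering (SWE) datasets and shows experimentally that LLM performance improves with data scale.

Details Motivation: SWE tasks require large, diverse datasets to train and evaluate LLM agents, but existing datasets remain small because curation relies on manual annotation and on setting up dedicated runtime environments.

Contribution: 1. Proposes an incremental, automated data-curation pipeline; 2. builds a dataset of 10,169 task instances from 2,531 GitHub repositories; 3. finds that LLM performance continues to improve as data size increases.

Method: An automated pipeline collects task instances from GitHub and validates them in dedicated runtime environments. More than 8,000 runtime-validated training trajectories from this data are used to fine-tune the Skywork-SWE model.

Result: The model achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without verifiers or multiple rollouts, rising to 47.0% with test-time scaling techniques.

Insight: Continued scaling of data improves LLM performance on SWE tasks, with no sign of saturation observed.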

Abstract: Software engineering (SWE) has recently emerged as a crucial testbed for next-generation LLM agents, demanding inherent capabilities in two critical dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds) and long-context dependency resolution (e.g., >32k tokens). However, the data curation process in SWE remains notoriously time-consuming, as it heavily relies on manual annotation for code file filtering and the setup of dedicated runtime environments to execute and validate unit tests. Consequently, most existing datasets are limited to only a few thousand GitHub-sourced instances. To this end, we propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets. Our dataset comprises 10,169 real-world Python task instances from 2,531 distinct GitHub repositories, each accompanied by a task specified in natural language and a dedicated runtime-environment image for automated unit-test validation. We have carefully curated over 8,000 successfully runtime-validated training trajectories from our proposed SWE dataset. When fine-tuning the Skywork-SWE model on these trajectories, we uncover a striking data scaling phenomenon: the trained model’s performance for software engineering capabilities in LLMs continues to improve as the data size increases, showing no signs of saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without using verifiers or multiple rollouts, establishing a new state-of-the-art (SOTA) among the Qwen2.5-Coder-32B-based LLMs built on the OpenHands agent framework. Furthermore, with the incorporation of test-time scaling techniques, the performance further improves to 47.0% accuracy, surpassing the previous SOTA results for sub-32B parameter models. We release the Skywork-SWE-32B model checkpoint to accelerate future research.
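The pass@1 figures quoted above can be read as the fraction of tasks solved on a single attempt. As a sketch, the widely used unbiased pass@k estimator reduces to exactly that fraction when one rollout is sampled per task. The function names and the sample outcomes below are hypothetical and are not taken from the Skywork-SWE evaluation code.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n sampled attempts, c of them correct.

    Equals 1 - C(n - c, k) / C(n, k), the probability that at least one of
    k attempts drawn without replacement from the n samples is correct.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(solved: Sequence[bool]) -> float:
    """Benchmark-level pass@1 with a single rollout per task."""
    return sum(1 for s in solved if s) / len(solved)


if __name__ == "__main__":
    # Hypothetical outcomes for five tasks, one rollout each.
    outcomes = [True, False, True, False, False]
    print(mean_pass_at_1(outcomes))
    print(pass_at_k(n=10, c=3, k=1))
```

With several rollouts per task, averaging `pass_at_k` over tasks gives a lower-variance estimate than scoring a single attempt.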

[122] KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality

Baochang Ren,Shuofei Qiao,Wenhao Yu,Huajun Chen,Ningyu Zhang

Main category: cs.AI

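TL;DR: KnowRL is a knowledge-enhanced reinforcement learning method that adds a factuality reward based on knowledge verification, reducing hallucination in slow-thinking models while preserving their reasoning ability.

Details Motivation: Large language models (LLMs), especially slow-thinking models, often hallucinate severely because they cannot accurately recognize their knowledge boundaries during reasoning. The outcome-oriented reward mechanism of reinforcement learning (RL) lacks factual supervision over the thinking process, which makes the problem worse.

Contribution: 1. Proposes KnowRL, which integrates a knowledge-verified factuality reward into RL training. 2. Helps models recognize their knowledge boundaries and learn fact-based reasoning strategies. 3. Experiments show that KnowRL mitigates hallucination while maintaining reasoning ability.

Method: A factuality reward is produced through knowledge verification and integrated into the RL training process, guiding the model toward fact-based slow thinking.

Result: On three hallucination evaluation datasets and two reasoning evaluation datasets, KnowRL effectively reduces hallucination while maintaining the model's reasoning performance.

Insight: Adding a knowledge-verification reward to RL training directly rewards adherence to facts within the reasoning steps, improving factuality and reliability.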

Abstract: Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucination, outputting incorrect content due to an inability to accurately recognize knowledge boundaries during reasoning. While Reinforcement Learning (RL) can enhance complex reasoning abilities, its outcome-oriented reward mechanism often lacks factual supervision over the thinking process, further exacerbating the hallucination problem. To address the high hallucination in slow-thinking models, we propose Knowledge-enhanced RL, KnowRL. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. This targeted factual input during RL training enables the model to learn and internalize fact-based reasoning strategies. By directly rewarding adherence to facts within the reasoning steps, KnowRL fosters a more reliable thinking process. Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning capabilities. Our code is available at https://github.com/zjunlp/KnowRL.

[123] Evaluating Compliance with Visualization Guidelines in Diagrams for Scientific Publications Using Large Vision Language Models

Johannes Rückert,Louise Bloch,Christoph M. Friedrich

Main category: cs.AI

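TL;DR: The paper uses large Vision Language Models (VLMs) to check whether diagrams in scientific publications comply with data visualization guidelines, and evaluates the approach experimentally across several tasks.

Details Motivation: Diagrams in scientific publications often do not follow visualization guidelines, which leads to inaccurate or incomplete information, and automated tools for detecting such problems are lacking.

Contribution: Applies large VLMs to the automated detection of potential problems in diagrams and compares the effectiveness of different models and prompting strategies.

Method: Five open-source VLMs and five prompting strategies are compared on a set of questions derived from selected data visualization guidelines, with quantitative evaluation of model performance.

Result: The VLMs analyze diagram types, 3D effects, axes labels, lines, colors, and legends well, but cannot reliably assess image quality or tick marks and labels. Qwen2.5VL performs best among the models tested.

Insight: Large VLMs can automate part of the checking of diagrams, though there is room for improvement, and the approach can be extended to further aspects of data visualization.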

Abstract: Diagrams are widely used to visualize data in publications. The research field of data visualization deals with defining principles and guidelines for the creation and use of these diagrams, which are often not known or adhered to by researchers, leading to misinformation caused by providing inaccurate or incomplete information. In this work, large Vision Language Models (VLMs) are used to analyze diagrams in order to identify potential problems in regards to selected data visualization principles and guidelines. To determine the suitability of VLMs for these tasks, five open source VLMs and five prompting strategies are compared using a set of questions derived from selected data visualization guidelines. The results show that the employed VLMs work well to accurately analyze diagram types (F1-score 82.49 %), 3D effects (F1-score 98.55 %), axes labels (F1-score 76.74 %), lines (RMSE 1.16), colors (RMSE 1.60) and legends (F1-score 96.64 %, RMSE 0.70), while they cannot reliably provide feedback about the image quality (F1-score 0.74 %) and tick marks/labels (F1-score 46.13 %). Among the employed VLMs, Qwen2.5VL performs best, and the summarizing prompting strategy performs best for most of the experimental questions. It is shown that VLMs can be used to automatically identify a number of potential issues in diagrams, such as missing axes labels, missing legends, and unnecessary 3D effects. The approach laid out in this work can be extended for further aspects of data visualization.
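The two metrics reported above, F1-score for yes/no questions and RMSE for numeric ones, can be computed as in the sketch below. It assumes binary labels and a plain, unaveraged F1; the paper may use a macro- or weighted-average variant, and the sample answers are made up.

```python
import math
from typing import Sequence


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between reference and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical answers to "does the diagram have a legend?"
    print(f1_score([True, True, False, False], [True, False, True, False]))
    # Hypothetical answers to "how many lines are plotted?"
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

F1 suits the categorical questions (for example, whether a legend is present), while RMSE suits count-style questions such as the number of lines or colors.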