cs.CL

[1] MedBench-IT: A Comprehensive Benchmark for Evaluating Large Language Models on Italian Medical Entrance Examinations

Ruggero Marino Lazzaroni,Alessandro Angioi,Michelangelo Puliga,Davide Sanna,Roberto Marras

Main category: cs.CL

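TL;DR: MedBench-IT is a comprehensive benchmark for evaluating large language models on Italian medical university entrance examinations, covering multiple subjects and difficulty levels and providing a standardized evaluation methodology.

Motivation: LLM benchmarks for non-English languages in specialized domains remain scarce, and Italian medical education in particular lacks a standardized evaluation tool.

Contribution: MedBench-IT is the first benchmark focused on Italian medical entrance examinations. It comprises 17,410 expert-written multiple-choice questions and supports evaluation along several dimensions.

Method: The dataset is built from Edizioni Simone materials. The authors evaluate a range of LLMs, including GPT-4o and the Claude series, and analyze reproducibility, ordering bias, and the effect of question readability.

Result: Model performance varies by subject and difficulty. Reproducibility tests show 88.86% response consistency, and question readability has a statistically significant but small inverse correlation with performance.

Insight: MedBench-IT fills a gap in Italian-language medical benchmarking and gives EdTech developers a practical reference for model deployment.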

Abstract: Large language models (LLMs) show increasing potential in education, yet benchmarks for non-English languages in specialized domains remain scarce. We introduce MedBench-IT, the first comprehensive benchmark for evaluating LLMs on Italian medical university entrance examinations. Sourced from Edizioni Simone, a leading preparatory materials publisher, MedBench-IT comprises 17,410 expert-written multiple-choice questions across six subjects (Biology, Chemistry, Logic, General Culture, Mathematics, Physics) and three difficulty levels. We evaluated diverse models, including proprietary LLMs (GPT-4o, Claude series) and resource-efficient open-source alternatives (<30B parameters), focusing on practical deployability. Beyond accuracy, we conducted rigorous reproducibility tests (88.86% response consistency, varying by subject), ordering bias analysis (minimal impact), and reasoning prompt evaluation. We also examined correlations between question readability and model performance, finding a statistically significant but small inverse relationship. MedBench-IT provides a crucial resource for the Italian NLP community, EdTech developers, and practitioners, offering insights into current capabilities and a standardized evaluation methodology for this critical domain.
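The reproducibility figure in the abstract is a response-consistency rate. The abstract does not define the metric, so the following is only a minimal sketch under one plausible assumption: a question counts as consistent when every repeated run picks the same option. The function name and the data are hypothetical.

```python
from collections import defaultdict

def response_consistency(runs):
    """Fraction of questions answered identically in every run.

    `runs` is a list of dicts mapping question id -> chosen option.
    Assumption: "consistent" means all runs agree; the paper may use
    a different definition (for example, pairwise agreement).
    """
    answers = defaultdict(set)
    for run in runs:
        for qid, choice in run.items():
            answers[qid].add(choice)
    if not answers:
        return 0.0
    consistent = sum(1 for choices in answers.values() if len(choices) == 1)
    return consistent / len(answers)

# Hypothetical data: three runs over four questions.
runs = [
    {"q1": "A", "q2": "C", "q3": "B", "q4": "D"},
    {"q1": "A", "q2": "C", "q3": "E", "q4": "D"},
    {"q1": "A", "q2": "C", "q3": "B", "q4": "D"},
]
rate = response_consistency(runs)  # q3 differs in one run, so 3 of 4 agree
print(rate)
```

Computing the same rate per subject would show the by-subject variation the abstract mentions.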

[2] Toward Purpose-oriented Topic Model Evaluation enabled by Large Language Models

Zhiyin Tan,Jennifer D’Souza

Main category: cs.CL

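TL;DR: This study proposes an LLM-based framework for automatically evaluating dynamically evolving topic models, addressing the limited semantic explanatory power of traditional metrics.

Motivation: Topic models are essential for organizing and retrieving scholarly content in digital library systems, but traditional metrics such as coherence and diversity capture only narrow statistical patterns and fail to explain semantic failures.

Contribution: A purpose-oriented evaluation framework with nine LLM-based metrics covering four dimensions of topic quality: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness.

Method: The framework is validated through adversarial and sampling-based protocols and applied across datasets (news, scholarly publications, social media) and multiple topic modeling methods.

Result: The LLM-based metrics provide interpretable, robust, and task-relevant assessments and uncover topic model weaknesses that traditional metrics miss, such as redundancy and semantic drift.

Insight: LLM-based metrics support scalable, fine-grained evaluation tools for maintaining topic relevance in dynamic datasets.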

Abstract: This study presents a framework for automated evaluation of dynamically evolving topic models using Large Language Models (LLMs). Topic modeling is essential for organizing and retrieving scholarly content in digital library systems, helping users navigate complex and evolving knowledge domains. However, widely used automated metrics, such as coherence and diversity, often capture only narrow statistical patterns and fail to explain semantic failures in practice. We introduce a purpose-oriented evaluation framework that employs nine LLM-based metrics spanning four key dimensions of topic quality: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness. The framework is validated through adversarial and sampling-based protocols, and is applied across datasets spanning news articles, scholarly publications, and social media posts, as well as multiple topic modeling methods and open-source LLMs. Our analysis shows that LLM-based metrics provide interpretable, robust, and task-relevant assessments, uncovering critical weaknesses in topic models such as redundancy and semantic drift, which are often missed by traditional metrics. These results support the development of scalable, fine-grained evaluation tools for maintaining topic relevance in dynamic datasets. All code and data supporting this work are accessible at https://github.com/zhiyintan/topic-model-LLMjudgment.

[3] DischargeSim: A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge

Zonghai Yao,Michael Sun,Won Seok Jang,Sunjae Kwon,Soie Kwon,Hong Yu

Main category: cs.CL

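TL;DR: DischargeSim is a new benchmark that evaluates LLMs as discharge educators, focusing on personalized dialogue, document generation, and patient comprehension. Experiments show large performance differences across LLMs, and model size does not always translate into better education outcomes.

Motivation: Discharge communication is a critical yet underexplored part of patient care, and existing LLM benchmarks do not cover the need for educational support after the visit.

Contribution: The DischargeSim benchmark, a first step toward evaluating LLMs on discharge education, covering multi-turn dialogue, personalized documents, and patient comprehension.

Method: Multi-turn doctor-patient conversations are simulated with patients of diverse psychosocial profiles and assessed along three axes: dialogue quality, document generation, and patient comprehension.

Result: The 18 LLMs evaluated differ widely in performance on DischargeSim, and model size is not the deciding factor for education outcomes.

Insight: Discharge education involves trade-offs in strategy use and content prioritization, and a model's ability to personalize matters more than scale alone.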

Abstract: Discharge communication is a critical yet underexplored component of patient care, where the goal shifts from diagnosis to education. While recent large language model (LLM) benchmarks emphasize in-visit diagnostic reasoning, they fail to evaluate models’ ability to support patients after the visit. We introduce DischargeSim, a novel benchmark that evaluates LLMs on their ability to act as personalized discharge educators. DischargeSim simulates post-visit, multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with diverse psychosocial profiles (e.g., health literacy, education, emotion). Interactions are structured across six clinically grounded discharge topics and assessed along three axes: (1) dialogue quality via automatic and LLM-as-judge evaluation, (2) personalized document generation including free-text summaries and structured AHRQ checklists, and (3) patient comprehension through a downstream multiple-choice exam. Experiments across 18 LLMs reveal significant gaps in discharge education capability, with performance varying widely across patient profiles. Notably, model size does not always yield better education outcomes, highlighting trade-offs in strategy use and content prioritization. DischargeSim offers a first step toward benchmarking LLMs in post-visit clinical education and promoting equitable, personalized patient support.

[4] Rule-Based Moral Principles for Explaining Uncertainty in Natural Language Generation

Zahra Atf,Peter R Lewis

Main category: cs.CL

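TL;DR: The paper proposes a framework based on rule-based moral principles for handling uncertainty in LLM-generated text, using rules drawn from moral psychology and virtue ethics (such as precaution, deference, and responsibility) to guide system behavior.

Motivation: When LLMs are used in high-stakes settings, explaining uncertainty is both a technical and an ethical problem. Probabilistic methods are often opaque and fall short of expectations of transparency.

Contribution: A transparent, lightweight framework that handles uncertainty in LLM-generated text through moral rules (such as precaution, deference, and responsibility), with the aim of improving trust and interpretability.

Method: Moral rules are encoded in a Prolog engine, where the uncertainty level triggers a corresponding action, and scenario-based simulations assess rule coverage, fairness, and trust calibration.

Result: Use cases in clinical and legal domains illustrate how the approach can improve trust and interpretability.

Insight: Moral reasoning can serve as a transparent, lightweight alternative to probabilistic models for socially responsible natural language generation.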

Abstract: Large language models (LLMs) are increasingly used in high-stakes settings, where explaining uncertainty is both technical and ethical. Probabilistic methods are often opaque and misaligned with expectations of transparency. We propose a framework based on rule-based moral principles for handling uncertainty in LLM-generated text. Using insights from moral psychology and virtue ethics, we define rules such as precaution, deference, and responsibility to guide responses under epistemic or aleatoric uncertainty. These rules are encoded in a lightweight Prolog engine, where uncertainty levels (low, medium, high) trigger aligned system actions with plain-language rationales. Scenario-based simulations benchmark rule coverage, fairness, and trust calibration. Use cases in clinical and legal domains illustrate how moral reasoning can improve trust and interpretability. Our approach offers a transparent, lightweight alternative to probabilistic models for socially responsible natural language generation.
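The abstract describes uncertainty levels (low, medium, high) that trigger system actions with plain-language rationales. The authors encode their rules in Prolog; the sketch below uses a Python lookup table instead, purely to illustrate the shape of such a rule base. Which principle fires at which level, the action names, and the rationale wording are all illustrative assumptions, not the paper's rules.

```python
from typing import NamedTuple

class Decision(NamedTuple):
    principle: str
    action: str
    rationale: str

# Hypothetical rule table: the pairing of levels, principles, and actions
# is an assumption made for illustration, not the paper's rule base.
RULES = {
    "low": Decision(
        "responsibility", "answer",
        "Uncertainty is low, so the system answers and stands behind the answer."),
    "medium": Decision(
        "precaution", "answer_with_caveat",
        "Uncertainty is moderate, so the answer comes with an explicit caveat."),
    "high": Decision(
        "deference", "defer_to_human",
        "Uncertainty is high, so the system defers to a qualified human."),
}

def decide(uncertainty_level: str) -> Decision:
    """Look up the action and plain-language rationale for an uncertainty level."""
    try:
        return RULES[uncertainty_level]
    except KeyError:
        raise ValueError(f"unknown uncertainty level: {uncertainty_level!r}") from None

decision = decide("high")
print(f"{decision.action}: {decision.rationale}")
```

Because every outcome is a table entry with an attached rationale, the mapping can be audited directly, which is the transparency argument the abstract makes against opaque probabilistic methods.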

[5] LLM Analysis of 150+ years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade

Aida Kostikova,Ole Pütz,Steffen Eger,Olga Sabelfeld,Benjamin Paassen

Main category: cs.CL

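TL;DR: The paper uses LLMs to analyze more than 150 years of German parliamentary debates on migration, finding a shift from high post-war solidarity to anti-solidarity over the last decade and demonstrating the potential of LLMs for political text analysis.

Motivation: Studying political speech has traditionally required extensive manual annotation, which limits the scope of analysis. LLMs can partially automate complex annotation tasks and so make large-scale historical data easier to analyze.

Contribution: 1) An extensive evaluation of multiple LLMs on annotating (anti-)solidarity subtypes in German parliamentary debates; 2) evidence of high solidarity toward migrants in the German post-war period and a trend toward anti-solidarity in recent years.

Method: Multiple LLMs (varying in model size, prompting, and fine-tuning) annotate the debates. Their output is compared with thousands of human reference annotations, and the analysis covers historical versus contemporary data and systematic errors.

Result: The annotations show a high degree of solidarity toward migrants in the post-war period and a strong trend toward anti-solidarity since 2015.

Insight: LLMs show promise for political text analysis, and the findings underline the importance of migration debates in Germany, where demographic decline and labor shortages coexist with rising polarization.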

Abstract: Migration has been a core topic in German political debate, from millions of expellees after World War II through labor migration to refugee movements in the recent past. Studying political speech regarding such wide-ranging phenomena in depth traditionally required extensive manual annotations, limiting the scope of analysis to small subsets of the data. Large language models (LLMs) have the potential to partially automate even complex annotation tasks. We provide an extensive evaluation of multiple LLMs in annotating (anti-)solidarity subtypes in German parliamentary debates compared to a large set of thousands of human reference annotations (gathered over a year). We evaluate the influence of model size, prompting differences, fine-tuning, and historical versus contemporary data, and we investigate systematic errors. Beyond methodological evaluation, we also interpret the resulting annotations through a social science lens, gaining deeper insight into (anti-)solidarity trends towards migrants in the German post-World War II period and recent past. Our data reveals a high degree of migrant-directed solidarity in the postwar period, as well as a strong trend towards anti-solidarity in the German parliament since 2015, motivating further research. These findings highlight the promise of LLMs for political text analysis and the importance of migration debates in Germany, where demographic decline and labor shortages coexist with rising polarization.

[6] Causal Attention with Lookahead Keys

Zhuoqing Song,Peng Sun,Huizhuo Yuan,Quanquan Gu

Main category: cs.CL

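TL;DR: CASTLE is a new causal attention mechanism that continually updates each token's keys (lookahead keys) so that they integrate information from later tokens while preserving the autoregressive property. It outperforms standard causal attention on language modeling.

Motivation: In standard causal attention, each token's query, key, and value are static and encode only the preceding context. The authors aim to improve model performance by updating keys dynamically so that they integrate more information.

Contribution: The CASTLE mechanism, which uses lookahead keys to integrate later information while preserving the autoregressive property, together with a derived mathematical equivalence that enables efficient parallel training.

Method: Static keys are replaced with lookahead keys that are updated as the context unfolds, so each token's key changes with context while causality is preserved. The mathematical equivalence avoids explicitly materializing lookahead keys at every position.

Result: On language modeling benchmarks, CASTLE consistently reduces validation perplexity and outperforms standard causal attention on downstream tasks.

Insight: Later context can be integrated by changing the attention mechanism itself without giving up the autoregressive property, and the mathematical equivalence is what makes training efficient.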

Abstract: In standard causal attention, each token’s query, key, and value (QKV) are static and encode only preceding context. We introduce CAuSal aTtention with Lookahead kEys (CASTLE), an attention mechanism that continually updates each token’s keys as the context unfolds. We term these updated keys lookahead keys because they belong to earlier positions yet integrate information from tokens that appear later relative to those positions, while strictly preserving the autoregressive property. Although the mechanism appears sequential, we derive a mathematical equivalence that avoids explicitly materializing lookahead keys at each position and enables efficient parallel training. On language modeling benchmarks, CASTLE consistently outperforms standard causal attention across model scales, reducing validation perplexity and improving performance on a range of downstream tasks.
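For reference, the baseline that the abstract contrasts against can be written down in a few lines. The sketch below is standard single-head causal attention in plain Python, not CASTLE: each key is computed once and never updated, and position i attends only to positions 0..i. The toy vectors are hypothetical, and the lookahead-key update itself is not reproduced here because the abstract does not specify it.

```python
import math

def causal_attention(q, k, v):
    """Single-head causal attention over lists of equal-length vectors.

    Position i attends only to positions 0..i, and each key k[j] stays
    fixed once computed (the static-key behaviour of standard attention).
    """
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Scaled dot-product scores against keys at positions <= i only.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible values.
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(len(v[0]))])
    return out

# Hypothetical toy inputs: three positions, two dimensions.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(q, k, v)
print(out[0])  # the first position can only see itself
```

Changing the key or value at the last position leaves the outputs at earlier positions untouched, which is the autoregressive property that CASTLE is said to preserve.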

[7] PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions

Yixuan Tang,Yi Yang,Ahmed Abbasi

Main category: cs.CL

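TL;DR: PersonaFuse is a personality activation-driven framework that uses a dynamic routing network and a Mixture-of-Expert architecture to let LLMs express adaptive personalities in different situations, substantially improving social-emotional intelligence while preserving general reasoning ability and safety.

Motivation: Existing LLMs are limited in emotional perception and social competence and cannot adapt their communication style and emotional expression to different contexts. PersonaFuse is proposed to address this.

Contribution: 1. A framework based on Trait Activation Theory and the Big Five personality model; 2. a Mixture-of-Expert architecture combined with a dynamic routing network; 3. improved social-emotional intelligence without sacrificing general capabilities.

Method: A Mixture-of-Expert architecture combines persona adapters with a dynamic routing network to adjust the LLM's personality expression to the context.

Result: Experiments show that PersonaFuse outperforms baseline models on social-emotional intelligence and delivers consistent improvements in applications such as mental health counseling and customer service, while preserving safety and reasoning ability.

Insight: A theory-driven personality activation mechanism offers a new route toward more human-centric AI systems and highlights the importance of contextual adaptation.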

Abstract: Recent advancements in Large Language Models (LLMs) demonstrate remarkable capabilities across various fields. These developments have led to more direct communication between humans and LLMs in various situations, such as social companionship and psychological support. However, LLMs often exhibit limitations in emotional perception and social competence during real-world conversations. These limitations partly originate from their inability to adapt their communication style and emotional expression to different social and task contexts. In this work, we introduce PersonaFuse, a novel LLM post-training framework that enables LLMs to adapt and express different personalities for varying situations. Inspired by Trait Activation Theory and the Big Five personality model, PersonaFuse employs a Mixture-of-Expert architecture that combines persona adapters with a dynamic routing network, enabling contextual trait expression. Experimental results show that PersonaFuse substantially outperforms baseline models across multiple dimensions of social-emotional intelligence. Importantly, these gains are achieved without sacrificing general reasoning ability or model safety, which remain common limitations of direct prompting and supervised fine-tuning approaches. PersonaFuse also delivers consistent improvements in downstream human-centered applications, such as mental health counseling and review-based customer service. Finally, human preference evaluations against leading LLMs, including GPT-4o and DeepSeek, demonstrate that PersonaFuse achieves competitive response quality despite its comparatively smaller model size. These findings demonstrate that PersonaFuse offers a theoretically grounded and practical approach for developing social-emotional enhanced LLMs, marking a significant advancement toward more human-centric AI systems.
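The combination of persona adapters with a dynamic routing network can be pictured as a gated mixture. The sketch below is a generic soft Mixture-of-Experts step in plain Python, offered as an assumption about the general shape rather than PersonaFuse's actual architecture: a router scores each adapter from a context vector, and the adapter outputs are added to the hidden state in proportion to the gate weights. The adapters, router weights, and vectors are hypothetical toys.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(context, router_weights):
    """Gate weights: softmax over one linear score per adapter."""
    logits = [sum(w * c for w, c in zip(row, context)) for row in router_weights]
    return softmax(logits)

def mix_adapters(hidden, gates, adapters):
    """Add the gate-weighted sum of adapter outputs to the hidden state."""
    mixed = list(hidden)
    for gate, adapter in zip(gates, adapters):
        delta = adapter(hidden)
        mixed = [m + gate * d for m, d in zip(mixed, delta)]
    return mixed

# Hypothetical toy adapters standing in for two opposing trait experts.
adapters = [
    lambda h: [0.5 * x for x in h],
    lambda h: [-0.5 * x for x in h],
]
router_weights = [[1.0, 0.0], [0.0, 1.0]]  # one row of router weights per adapter
context = [2.0, 0.0]                       # context favours the first adapter
gates = route(context, router_weights)
output = mix_adapters([1.0, 1.0], gates, adapters)
print(gates, output)
```

With a different context vector the gates shift toward the other adapter, which is the sense in which trait expression becomes contextual.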

[8] Talking with Oompa Loompas: A novel framework for evaluating linguistic acquisition of LLM agents

Sankalp Tattwadarshi Swain,Anshika Krishnatray,Dhruv Kumar,Jagat Sesh Challa

Main category: cs.CL

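TL;DR: The paper proposes a new framework that studies the language acquisition ability of LLM agents by evaluating them in conversation with a bot that understands only a newly constructed language (Tinkatongue). The agents fail to establish a conversation within 100 responses, but their strategies mirror features of human language learning.

Motivation: Existing studies focus on vocabulary learning, morphological rule induction, and similar skills, but none assess whether LLMs can acquire a language through pattern recognition and interactive feedback, a central feature of human language acquisition.

Contribution: A new framework that is the first to evaluate whether LLMs can acquire a language through interactive feedback, along with evidence that their strategies resemble human learning.

Method: An experimental framework in which an LLM agent converses with a bot that understands only the new language (Tinkatongue), so that its acquisition process can be observed.

Result: LLM agents fail to establish a conversation within 100 responses but adopt strategies similar to those of human language learners.

Insight: The results point to a need for new evaluation benchmarks and may inform model designs that learn more effectively from interactive feedback.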

Abstract: Existing evaluation studies on linguistic competence of large language models (LLM agents) have focused primarily on vocabulary learning, morphological rule induction, syntactic generalization, pragmatic inference, and cross-linguistic transfer. However, none assess whether LLM agents can acquire a language through pattern recognition and interactive feedback, a central feature of human language acquisition. We propose a novel experimental framework in which an LLM agent is evaluated on its ability to acquire and use a newly constructed language (Tinkatongue) in conversation with a bot that understands only Tinkatongue. Our findings show that LLM agents fail to establish a conversation within 100 responses, yet they adopt distinct strategies that mirror human approaches to language learning. The results suggest a new direction for evaluation benchmarks and open pathways to model designs that learn more effectively from interactive feedback.

[9] The Role of Exploration Modules in Small Language Models for Knowledge Graph Question Answering

Yi-Jie Cheng,Oscar Chew,Yun-Nung Chen

Main category: cs.CL

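TL;DR: The paper studies how efficient exploration modules can combine small language models (SLMs) with knowledge graphs (KGs) to improve question answering, addressing the reliance of existing methods on large models.

Motivation: Combining LLMs with knowledge graphs can mitigate hallucination, but existing methods rely on large or proprietary models, which limits accessibility. The authors therefore investigate how to integrate knowledge graphs efficiently into small language models.

Contribution: Lightweight exploration modules that perform knowledge graph traversal in place of the language model, improving the performance of small models on KG question answering.

Method: Simple, efficient exploration modules take over knowledge graph traversal from the language model, reducing the burden on the small model.

Result: Experiments show that the approach effectively improves small language models on KG question answering tasks.

Insight: Lightweight exploration modules can compensate for small models' limited ability to traverse and reason over knowledge graphs, offering a workable option for resource-constrained settings.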

Abstract: Integrating knowledge graphs (KGs) into the reasoning processes of large language models (LLMs) has emerged as a promising approach to mitigate hallucination. However, existing work in this area often relies on proprietary or extremely large models, limiting accessibility and scalability. In this study, we investigate the capabilities of existing integration methods for small language models (SLMs) in KG-based question answering and observe that their performance is often constrained by their limited ability to traverse and reason over knowledge graphs. To address this limitation, we propose leveraging simple and efficient exploration modules to handle knowledge graph traversal in place of the language model itself. Experiment results demonstrate that these lightweight modules effectively improve the performance of small language models on knowledge graph question answering tasks. Source code: https://github.com/yijie-cheng/SLM-ToG/.
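To make the idea of an exploration module concrete, the sketch below performs a beam-style traversal over a toy knowledge graph without calling a language model. The scoring function (word overlap between the question and the relation name), the graph, and the question are hypothetical stand-ins; the abstract does not say which lightweight modules the authors use.

```python
def score(question, relation):
    """Toy relevance score: word overlap between question and relation name."""
    q_words = set(question.lower().replace("?", "").split())
    r_words = set(relation.lower().split("_"))
    return len(q_words & r_words)

def explore(kg, start, question, depth=2, beam=2):
    """Beam-style traversal: keep the `beam` best-scoring paths per hop."""
    paths = [(0, [start])]
    for _ in range(depth):
        candidates = []
        for total, path in paths:
            for relation, tail in kg.get(path[-1], []):
                candidates.append((total + score(question, relation),
                                   path + [relation, tail]))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[0], reverse=True)
        paths = candidates[:beam]
    return [path for _, path in paths]

# Hypothetical toy graph: entity -> list of (relation, tail entity).
kg = {
    "Marie Curie": [("born_in", "Warsaw"), ("field", "Physics")],
    "Warsaw": [("capital_of", "Poland"), ("located_on", "Vistula")],
}
question = "Marie Curie was born in a city that is the capital of which country?"
paths = explore(kg, "Marie Curie", question)
print(paths[0])
```

In a full pipeline the retrieved paths would be handed to the small language model as context, so the model only has to read the evidence rather than navigate the graph itself.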

[10] VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

Zheng Wu,Heyuan Huang,Xingyu Lou,Xiangmou Qu,Pengzhou Cheng,Zongru Wu,Weiwen Liu,Weinan Zhang,Jun Wang,Zhaoxiang Wang,Zhuosheng Zhang

Main category: cs.CL

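TL;DR: VeriOS proposes a query-driven human-agent-GUI interaction framework and trains a trustworthy OS agent with a two-stage learning paradigm, raising task success in untrustworthy scenarios without compromising normal performance.

Motivation: Existing OS agents perform well in idealized settings but risk over-execution in untrustworthy real-world scenarios, so a more reliable interaction framework is needed.

Contribution: 1. A query-driven human-agent-GUI interaction framework; 2. VeriOS-Agent, trained with a two-stage learning paradigm that decouples and utilizes meta-knowledge; 3. a substantial gain in success rate in untrustworthy scenarios without compromising normal performance.

Method: 1. A query-driven interaction framework in which the agent proactively queries humans in untrustworthy scenarios; 2. a two-stage learning paradigm (decoupling and utilization of meta-knowledge).

Result: VeriOS-Agent improves the average step-wise success rate in untrustworthy scenarios by 20.64% over the state of the art, without compromising normal performance.

Insight: 1. Proactive querying makes the agent more reliable in untrustworthy environments; 2. decoupling meta-knowledge improves the agent's generalizability and scalability.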

Abstract: With the rapid progress of multimodal large language models, operating system (OS) agents become increasingly capable of automating tasks through on-device graphical user interfaces (GUIs). However, most existing OS agents are designed for idealized settings, whereas real-world environments often present untrustworthy conditions. To mitigate risks of over-execution in such scenarios, we propose a query-driven human-agent-GUI interaction framework that enables OS agents to decide when to query humans for more reliable task completion. Built upon this framework, we introduce VeriOS-Agent, a trustworthy OS agent trained with a two-stage learning paradigm that facilitates the decoupling and utilization of meta-knowledge. Concretely, VeriOS-Agent autonomously executes actions in normal conditions while proactively querying humans in untrustworthy scenarios. Experiments show that VeriOS-Agent improves the average step-wise success rate by 20.64% in untrustworthy scenarios over the state-of-the-art, without compromising normal performance. Analysis highlights VeriOS-Agent’s rationality, generalizability, and scalability. The code, datasets, and models are available at https://github.com/Wuzheng02/VeriOS.

[11] Avoiding Knowledge Edit Skipping in Multi-hop Question Answering with Guided Decomposition

Yi Liu,Xiangrong Zhu,Xiangyu Liu,Wei Wei,Wei Hu

Main category: cs.CL

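TL;DR: The paper proposes IRAKE, a new knowledge editing method that uses guided decomposition to address "edit skipping" in multi-hop question answering and improves editing effectiveness.

Motivation: Knowledge in LLMs becomes outdated quickly and retraining is costly, so knowledge editing (KE) methods that do not modify parameters are needed. Existing retrieval-augmented generation (RAG) based methods excel at editing simple knowledge but break down in multi-hop question answering because of edit skipping.

Contribution: IRAKE, which guides decomposition with single edited facts and entire edited cases to address edit skipping in multi-hop question answering. Experiments show that it outperforms existing methods.

Method: Iterative retrieval-augmented generation is combined with a decomposition strategy in which single edited facts and entire edited cases guide how the question is decomposed, so that the relevant edited fact is not skipped during inference.

Result: IRAKE mitigates editing failures caused by edit skipping in multi-hop question answering and outperforms state-of-the-art methods.

Insight: Edit skipping stems in part from a mismatch between the granularity at which LLMs solve problems and the granularity of facts in the edited memory, and guided decomposition can mitigate it.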

Abstract: In a rapidly evolving world where information updates swiftly, knowledge in large language models (LLMs) becomes outdated quickly. Retraining LLMs is not a cost-effective option, making knowledge editing (KE) without modifying parameters particularly necessary. We find that although existing retrieval-augmented generation (RAG)-based KE methods excel at editing simple knowledge, they struggle with KE in multi-hop question answering due to the issue of “edit skipping”, which refers to skipping the relevant edited fact in inference. In addition to the diversity of natural language expressions of knowledge, edit skipping also arises from the mismatch between the granularity of LLMs in problem-solving and the facts in the edited memory. To address this issue, we propose a novel Iterative Retrieval-Augmented Knowledge Editing method with guided decomposition (IRAKE) through the guidance from single edited facts and entire edited cases. Experimental results demonstrate that IRAKE mitigates the failure of editing caused by edit skipping and outperforms state-of-the-art methods for KE in multi-hop question answering.
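Edit skipping is easiest to see on a toy example. The sketch below resolves a two-hop question one hop at a time and consults an edit memory at every hop. It is a deliberately simplified illustration, not IRAKE: the question is already decomposed into relations, whereas the paper's contribution is guiding an LLM to produce a decomposition that matches the edited facts. All names and facts are hypothetical.

```python
def answer_multi_hop(start, relations, base_facts, edit_memory):
    """Resolve a chain of relations hop by hop, preferring edited facts."""
    entity = start
    trace = []
    for relation in relations:
        key = (entity, relation)
        if key in edit_memory:
            # A retrieved edit overrides the stale fact for this hop.
            entity, source = edit_memory[key], "edit"
        else:
            entity, source = base_facts[key], "base"
        trace.append((key, entity, source))
    return entity, trace

# Hypothetical stale knowledge and one edit (the CEO has changed).
base_facts = {
    ("Company X", "ceo"): "Alice",
    ("Alice", "citizen_of"): "Canada",
    ("Bob", "citizen_of"): "France",
}
edit_memory = {("Company X", "ceo"): "Bob"}

# "What is the citizenship of the CEO of Company X?"
answer, trace = answer_multi_hop(
    "Company X", ["ceo", "citizen_of"], base_facts, edit_memory)
stale_answer, _ = answer_multi_hop(
    "Company X", ["ceo", "citizen_of"], base_facts, {})
print(answer, stale_answer)
```

If the first hop skips the edit, the chain continues from the outdated entity and the final answer is wrong even though the second hop is resolved correctly, which is why the granularity of each sub-question has to line up with the facts in the edit memory.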

[12] BALI: Enhancing Biomedical Language Representations through Knowledge Graph and Language Model Alignment

Andrey Sakhovskiy,Elena Tutubalina

Main category: cs.CL

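TL;DR: BALI is a novel joint language model (LM) and knowledge graph (KG) pre-training method that augments an LM with external knowledge by simultaneously learning a dedicated KG encoder and aligning the two sets of representations.

Motivation: Existing biomedical LMs show limited comprehension of complex, domain-specific concept structures and of the factual information in knowledge graphs, so a method is needed that brings external KG knowledge into the LM.

Contribution: The BALI framework, which aligns a KG and an LM during pre-training using local subgraphs of the UMLS KG as cross-modal positive samples, improving LM performance and the quality of entity representations.

Method: (1) Biomedical concept mentions in a text sequence are linked to the UMLS KG; (2) local KG subgraphs serve as cross-modal positive samples; (3) the LM and the KG encoder are trained jointly to align their representations.

Result: Experiments show that BALI improves the performance of biomedical LMs such as PubMedBERT and BioLinkBERT on language understanding tasks.

Insight: Even minimal pre-training on a small alignment dataset improves both understanding and entity representation quality, which points to the value of integrating external knowledge.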

Abstract: In recent years, there has been substantial progress in using pretrained Language Models (LMs) on a range of tasks aimed at improving the understanding of biomedical texts. Nonetheless, existing biomedical LLMs show limited comprehension of complex, domain-specific concept structures and the factual information encoded in biomedical Knowledge Graphs (KGs). In this work, we propose BALI (Biomedical Knowledge Graph and Language Model Alignment), a novel joint LM and KG pre-training method that augments an LM with external knowledge by the simultaneous learning of a dedicated KG encoder and aligning the representations of both the LM and the graph. For a given textual sequence, we link biomedical concept mentions to the Unified Medical Language System (UMLS) KG and utilize local KG subgraphs as cross-modal positive samples for these mentions. Our empirical findings indicate that implementing our method on several leading biomedical LMs, such as PubMedBERT and BioLinkBERT, improves their performance on a range of language understanding tasks and the quality of entity representations, even with minimal pre-training on a small alignment dataset sourced from PubMed scientific abstracts.
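The abstract speaks of cross-modal positive samples but does not name the training objective. One common way to align two encoders with positive pairs is an InfoNCE-style contrastive loss, sketched below in plain Python. Treat the choice of loss, the temperature, and the toy vectors as assumptions for illustration rather than BALI's actual objective.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(text_vecs, graph_vecs, temperature=0.1):
    """Mean InfoNCE loss where text_vecs[i] and graph_vecs[i] are positives.

    Every other graph vector in the batch acts as a negative for mention i.
    """
    losses = []
    for i, t in enumerate(text_vecs):
        logits = [dot(t, g) / temperature for g in graph_vecs]
        # Stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true positive.
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)

# Hypothetical embeddings: mention vectors from the LM, subgraph vectors
# from the KG encoder.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]    # each mention matches its own subgraph
shuffled = [[0.0, 1.0], [1.0, 0.0]]   # positives swapped
loss_aligned = info_nce(text, aligned)
loss_shuffled = info_nce(text, shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is close to zero when each mention embedding sits next to its own subgraph embedding and grows when the pairing is wrong, so minimizing it pulls the two representation spaces together.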

[13] MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval

Xixi Wu,Yanchao Tan,Nan Hou,Ruiyang Zhang,Hong Cheng

Main category: cs.CL

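TL;DR: MoLoRAG is a logic-aware retrieval framework that builds a page graph to capture contextual relationships in multi-page documents and combines semantic and logical relevance to improve retrieval, leading to better question answering in multi-modal document understanding.

Motivation: Traditional methods convert documents to text and lose multi-modal information such as figures, while existing large vision-language models (LVLMs) cannot handle multi-page documents because of input size limits. MoLoRAG addresses this with logic-aware retrieval.

Contribution: 1. The logic-aware retrieval framework MoLoRAG; 2. a page graph that captures contextual relationships across pages; 3. retrieval that combines semantic and logical relevance; 4. a training-free variant for easy deployment and a fine-tuned variant.

Method: 1. Build a page graph that captures relationships between pages; 2. use a lightweight VLM to retrieve pages by graph traversal; 3. select relevant pages by combining semantic and logical relevance; 4. feed the retrieved pages to an LVLM for question answering.

Result: On four DocQA datasets, MoLoRAG improves accuracy by an average of 9.68% over LVLM direct inference and retrieval precision by 7.44% over baselines.

Insight: Logical relationships are essential for multi-page document understanding, and relying on semantic relevance alone misses key pages. Combining the two improves both performance and flexibility.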

Abstract: Document Understanding is a foundational AI capability with broad applications, and Document Question Answering (DocQA) is a key evaluation task. Traditional methods convert the document into text for processing by Large Language Models (LLMs), but this process strips away critical multi-modal information like figures. While Large Vision-Language Models (LVLMs) address this limitation, their constrained input size makes multi-page document comprehension infeasible. Retrieval-augmented generation (RAG) methods mitigate this by selecting relevant pages, but they rely solely on semantic relevance, ignoring logical connections between pages and the query, which are essential for reasoning. To this end, we propose MoLoRAG, a logic-aware retrieval framework for multi-modal, multi-page document understanding. By constructing a page graph that captures contextual relationships between pages, a lightweight VLM performs graph traversal to retrieve relevant pages, including those with logical connections often overlooked. This approach combines semantic and logical relevance to deliver more accurate retrieval. After retrieval, the top-K pages are fed into arbitrary LVLMs for question answering. To enhance flexibility, MoLoRAG offers two variants: a training-free solution for easy deployment and a fine-tuned version to improve logical relevance checking. Experiments on four DocQA datasets demonstrate average improvements of 9.68% in accuracy over LVLM direct inference and 7.44% in retrieval precision over baselines. Code and datasets are released at https://github.com/WxxShirley/MoLoRAG.
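The retrieval step described in the abstract (traverse a page graph, combine semantic and logical relevance, keep the top-K pages) can be sketched as follows. This is a toy under stated assumptions: the logical relevance scores are a fixed lookup standing in for the lightweight VLM's judgment, the two scores are combined by a plain average, and the graph and all numbers are hypothetical.

```python
def retrieve_pages(graph, semantic, logical, top_k=2, seeds=1, hops=1):
    """Traverse the page graph from the best semantic seeds and rank visited pages.

    graph:    page -> list of neighbouring pages
    semantic: page -> semantic relevance score in [0, 1]
    logical:  page -> logical relevance score in [0, 1]
              (a VLM's judgment in the paper; a fixed lookup here)
    """
    frontier = sorted(semantic, key=semantic.get, reverse=True)[:seeds]
    visited = set(frontier)
    for _ in range(hops):
        next_frontier = []
        for page in frontier:
            for neighbour in graph.get(page, []):
                if neighbour not in visited:
                    visited.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    # Assumed combination rule: unweighted average of the two scores.
    combined = {p: (semantic[p] + logical[p]) / 2 for p in visited}
    return sorted(combined, key=combined.get, reverse=True)[:top_k]

# Hypothetical four-page document.
graph = {1: [2, 3], 2: [1], 3: [1, 4], 4: [3]}
semantic = {1: 0.9, 2: 0.6, 3: 0.2, 4: 0.1}
logical = {1: 0.8, 2: 0.1, 3: 0.9, 4: 0.2}
pages = retrieve_pages(graph, semantic, logical)
semantic_only = sorted(semantic, key=semantic.get, reverse=True)[:2]
print(pages, semantic_only)
```

In this toy, ranking by semantic score alone would return pages 1 and 2, while the combined ranking surfaces page 3, whose low semantic score hides a strong logical link to the query.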

[14] M-BRe: Discovering Training Samples for Relation Extraction from Unlabeled Texts with Large Language Models

Zexuan Li,Hongliang Dai,Piji Li

Main category: cs.CL

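TL;DR: The paper proposes M-BRe, a framework for automatically extracting high-quality training samples for relation extraction (RE) from unlabeled text. By combining the strengths of multi-class and binary classification, it addresses LLMs' difficulties with relation semantics and computational cost.

Motivation: Manually annotating training data for relation extraction is expensive, and sentences containing the target relations can be very scarce. An efficient way to extract training samples automatically from unlabeled text is therefore needed. LLMs perform well on other NLP tasks, but in RE they struggle to capture the semantics of every relation and incur high computational overhead.

Contribution: The M-BRe framework, whose three modules (Relation Grouping, Relation Extraction, and Label Decision) combine the strengths of multi-class and binary classification to improve the discovery of high-quality training samples from unlabeled text.

Method: M-BRe has three modules: 1) Relation Grouping, which groups the predefined relation categories so that their semantics can be captured more efficiently; 2) Relation Extraction, which uses LLMs to extract candidate relations from text; 3) Label Decision, which combines multi-class and binary classification results to produce the final label.

Result: Extensive experiments confirm M-BRe's effectiveness at discovering high-quality training samples from unlabeled text.

Insight: M-BRe's modular design balances semantic understanding against computational efficiency, offering a low-cost, flexible option for relation extraction, particularly in resource-constrained settings.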

Abstract: For Relation Extraction (RE), the manual annotation of training data may be prohibitively expensive, since the sentences that contain the target relations in texts can be very scarce and difficult to find. It is therefore beneficial to develop an efficient method that can automatically extract training instances from unlabeled texts for training RE models. Recently, large language models (LLMs) have been adopted in various natural language processing tasks, with RE also benefiting from their advances. However, when leveraging LLMs for RE with predefined relation categories, two key challenges arise. First, in a multi-class classification setting, LLMs often struggle to comprehensively capture the semantics of every relation, leading to suboptimal results. Second, although employing binary classification for each relation individually can mitigate this issue, it introduces significant computational overhead, resulting in impractical time complexity for real-world applications. Therefore, this paper proposes a framework called M-BRe to extract training instances from unlabeled texts for RE. It utilizes three modules to combine the advantages of both of the above classification approaches: Relation Grouping, Relation Extraction, and Label Decision. Extensive experiments confirm its superior capability in discovering high-quality training samples from unlabeled texts for RE.
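The cost argument in the abstract comes down to counting LLM calls per sentence. The sketch below compares the two settings the abstract names with one hypothetical hybrid in which relations are split into groups, each group gets one multi-class prompt, and each group's candidate gets at most one binary confirmation. That hybrid scheme, the group size, and the relation count are assumptions for illustration; the abstract does not give M-BRe's exact procedure.

```python
import math

def calls_per_sentence(num_relations, group_size):
    """LLM calls needed per sentence under three classification schemes."""
    groups = math.ceil(num_relations / group_size)
    return {
        # One prompt that lists every relation at once.
        "multi_class": 1,
        # One yes/no prompt per relation.
        "binary": num_relations,
        # Assumed hybrid: one multi-class prompt per group, then at most
        # one binary confirmation for each group's candidate.
        "grouped_upper_bound": 2 * groups,
    }

costs = calls_per_sentence(num_relations=40, group_size=8)
print(costs)
```

Under these assumptions the hybrid needs at most 10 calls for 40 relations, against 40 for pure binary classification, while each prompt only has to distinguish 8 relations instead of all 40.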

[15] Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems

Xiaolin Chen,Xuemeng Song,Haokun Wen,Weili Guan,Xiangyu Zhao,Liqiang Nie

Main category: cs.CL

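TL;DR: The paper proposes DK2R, a dual knowledge-enhanced two-stage reasoner that combines structured attribute knowledge and unstructured review knowledge with LLMs to improve textual response generation in multimodal task-oriented dialog systems.

Motivation: Existing methods neglect unstructured review knowledge and underutilize LLMs. DK2R addresses these gaps through dynamic knowledge type selection and intention-response decoupling.

Contribution: 1) DK2R, a two-stage reasoning framework that incorporates dual knowledge; 2) dynamic assessment of each knowledge type's utility; 3) intention-oriented clues that enhance response generation.

Method: 1) Extract structured attribute and unstructured review knowledge from an external knowledge base; 2) use an LLM to evaluate each knowledge type's utility; 3) generate intention-oriented key clues through dedicated reasoning and use them as auxiliary signals for response generation.

Result: Experiments on a public dataset verify the superiority of DK2R, and the code and parameters have been released.

Insight: Combining structured and unstructured knowledge and decoupling intention from response are key to better responses in multimodal dialog systems.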

Abstract: Textual response generation is pivotal for multimodal task-oriented dialog systems, which aim to generate proper textual responses based on the multimodal context. While existing efforts have demonstrated remarkable progress, the following limitations remain: 1) neglect of unstructured review knowledge and 2) underutilization of large language models (LLMs). Inspired by this, we aim to fully utilize dual knowledge (i.e., structured attribute and unstructured review knowledge) with LLMs to promote textual response generation in multimodal task-oriented dialog systems. However, this task is non-trivial due to two key challenges: 1) dynamic knowledge type selection and 2) intention-response decoupling. To address these challenges, we propose a novel dual knowledge-enhanced two-stage reasoner by adapting LLMs for multimodal dialog systems (named DK2R). To be specific, DK2R first extracts both structured attribute and unstructured review knowledge from an external knowledge base given the dialog context. Thereafter, DK2R uses an LLM to evaluate each knowledge type’s utility by analyzing LLM-generated provisional probe responses. Moreover, DK2R separately summarizes the intention-oriented key clues via dedicated reasoning, which are further used as auxiliary signals to enhance LLM-based textual response generation. Extensive experiments conducted on a public dataset verify the superiority of DK2R. We have released the code and parameters.

[16] Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost

Mihai Nadas,Laura Diosan,Andreea Tomescu,Andrei Piscoran

Main category: cs.CL

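TL;DR: The paper introduces the TINYFABULIST TRANSLATION FRAMEWORK (TF2), a framework for low-resource literary translation with small open models. Using two-stage fine-tuning and high-quality generated datasets, TF2 is competitive in quality with large proprietary models while being cheaper and more open.

Motivation: Literary translation is a complex, resource-intensive task, especially for low-resource languages. Large proprietary models are costly and opaque, and the performance of small open models has not been adequately validated. TF2 aims to fill this gap.

Contribution: 1. The TF2 framework, covering dataset creation, fine-tuning, and evaluation; 2. the datasets DS-TF2-EN-RO-3M and DS-TF2-EN-RO-15K; 3. two-stage fine-tuning (instruction tuning and adapter compression) of a small model; 4. public release of all scripts and evaluation prompts.

Method: 1. Generate 15k high-quality reference translations with a high-performing LLM; 2. apply two-stage fine-tuning to a 12B-parameter open-weight model: instruction tuning and adapter compression; 3. evaluate with BLEU and a five-dimension LLM-based rubric (accuracy, fluency, coherence, style, cultural adaptation).

Result: On English-Romanian literary translation, TF2 achieves fluency and adequacy competitive with large proprietary models at significantly lower cost.

Insight: With high-quality data and targeted fine-tuning, small open models can approach large models on specific tasks such as literary translation, offering a viable path for low-resource languages and culturally significant content.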

Abstract: Literary translation has recently gained attention as a distinct and complex task in machine translation research. However, the translation by small open models remains an open problem. We contribute to this ongoing research by introducing TINYFABULIST TRANSLATION FRAMEWORK (TF2), a unified framework for dataset creation, fine tuning, and evaluation in English-Romanian literary translations, centred on the creation and open release of both a compact, fine tuned language model (TF2-12B) and large scale synthetic parallel datasets (DS-TF2-EN-RO-3M and DS-TF2-EN-RO-15K). Building on DS-TF1-EN-3M (TF1), the largest collection of synthetic English fables to date, we address the need for rich, high quality literary datasets in low resource languages such as Romanian. Our pipeline first generates 15k high quality Romanian references from the TF1 pool using a high performing LLM. We then apply a two stage fine tuning process to a 12B parameter open weight model: (i) instruction tuning to capture genre specific narrative style, and (ii) adapter compression for efficient deployment. Evaluation combines corpus level BLEU and a five dimension LLM based rubric (accuracy, fluency, coherence, style, cultural adaptation) to provide a nuanced assessment of translation quality. Results show that our fine tuned model achieves fluency and adequacy competitive with top performing large proprietary models, while being open, accessible, and significantly more cost effective. Alongside the fine tuned model and both datasets, we publicly release all scripts and evaluation prompts. TF2 thus provides an end-to-end, reproducible pipeline for research on cost efficient translation, cross lingual narrative generation, and the broad adoption of open models for culturally significant literary content in low resource settings.

[17] Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

Tong Zheng,Hongming Zhang,Wenhao Yu,Xiaoyang Wang,Xinyu Yang,Runpeng Dai,Rui Liu,Huiwen Bao,Chengsong Huang,Heng Huang,Dong Yu

Main category: cs.CL

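TL;DR: Parallel-R1 is a reinforcement learning (RL) framework that aims to improve LLM reasoning through parallel thinking. Rather than relying on supervised fine-tuning (SFT) alone, it uses a progressive curriculum: SFT on easier tasks instills parallel thinking, and RL then explores and generalizes the skill on harder problems. Experiments show accuracy gains on math reasoning benchmarks.

Motivation: Existing methods rely mainly on SFT over synthetic data, which encourages imitation rather than exploration and generalization. The authors therefore propose an RL-based framework to activate parallel thinking for complex reasoning tasks.

Contribution: 1. Parallel-R1, the first RL framework for parallel thinking; 2. a progressive curriculum that addresses the cold-start problem in training parallel thinking; 3. experimental validation of parallel thinking on reasoning tasks.

Method: 1. Use SFT on trajectories from easier tasks to instill parallel thinking; 2. transition to RL to explore and generalize parallel thinking on harder tasks; 3. train in multiple stages and analyze how the model's parallel thinking behavior changes across them.

Result: On math benchmarks including MATH, AMC23, and AIME, Parallel-R1 improves accuracy by 8.4% over a sequential thinking model trained directly on challenging tasks with RL. Used as a mid-training exploration scaffold, parallel thinking yields a 42.9% improvement over the baseline on AIME25.

Insight: The model uses parallel thinking as an exploration strategy early in training and for multi-perspective verification later. As a temporary exploratory phase, it helps the model reach a higher performance ceiling after RL.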

Abstract: Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. In contrast, we propose Parallel-R1, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model's thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a mid-training exploration scaffold, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-sourced at https://github.com/zhengkid/Parallel-R1.

cs.CV [Back]

[18] CellPainTR: Generalizable Representation Learning for Cross-Dataset Cell Painting Analysis

Cedric Caruzzo,Jong Chul Ye

Main category: cs.CV

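TL;DR: CellPainTR is a Transformer-based architecture that learns foundational representations of cellular morphology robust to batch effects, enabling cross-dataset generalization without fine-tuning.

Details Motivation: Large-scale biological discovery requires integrating heterogeneous datasets, but technical batch effects and the lack of generalizable models are major obstacles.

Contribution: Proposes CellPainTR, whose source-specific context tokens enable generalization to unseen datasets without fine-tuning.

Method: A Transformer-based architecture with source-specific context tokens that learns robust representations of cellular morphology.

Result: Outperforms ComBat and Harmony on the JUMP dataset and maintains high performance on the unseen Bray et al. dataset.

Insight: CellPainTR's design is a step toward general foundation models for image-based profiling, improving the reliability of cross-study biological analysis.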

Abstract: Large-scale biological discovery requires integrating massive, heterogeneous datasets like those from the JUMP Cell Painting consortium, but technical batch effects and a lack of generalizable models remain critical roadblocks. To address this, we introduce CellPainTR, a Transformer-based architecture designed to learn foundational representations of cellular morphology that are robust to batch effects. Unlike traditional methods that require retraining on new data, CellPainTR’s design, featuring source-specific context tokens, allows for effective out-of-distribution (OOD) generalization to entirely unseen datasets without fine-tuning. We validate CellPainTR on the large-scale JUMP dataset, where it outperforms established methods like ComBat and Harmony in both batch integration and biological signal preservation. Critically, we demonstrate its robustness through a challenging OOD task on the unseen Bray et al. dataset, where it maintains high performance despite significant domain and feature shifts. Our work represents a significant step towards creating truly foundational models for image-based profiling, enabling more reliable and scalable cross-study biological analysis.

[19] FusWay: Multimodal hybrid fusion approach. Application to Railway Defect Detection

Alexey Zhukov,Jenny Benois-Pineau,Amira Youssef,Akka Zemmari,Mohamed Mosbah,Virginie Taillandier

Main category: cs.CV

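TL;DR: The paper proposes FusWay, a multimodal fusion architecture built on YOLOv8n and a Vision Transformer that combines image and audio data to improve the precision of railway defect detection.

Details Motivation: Single-modality detectors such as YOLO suffer from overdetection in railway defect detection, especially when defects resemble normal structural elements, so multimodal data is needed to improve accuracy.

Contribution: Proposes FusWay, a new multimodal fusion architecture that combines YOLOv8n with a Vision Transformer to fuse image and audio data and improve detection precision.

Method: The design is based on domain rules: YOLOv8n performs fast object detection, and a Vision Transformer combines feature maps from multiple layers (7, 16, and 19) with audio representations for cross-modal fusion.

Result: Compared with the vision-only approach, multimodal fusion improves precision and overall accuracy by 0.2 points, and a t-test confirms that the difference is statistically significant.

Insight: Multimodal fusion compensates for the limits of single-modality methods, particularly where visual information alone cannot separate similar-looking structural elements.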

Abstract: Multimodal fusion is a multimedia technique that has become popular across the wide range of tasks where image information is accompanied by a signal or audio. The latter may not convey highly semantic information such as speech or music, but rather measurements such as audio signals recorded by microphones in order to detect rail structure elements or defects. While classical detection approaches such as the You Only Look Once (YOLO) family of detectors can be efficiently deployed for defect detection on the image modality, single-modality approaches remain limited: they yield overdetection when a defect's appearance is similar to that of normal structural elements. The paper proposes a new multimodal fusion architecture built on the basis of domain rules with YOLO and Vision Transformer backbones. It integrates YOLOv8n for rapid object detection with a Vision Transformer (ViT) to combine feature maps extracted from multiple layers (7, 16, and 19) and synthesised audio representations for two defect classes: rail Rupture and Surface defect. Fusion is performed between audio and image. Experimental evaluation on a real-world railway dataset demonstrates that our multimodal fusion improves precision and overall accuracy by 0.2 points compared to the vision-only approach. Student's unpaired t-test also confirms the statistical significance of the differences in mean accuracy.
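The significance check mentioned in the abstract can be illustrated with a minimal sketch of Student's unpaired (equal-variance) t statistic. The function name and the accuracy values below are hypothetical and invented for illustration; they are not the paper's data, and the authors' exact test setup is not described here.

```python
import math
from statistics import mean, variance


def student_t_unpaired(a, b):
    """Return (t, degrees_of_freedom) for Student's unpaired two-sample t-test.

    Assumes both samples share the same variance (the classical pooled form).
    """
    n1, n2 = len(a), len(b)
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Made-up per-run accuracies, for illustration only (NOT the paper's numbers).
vision_only = [0.90, 0.91, 0.89, 0.90, 0.92]
fusion = [0.92, 0.93, 0.91, 0.93, 0.94]
t_stat, dof_runs = student_t_unpaired(fusion, vision_only)
```

The resulting statistic is compared against the t distribution with the returned degrees of freedom. The Python standard library has no t-distribution CDF, so in practice a routine such as `scipy.stats.ttest_ind` would be used to obtain the p-value.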

[20] FedAPT: Federated Adversarial Prompt Tuning for Vision-Language Models

Kun Zhai,Siheng Chen,Xingjun Ma,Yu-Gang Jiang

Main category: cs.CV

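TL;DR: The paper proposes FedAPT, a new method that improves the adversarial robustness of Federated Prompt Tuning (FPT) for vision-language models and addresses the class information gap that arises under non-independent and identically distributed (non-IID) settings.

Details Motivation: Existing federated prompt tuning methods are vulnerable to adversarial attacks. In non-IID settings in particular, clients have only limited local label information, which leaves the global model insufficiently robust.

Contribution: 1. Proposes FedAPT to strengthen the adversarial robustness of FPT. 2. Designs a class-aware prompt generator and a Global Label Embedding mechanism to close the class information gap. 3. Introduces a cross-layer generator sharing strategy to strengthen prompt coupling across model layers.

Method: 1. A class-aware prompt generator produces visual prompts from text prompts. 2. A Global Label Embedding guides the generator by encoding cross-client label information. 3. Cross-layer generator sharing strengthens prompt coupling across layers.

Result: On multiple image classification datasets, FedAPT outperforms existing methods by a large margin in adversarial robustness and also generalizes well in cross-domain and cross-dataset settings.

Insight: Closing the class information gap under non-IID settings is critical to adversarial robustness in federated learning; cross-layer prompt coupling and the Global Label Embedding are the key factors.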

Abstract: Federated Prompt Tuning (FPT) is an efficient method for cross-client collaborative fine-tuning of large Vision-Language Models (VLMs). However, models tuned using FPT are vulnerable to adversarial attacks, leading to misclassification in downstream tasks. In this work, we introduce Federated Adversarial Prompt Tuning (FedAPT), a novel method designed to enhance the adversarial robustness of FPT. We identify a key issue in FedAPT under non-independent and identically distributed (non-IID) settings: a class information gap between clients and the global model. Clients rely solely on limited local label information to generate adversarial samples for training, while the global model must defend against adversarial attacks from global labels. To address this issue, we propose a class-aware prompt generator that generates visual prompts from text prompts. This generator is guided by a Global Label Embedding (serving as a "beacon") which encodes cross-client label information to create more globally-aligned visual prompts. Additionally, we propose a cross-layer generator sharing strategy to enhance prompt coupling across different layers of the model, further boosting adversarial robustness. Extensive experiments on multiple image classification datasets demonstrate the superiority of FedAPT in improving adversarial robustness, outperforming existing methods by a large margin. FedAPT also exhibits exceptional generalization in cross-domain and cross-dataset scenarios, indicating its effectiveness in real-world applications.

[21] Geospatial Foundational Embedder: Top-1 Winning Solution on EarthVision Embed2Scale Challenge (CVPR 2025)

Zirui Xu,Raphael Tang,Mike Bianco,Qi Zhang,Rishi Madhok,Nikolaos Karianakis,Fuxun Yu

Main category: cs.CV

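TL;DR: This report presents Geospatial Foundational Embedder, the Top-1 winning solution of the CVPR 2025 EarthVision Embed2Scale Challenge, which embeds SSL4EO-S12 hyperspectral geospatial data cubes into vectors that support downstream tasks such as classification and regression.

Details Motivation: Embed hyperspectral geospatial data so that classification and regression tasks can be handled efficiently.

Contribution: Proposes the Geospatial Foundational Embedder method, which placed first in the Embed2Scale Challenge.

Method: Uses self-supervised learning (SSL) to generate embedding vectors from SSL4EO-S12 data, with the embedding process tuned to the characteristics of hyperspectral data.

Result: Ranked first in the challenge; the generated embeddings effectively support downstream tasks.

Insight: Embedding models for hyperspectral geospatial data can reach high performance through self-supervised learning while providing general-purpose support for downstream tasks.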

Abstract: The EarthVision Embed2Scale challenge (CVPR 2025) aims to develop foundational geospatial models that embed SSL4EO-S12 hyperspectral geospatial data cubes into embedding vectors that facilitate various downstream tasks, e.g., classification and regression. In this technical report, we introduce our proposed method for the Top-1 winning solution on the Embed2Scale Challenge.

[22] VLMs-in-the-Wild: Bridging the Gap Between Academic Benchmarks and Enterprise Reality

Srihari Bandraupalli,Anupam Purwar

Main category: cs.CV

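TL;DR: The paper proposes the ViLD framework to bridge the gap between academic benchmarks and enterprise needs. It defines ten business-critical tasks, introduces the BlockWeaver algorithm, and evaluates open-source vision-language models (VLMs) under realistic enterprise conditions.

Details Motivation: Existing academic benchmarks rely on multiple-choice questions and synthetic data and fail to reflect the complexity of enterprise applications such as social media content analysis. The paper aims to close this gap.

Contribution: 1. Proposes the ViLD framework and defines ten business-critical tasks; 2. Develops the BlockWeaver algorithm for fast, reliable comparison of unordered, variably grouped OCR outputs; 3. Builds a benchmark dataset of 7,500 real-world samples.

Method: 1. Compare unordered OCR outputs with the BlockWeaver algorithm; 2. Combine semantic matching, traditional metrics, and novel methods to measure the completeness and faithfulness of model outputs.

Result: Using ViLD, the authors benchmark open-source VLMs (Qwen, MIMO, and InternVL) against a proprietary baseline and provide a task-driven capability assessment.

Insight: ViLD offers practical guidance for deploying VLMs in enterprise settings and highlights the importance of real data and task diversity in evaluation.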

Abstract: Open-source Vision-Language Models show immense promise for enterprise applications, yet a critical disconnect exists between academic evaluation and enterprise deployment requirements. Current benchmarks rely heavily on multiple-choice questions and synthetic data, failing to capture the complexity of real-world business applications like social media content analysis. This paper introduces VLM-in-the-Wild (ViLD), a comprehensive framework to bridge this gap by evaluating VLMs on operational enterprise requirements. We define ten business-critical tasks: logo detection, OCR, object detection, human presence and demographic analysis, human activity and appearance analysis, scene detection, camera perspective and media quality assessment, dominant colors, comprehensive description, and NSFW detection. To this framework, we bring an innovative BlockWeaver Algorithm that solves the challenging problem of comparing unordered, variably-grouped OCR outputs from VLMs without relying on embeddings or LLMs, achieving remarkable speed and reliability. To demonstrate the efficacy of ViLD, we constructed a new benchmark dataset of 7,500 diverse samples, carefully stratified from a corpus of one million real-world images and videos. ViLD provides actionable insights by combining semantic matching (both embedding-based and LLM-as-a-judge approaches), traditional metrics, and novel methods to measure the completeness and faithfulness of descriptive outputs. By benchmarking leading open-source VLMs (Qwen, MIMO, and InternVL) against a powerful proprietary baseline under the ViLD framework, we provide one of the first industry-grounded, task-driven assessments of VLM capabilities, offering actionable insights for their deployment in enterprise environments.

[23] The Protocol Genome A Self Supervised Learning Framework from DICOM Headers

Jimmy Joseph

Main category: cs.CV

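TL;DR: The paper proposes Protocol Genome, a self-supervised learning framework that learns from DICOM headers and improves performance and calibration on medical imaging tasks.

Details Motivation: DICOM headers in medical imaging carry rich protocol information, and the acquisition choices they record act as latent confounders that hurt the generalization of image-only networks.

Contribution: Proposes a self-supervised method that incorporates DICOM header information, improving model performance through protocol-image contrastive learning, masked protocol prediction, and protocol-protocol translation.

Method: Models DICOM header fields together with image features using (1) protocol-image contrastive learning, (2) masked protocol prediction, and (3) protocol-protocol translation.

Result: Across tasks (chest CT triage for pulmonary embolism, brain MRI glioma grading, and chest radiograph cardiomegaly detection), Protocol Genome significantly improves external-validation AUROC and calibration.

Insight: DICOM header information helps models generalize, and the gains persist when only a small fraction of the labeled data is available, which is of practical value for clinical deployment.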

Abstract: In this paper, we introduce the Protocol Genome, a self-supervised learning system that learns correlations from DICOM headers and achieves AUROC 0.901 (vs 0.847 baseline) and ECE 0.036 (vs 0.058) on fully held-out external validation. Our method also improves calibration and robustness across modalities (CT, MRI, CXR) and vendors. Clinical imaging is funneled through PACS/DICOM, where procedure choices (scanner make/model, sequence, kernel, kVp, TR/TE, and slice thickness) have consequences for contrast, noise, and artifact. These latent confounders impede the generalization of image-only networks across sites. We consider structured DICOM headers as a label and learn protocol-aware but clinically robust image representations. Protocol Genome obtains tokenized embeddings of de-identified header fields and models them along with image features using: (1) protocol-image contrastive learning, (2) masked protocol prediction, and (3) protocol-protocol translation. With 1.26M studies (7 health systems, 31 scanners, 3 vendors; CT, MR, CR/DR), we experiment on: (A) chest CT triage for PE, (B) brain MRI glioma grading, and (C) chest radiograph cardiomegaly detection. Relative to strong SSL baselines (SimCLR, MAE) as well as ImageNet transfer, Protocol Genome (+0.046: PE, +0.058: glioma, +0.041: cardiomegaly) is associated with higher external AUROC; 25-37% calibration improvements are obtained (p < 0.01, DeLong tests). While the gains may be task-dependent, they are preserved with 10-20% of labeled data. From a clinical point of view, the technique reduces false positives at protocol borders and is applicable in a PACS (DICOM C-FIND/C-MOVE, DICOMweb QIDO/WADO). We publish a model card and deployment guide, complete with both de-identification and bias audits.
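The calibration figures above are reported as expected calibration error (ECE). As a rough illustration of what that metric measures, the sketch below computes the common equal-width binned ECE; the bin count and the function name are assumptions, and the paper's exact ECE variant is not specified here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|.

    confidences: predicted confidence in [0, 1] for each prediction.
    correct:     truthy where the corresponding prediction was right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece
```

A model that says "95% confident" on four cases but is right on only three of them has a gap of 0.20 in that bin; a lower ECE, such as the 0.036 versus 0.058 quoted in the abstract, means stated confidences track observed accuracy more closely.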

[24] Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems

Jie Zhang,Ting Xu,Gelei Deng,Runyi Hu,Han Qiu,Tianwei Zhang,Qing Guo,Ivor Tsang

Main category: cs.CV

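TL;DR: The paper studies how well vision language models (VLMs) recognize symbols across writing systems (Chinese and English). Humans remain highly robust to fragmented, fused, or partially occluded characters, but VLM performance drops sharply under these conditions, suggesting that models rely too heavily on generic visual invariances and too little on compositional priors.

Details Motivation: Examine whether VLMs share the human ability to recognize symbols that are partially occluded or recombined, in order to assess their reliability for applications such as education and accessibility.

Contribution: 1. Builds two benchmarks across writing systems (Chinese and English) that test VLMs on stimuli legible to humans but hard for models; 2. Exposes the performance limits of VLMs under specific visual perturbations; 3. Suggests directions for future model improvement.

Method: 1. Generate "visible but unreadable" stimuli by splicing, recombining, and overlaying glyphs; 2. Design experiments comparing models with humans across writing systems; 3. Analyze model outputs quantitatively and qualitatively.

Result: VLMs perform well on clean text but degrade severely on perturbed text, often producing unrelated or incoherent output. This suggests that models over-rely on visual invariances and lack compositional priors.

Insight: Future VLMs need training that emphasizes symbol segmentation, composition, and binding to be robust in complex visual scenes, which could pave the way for applications in education, accessibility, and cultural heritage.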

Abstract: Writing is a universal cultural technology that reuses vision for symbolic communication. Humans display striking resilience: we readily recognize words even when characters are fragmented, fused, or partially occluded. This paper investigates whether advanced vision language models (VLMs) share this resilience. We construct two psychophysics-inspired benchmarks across distinct writing systems, Chinese logographs and English alphabetic words, by splicing, recombining, and overlaying glyphs to yield "visible but unreadable" stimuli for models while remaining legible to humans. Despite strong performance on clean text, contemporary VLMs show a severe drop under these perturbations, frequently producing unrelated or incoherent outputs. The pattern suggests a structural limitation: models heavily leverage generic visual invariances but under-rely on compositional priors needed for robust literacy. We release stimuli generation code, prompts, and evaluation protocols to facilitate transparent replication and follow-up work. Our findings motivate architectures and training strategies that encode symbol segmentation, composition, and binding across scripts, and they delineate concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.

[25] Not All Splits Are Equal: Rethinking Attribute Generalization Across Unrelated Categories

Liviu Nicolae Fircă,Antonio Bărbălau,Dan Oneata,Elena Burceanu

Main category: cs.CV

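TL;DR: The paper studies whether models can generalize attribute knowledge to semantically and perceptually unrelated categories. Performance of existing models drops sharply as the correlation between training and test categories decreases, showing strong sensitivity to split design.

Details Motivation: Investigate whether models can generalize attribute knowledge across semantically and perceptually dissimilar categories, filling a gap in prior work on attribute reasoning between conceptually distant categories.

Contribution: Presents the first explicit evaluation of the robustness of attribute prediction as the correlation between training and test categories decreases, and proposes several split strategies that reduce hidden correlations.

Method: Introduces split strategies based on LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning, which progressively reduce the correlation between training and test sets.

Result: Performance drops sharply as the correlation between training and test categories decreases; clustering offers the best trade-off between reducing hidden correlations and preserving learnability.

Insight: Current models are limited in attribute reasoning, and future attribute reasoning benchmarks need more careful split design.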

Abstract: Can models generalize attribute knowledge across semantically and perceptually dissimilar categories? While prior work has addressed attribute prediction within narrow taxonomic or visually similar domains, it remains unclear whether current models can abstract attributes and apply them to conceptually distant categories. This work presents the first explicit evaluation for the robustness of the attribute prediction task under such conditions, testing whether models can correctly infer shared attributes between unrelated object types: e.g., identifying that the attribute “has four legs” is common to both “dogs” and “chairs”. To enable this evaluation, we introduce train-test split strategies that progressively reduce correlation between training and test sets, based on: LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning using ground-truth labels. Results show a sharp drop in performance as the correlation between training and test categories decreases, indicating strong sensitivity to split design. Among the evaluated methods, clustering yields the most effective trade-off, reducing hidden correlations while preserving learnability. These findings offer new insights into the limitations of current representations and inform future benchmark construction for attribute reasoning.
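To make "embedding similarity thresholding" concrete, here is one plausible reading, sketched under stated assumptions: a category is excluded from training when its embedding is too close (by cosine similarity) to any held-out test category. The function names, the threshold rule, and the toy 2-D embeddings are all hypothetical; the paper's actual procedure may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def threshold_split(embeddings, test_categories, max_similarity):
    """Return training categories whose similarity to every test category
    is at most `max_similarity` (an assumed rule, for illustration only)."""
    train = []
    for name, vec in embeddings.items():
        if name in test_categories:
            continue
        closest = max(cosine(vec, embeddings[t]) for t in test_categories)
        if closest <= max_similarity:
            train.append(name)
    return sorted(train)


# Toy 2-D embeddings, invented for illustration.
toy = {
    "dog": [1.0, 0.1],
    "wolf": [0.9, 0.2],
    "chair": [0.1, 1.0],
    "table": [0.2, 0.9],
}
```

With "dog" held out and a threshold of 0.5, "wolf" is dropped from training because it sits too close to the test category, while "chair" and "table" remain. Lowering the threshold removes progressively more related categories, which is the knob that reduces train-test correlation.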

[26] Human-in-the-Loop: Quantitative Evaluation of 3D Models Generation by Large Language Models

Ahmed R. Sadik,Mariusz Bujny

Main category: cs.CV

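TL;DR: The paper proposes a human-in-the-loop framework for quantitatively evaluating the geometric and structural fidelity of 3D models generated by large language models, and validates it experimentally in a CAD design setting.

Details Motivation: Large language models are increasingly able to generate complex 3D models from multimodal inputs, but robust methods for quantitatively evaluating the geometric and structural fidelity of the results are lacking.

Contribution: The main contribution is a comprehensive suite of quantitative metrics (volumetric accuracy, surface alignment, and others), shown experimentally to speed up convergence toward the ground truth.

Method: The authors propose a human-in-the-loop evaluation framework covering several input modalities (2D orthographic views, code-based prompts, and others) and design a set of similarity and complexity metrics to assess generation quality.

Result: Semantically richer inputs yield higher-fidelity generation, with code-level prompts achieving perfect reconstruction across all metrics.

Insight: Quantitative evaluation both validates the quality of generated models effectively and provides a scalable way to refine models for CAD applications.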

Abstract: Large Language Models are increasingly capable of interpreting multimodal inputs to generate complex 3D shapes, yet robust methods to evaluate geometric and structural fidelity remain underdeveloped. This paper introduces a human in the loop framework for the quantitative evaluation of LLM generated 3D models, supporting applications such as democratization of CAD design, reverse engineering of legacy designs, and rapid prototyping. We propose a comprehensive suite of similarity and complexity metrics, including volumetric accuracy, surface alignment, dimensional fidelity, and topological intricacy, to benchmark generated models against ground truth CAD references. Using an L bracket component as a case study, we systematically compare LLM performance across four input modalities: 2D orthographic views, isometric sketches, geometric structure trees, and code based correction prompts. Our findings demonstrate improved generation fidelity with increased semantic richness, with code level prompts achieving perfect reconstruction across all metrics. A key contribution of this work is demonstrating that our proposed quantitative evaluation approach enables significantly faster convergence toward the ground truth, especially compared to traditional qualitative methods based solely on visual inspection and human intuition. This work not only advances the understanding of AI assisted shape synthesis but also provides a scalable methodology to validate and refine generative models for diverse CAD applications.

[27] MEGS$^{2}$: Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning

Jiarui Chen,Yikeng Chen,Yingshuang Zou,Ye Huang,Peng Wang,Yuan Liu,Yujing Sun,Wenping Wang

Main category: cs.CV

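TL;DR: MEGS² uses spherical Gaussians and unified pruning to substantially cut the memory consumption of 3D Gaussian Splatting while preserving rendering quality.

Details Motivation: Existing 3D Gaussian Splatting methods consume a lot of memory, which limits their use on edge devices. Existing compression methods mostly target storage compression and leave the key bottleneck, rendering memory, unaddressed.

Contribution: Proposes the MEGS² framework, which achieves unprecedented memory compression by jointly optimizing the number of primitives and the number of parameters per primitive; introduces a lightweight spherical Gaussian color representation and a unified soft pruning method.

Method: 1. Replace memory-intensive spherical harmonics with spherical Gaussian lobes; 2. Use a unified soft pruning framework that models primitive-number and lobe-number pruning as a single constrained optimization problem.

Result: MEGS² achieves a 50% reduction in static VRAM and a 40% reduction in rendering VRAM, with rendering quality comparable to existing methods.

Insight: Jointly optimizing the number of primitives and parameters can substantially lower memory requirements without sacrificing quality; lightweight representations and soft pruning are promising directions for 3DGS compression.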

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a dominant novel-view synthesis technique, but its high memory consumption severely limits its applicability on edge devices. A growing number of 3DGS compression methods have been proposed to make 3DGS more efficient, yet most only focus on storage compression and fail to address the critical bottleneck of rendering memory. To address this problem, we introduce MEGS$^{2}$, a novel memory-efficient framework that tackles this challenge by jointly optimizing two key factors: the total primitive number and the parameters per primitive, achieving unprecedented memory compression. Specifically, we replace the memory-intensive spherical harmonics with lightweight arbitrarily-oriented spherical Gaussian lobes as our color representations. More importantly, we propose a unified soft pruning framework that models primitive-number and lobe-number pruning as a single constrained optimization problem. Experiments show that MEGS$^{2}$ achieves a 50% static VRAM reduction and a 40% rendering VRAM reduction compared to existing methods, while maintaining comparable rendering quality.
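To illustrate why spherical Gaussian lobes can be lighter than spherical harmonics, the sketch below evaluates view-dependent color with the standard spherical Gaussian form, amplitude · exp(sharpness · (axis · view − 1)), and counts floats per primitive. The parameterization (a base RGB color plus lobes with a 3-float unit axis, a scalar sharpness, and an RGB amplitude) is an assumption made for illustration, not the paper's confirmed layout.

```python
import math


def spherical_gaussian(view_dir, axis, sharpness, amplitude):
    """Evaluate one lobe: amplitude * exp(sharpness * (dot(axis, view_dir) - 1)).

    view_dir and axis are assumed to be unit 3-vectors; amplitude is RGB.
    """
    cos_angle = sum(a * v for a, v in zip(axis, view_dir))
    weight = math.exp(sharpness * (cos_angle - 1.0))
    return [c * weight for c in amplitude]


def sg_color(view_dir, base_color, lobes):
    """Base (view-independent) color plus the sum of all lobe contributions."""
    color = list(base_color)
    for axis, sharpness, amplitude in lobes:
        contribution = spherical_gaussian(view_dir, axis, sharpness, amplitude)
        color = [c + d for c, d in zip(color, contribution)]
    return color


# Degree-3 spherical harmonics use (3 + 1)^2 = 16 coefficients per RGB channel.
SH_DEGREE3_FLOATS = 3 * (3 + 1) ** 2


def sg_floats(num_lobes):
    """Floats under the assumed layout: base RGB + per-lobe axis(3) + sharpness(1) + RGB(3)."""
    return 3 + num_lobes * (3 + 1 + 3)
```

Under these assumptions, three lobes cost 24 floats against 48 for degree-3 spherical harmonics, and pruning lobes per primitive shrinks the footprint further, which is the second axis the unified pruning framework optimizes.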

[28] Moment- and Power-Spectrum-Based Gaussianity Regularization for Text-to-Image Models

Jisung Hwang,Jaihoon Kim,Minhyuk Sung

Main category: cs.CV

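TL;DR: The paper proposes a new regularization loss that combines moment-based regularization in the spatial domain with power-spectrum-based regularization in the spectral domain, pushing latent samples of text-to-image models toward a standard Gaussian distribution to support downstream optimization tasks.

Details Motivation: Existing Gaussianity regularizers work poorly in the latent space of text-to-image models and cannot constrain the sample distribution efficiently, which limits the results of latent-space optimization.

Contribution: Proposes a unified framework that combines moment regularization and power-spectrum regularization, reducing computational complexity and improving results.

Method: Constrains Gaussianity through moments in the spatial domain and the power spectrum in the spectral domain, applying the losses to randomly permuted inputs to ensure permutation invariance.

Result: In generative modeling, the regularizer outperforms previous Gaussianity regularization, accelerates convergence, and effectively prevents reward hacking.

Insight: Combining spatial-domain and spectral-domain regularization constrains Gaussianity more thoroughly and makes latent-space optimization more efficient.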

Abstract: We propose a novel regularization loss that enforces standard Gaussianity, encouraging samples to align with a standard Gaussian distribution. This facilitates a range of downstream tasks involving optimization in the latent space of text-to-image models. We treat elements of a high-dimensional sample as one-dimensional standard Gaussian variables and define a composite loss that combines moment-based regularization in the spatial domain with power spectrum-based regularization in the spectral domain. Since the expected values of moments and power spectrum distributions are analytically known, the loss promotes conformity to these properties. To ensure permutation invariance, the losses are applied to randomly permuted inputs. Notably, existing Gaussianity-based regularizations fall within our unified framework: some correspond to moment losses of specific orders, while the previous covariance-matching loss is equivalent to our spectral loss but incurs higher time complexity due to its spatial-domain computation. We showcase the application of our regularization in generative modeling for test-time reward alignment with a text-to-image model, specifically to enhance aesthetics and text alignment. Our regularization outperforms previous Gaussianity regularization, effectively prevents reward hacking and accelerates convergence.
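The moment-based half of the loss relies on the fact that the moments of a standard Gaussian are known in closed form: odd moments are 0 and the even moment of order k is (k − 1)!!. The sketch below shows only that idea, as a squared gap between empirical and analytic moments; the choice of orders and the unweighted sum are assumptions, and the spectral term and the random permutation step are not reproduced.

```python
def gaussian_moment(order):
    """E[x^order] for x ~ N(0, 1): 0 for odd orders, (order - 1)!! for even ones."""
    if order % 2 == 1:
        return 0.0
    result = 1.0
    for factor in range(order - 1, 0, -2):
        result *= factor
    return result


def moment_loss(sample, orders=(1, 2, 3, 4)):
    """Sum of squared gaps between empirical moments and standard-Gaussian moments.

    `sample` is a flat list of floats, treated as one-dimensional draws.
    """
    n = len(sample)
    loss = 0.0
    for k in orders:
        empirical = sum(x ** k for x in sample) / n
        loss += (empirical - gaussian_moment(k)) ** 2
    return loss
```

A latent that has drifted during optimization, for example one rescaled to three times its original magnitude, produces a much larger loss than a genuine standard-normal sample, which is the signal a regularizer of this kind would use to pull the latent back.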

[29] SAM$^{*}$: Task-Adaptive SAM with Physics-Guided Rewards

Kamyar Barakati,Utkarsh Pratiush,Sheryl L. Sanchez,Aditya Raghavan,Delia J. Milliron,Mahshid Ahmadi,Philip D. Rack,Sergei V. Kalinin

Main category: cs.CV

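TL;DR: The paper proposes SAM$^{*}$, a reward-function-based optimization approach that uses physics-guided rewards to improve the adaptability and real-time performance of SAM for microscopy image segmentation.

Details Motivation: Foundation models such as SAM are broadly applicable, but their many non-transparent tuning parameters limit their use for real-time streaming data analysis, especially in tasks that demand high precision such as microscopy image segmentation.

Contribution: Proposes SAM$^{*}$, which optimizes SAM with physics-guided reward functions, improving its performance across diverse segmentation tasks and enabling real-time use.

Method: Build reward functions from prior knowledge of the physical system (particle size distributions, geometries, and so on) and tune SAM within a reward-driven optimization framework.

Result: On microscopy image segmentation, SAM$^{*}$ shows better precision and adaptability, particularly for real-time streaming data.

Insight: Reward functions let domain knowledge such as physical models be integrated directly into model optimization, making foundation models more practical for specific tasks.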

Abstract: Image segmentation is a critical task in microscopy, essential for accurately analyzing and interpreting complex visual data. This task can be performed using custom models trained on domain-specific datasets, transfer learning from pre-trained models, or foundational models that offer broad applicability. However, foundational models often present a considerable number of non-transparent tuning parameters that require extensive manual optimization, limiting their usability for real-time streaming data analysis. Here, we introduce a reward function-based optimization to fine-tune foundational models and illustrate this approach for SAM (Segment Anything Model) framework by Meta. The reward functions can be constructed to represent the physics of the imaged system, including particle size distributions, geometries, and other criteria. By integrating a reward-driven optimization framework, we enhance SAM’s adaptability and performance, leading to an optimized variant, SAM$^{*}$, that better aligns with the requirements of diverse segmentation tasks and particularly allows for real-time streaming data segmentation. We demonstrate the effectiveness of this approach in microscopy imaging, where precise segmentation is crucial for analyzing cellular structures, material interfaces, and nanoscale features.

[30] Enhancing Classification of Streaming Data with Image Distillation

Rwad Khatib,Yehudit Aperstein

Main category: cs.CV

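TL;DR: The paper proposes Distillation Based Classification (DBC), a data-distillation approach to classifying streaming image data. Under limited resources and complex data streams it reaches 73.1% accuracy, outperforming traditional methods and the RBC technique.

Details Motivation: Efficiently classifying streaming image data under memory and compute constraints is challenging. Traditional methods such as Hoeffding Trees and random forests must be adapted to image data, and data distillation offers a new way to approach the problem.

Contribution: The main contribution is the Distillation Based Classification (DBC) method, which distills essential features to reduce computational demands while preserving classification accuracy, and outperforms traditional methods and Reservoir Sampling Based Classification.

Method: Uses data distillation to extract essential features from the stream and reduce computational load; compares against Hoeffding Trees, Adaptive Random Forest, and the RBC technique.

Result: DBC reaches 73.1% classification accuracy, surpassing the other methods compared.

Insight: Data distillation shows promise for complex data streams: it balances accuracy against resource efficiency and offers a new direction for real-time image classification.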

Abstract: This study tackles the challenge of efficiently classifying streaming data in environments with limited memory and computational resources. It delves into the application of data distillation as an innovative approach to improve the precision of streaming image data classification. By focusing on distilling essential features from data streams, our method aims to minimize computational demands while preserving crucial information for accurate classification. Our investigation compares this approach against traditional algorithms like Hoeffding Trees and Adaptive Random Forest, adapted through embeddings for image data. The Distillation Based Classification (DBC) demonstrated superior performance, achieving a 73.1% accuracy rate, surpassing both traditional methods and the Reservoir Sampling Based Classification (RBC) technique. This marks a significant advancement in streaming data classification, showcasing the effectiveness of our method in processing complex data streams and setting a new standard for accuracy and efficiency.
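For context on the RBC baseline, reservoir sampling is the classic way to keep a fixed-size, uniformly random subset of an unbounded stream in constant memory. The sketch below is the textbook Algorithm R; how the paper's RBC pipeline trains a classifier on the reservoir is not described here, and the function name is an assumption.

```python
import random


def reservoir_sample(stream, capacity, rng=None):
    """Algorithm R: keep a uniform random sample of `capacity` items from a stream.

    Memory use is O(capacity) regardless of how long the stream is.
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < capacity:
            # Fill the reservoir with the first `capacity` items.
            reservoir.append(item)
        else:
            # Item number index + 1 replaces a stored item with
            # probability capacity / (index + 1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < capacity:
                reservoir[slot] = item
    return reservoir
```

Sampling keeps raw examples and discards the rest, whereas distillation, as the abstract describes it, tries to compress the essential features of the stream into the retained set; that difference is what the DBC versus RBC comparison probes.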

[31] Automated Evaluation of Gender Bias Across 13 Large Multimodal Models

Juan Manuel Contreras

Main category: cs.CV

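TL;DR: The paper introduces the Aymara Image Fairness Evaluation benchmark and uses it to assess gender bias in 13 large multimodal models. The models not only reproduce but amplify occupational gender stereotypes, and the degree of bias varies widely across models.

Details Motivation: Large multimodal models (LMMs) excel at text-to-image generation but may perpetuate the social biases in their training data. Prior work identified gender bias but lacked large-scale, comparable, cross-model analysis.

Contribution: The main contribution is a benchmark for automated evaluation of gender bias; testing 13 commercial LMMs with it reveals that they reproduce and amplify occupational gender stereotypes.

Method: The study uses 75 procedurally generated, gender-neutral prompts to produce images of people in stereotypically male, stereotypically female, and non-stereotypical professions, then scores the 965 resulting images for gender representation with a validated LLM-as-a-judge system.

Result: LMMs significantly amplify occupational gender stereotypes (p < .001), and bias varies widely across models (overall male representation from 46.7% to 73.3%); the top-performing model approaches gender parity.

Insight: High bias is not inevitable but a consequence of design choices, which underscores the need for standardized, automated evaluation tools.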

Abstract: Large multimodal models (LMMs) have revolutionized text-to-image generation, but they risk perpetuating the harmful social biases in their training data. Prior work has identified gender bias in these models, but methodological limitations prevented large-scale, comparable, cross-model analysis. To address this gap, we introduce the Aymara Image Fairness Evaluation, a benchmark for assessing social bias in AI-generated images. We test 13 commercially available LMMs using 75 procedurally-generated, gender-neutral prompts to generate people in stereotypically-male, stereotypically-female, and non-stereotypical professions. We then use a validated LLM-as-a-judge system to score the 965 resulting images for gender representation. Our results reveal (p < .001 for all): 1) LMMs systematically not only reproduce but actually amplify occupational gender stereotypes relative to real-world labor data, generating men in 93.0% of images for male-stereotyped professions but only 22.5% for female-stereotyped professions; 2) Models exhibit a strong default-male bias, generating men 68.3% of the time for non-stereotyped professions; and 3) The extent of bias varies dramatically across models, with overall male representation ranging from 46.7% to 73.3%. Notably, the top-performing model de-amplified gender stereotypes and approached gender parity, achieving the highest fairness scores. This variation suggests high bias is not an inevitable outcome but a consequence of design choices. Our work provides the most comprehensive cross-model benchmark of gender bias to date and underscores the necessity of standardized, automated evaluation tools for promoting accountability and fairness in AI development.

[32] Faster VGGT with Block-Sparse Global Attention

Chung-Shien Brian Wang,Christian Schmidt,Jens Piekenbrinck,Bastian Leibe

Main category: cs.CV

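TL;DR: The paper proposes replacing dense global attention with block-sparse global attention, substantially speeding up inference for VGGT and π³ while preserving task performance.

Details Motivation: The Transformer-based VGGT and π³ models perform well on multi-view reconstruction, but the quadratic complexity of their global attention layers limits scalability to large image sets. Analyzing the global attention matrix, the authors find that attention concentrates on a small subset of patch-patch interactions that correspond to cross-view geometric matches.

Contribution: 1. Replaces the dense global attention operation with an optimized block-sparse mechanism, giving up to 4× faster inference; 2. Requires no retraining of the backbone, applies to both VGGT and π³, and supports large image collections.

Method: 1. Analyze the sparsity of the global attention matrix; 2. Implement efficient block-sparse kernels to replace the original dense global attention layers.

Result: On multi-view benchmarks, inference is up to 4× faster with task performance comparable to the original models.

Insight: Global attention matrices show pronounced structured sparsity, which efficient block-sparse kernels can exploit to speed up Transformer models.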

Abstract: Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT and $\pi^3$ have achieved impressive results with simple architectures, yet they face an inherent runtime bottleneck, due to the quadratic complexity of the global attention layers, that limits the scalability to large image sets. In this paper, we empirically analyze the global attention matrix of these models and observe that probability mass concentrates on a small subset of patch-patch interactions that correspond to cross-view geometric matches. Motivated by the structured attention and inspired by recent advancement in large language models, we propose a replacement for the dense global attention operation based on highly optimized block-sparse kernels, yielding up to $4\times$ faster inference with comparable task performance. Our retrofit requires no retraining of the backbone, extends to both VGGT and $\pi^3$, and supports large image collections. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate the effectiveness of our approach.
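The core idea, letting each block of queries attend only to the few key blocks that matter, can be sketched in plain Python. This is a didactic single-head version under stated assumptions: blocks are ranked by the dot product of mean-pooled queries and keys, and the top `keep_blocks` are retained. The selection rule and all names are hypothetical; the paper relies on optimized GPU kernels, and this sketch gives no speedup.

```python
import math


def mean_pool(rows):
    """Average a list of equal-length vectors into one vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def block_sparse_attention(q, k, v, block_size, keep_blocks):
    """Single-head attention where each query block attends only to its
    `keep_blocks` highest-scoring key blocks (scored on pooled vectors)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    key_blocks = [list(range(s, min(s + block_size, len(k))))
                  for s in range(0, len(k), block_size)]
    pooled_keys = [mean_pool([k[j] for j in blk]) for blk in key_blocks]

    output = []
    for start in range(0, len(q), block_size):
        query_rows = q[start:start + block_size]
        pooled_query = mean_pool(query_rows)
        # Rank key blocks by similarity of block-averaged queries and keys.
        scores = [dot(pooled_query, pk) for pk in pooled_keys]
        ranked = sorted(range(len(key_blocks)), key=lambda b: scores[b], reverse=True)
        kept = [j for b in sorted(ranked[:keep_blocks]) for j in key_blocks[b]]
        for query in query_rows:
            # Ordinary softmax attention, restricted to the kept key positions.
            logits = [dot(query, k[j]) * scale for j in kept]
            peak = max(logits)
            weights = [math.exp(x - peak) for x in logits]
            norm = sum(weights)
            output.append([sum(w * v[j][d] for w, j in zip(weights, kept)) / norm
                           for d in range(len(v[0]))])
    return output
```

When `keep_blocks` equals the number of key blocks, the function reduces to dense attention; the abstract's observation that probability mass concentrates on a small subset of patch-patch interactions is what makes dropping the remaining blocks nearly lossless.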

[33] Dimensionally Reduced Open-World Clustering: DROWCULA

Erencem Ozbey,Dimitrios I. Diochnos

Main category: cs.CV

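TL;DR: DROWCULA is a fully unsupervised approach to clustering in the open-world setting that uses Vision Transformers and manifold learning to improve clustering, achieving state-of-the-art results on several datasets.

Details Motivation: In open-world settings novel classes may appear, which traditional supervised methods handle poorly, while unsupervised methods avoid the heavy cost of labeling data.

Contribution: Proposes DROWCULA, which combines Vision Transformers with manifold learning for fully unsupervised image clustering and novel class discovery.

Method: Generate vector embeddings with Vision Transformers and refine them with manifold learning techniques to improve clustering.

Result: Achieves state-of-the-art performance on CIFAR-10, CIFAR-100, ImageNet-100, and Tiny ImageNet.

Insight: Unsupervised methods show promise for open-world problems, and combining Vision Transformers with manifold learning effectively improves clustering performance.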

Abstract: Working with annotated data is the cornerstone of supervised learning. Nevertheless, providing labels to instances is a task that requires significant human effort. Several critical real-world applications make things more complicated because no matter how many labels may have been identified in a task of interest, it could be the case that examples corresponding to novel classes may appear in the future. Not surprisingly, prior work in this so-called 'open-world' context has focused a lot on semi-supervised approaches. Focusing on image classification, somewhat paradoxically, we propose a fully unsupervised approach to the problem of determining the novel categories in a particular dataset. Our approach relies on estimating the number of clusters using Vision Transformers, which utilize attention mechanisms to generate vector embeddings. Furthermore, we incorporate manifold learning techniques to refine these embeddings by exploiting the intrinsic geometry of the data, thereby enhancing the overall image clustering performance. Overall, we establish new State-of-the-Art results on single-modal clustering and Novel Class Discovery on CIFAR-10, CIFAR-100, ImageNet-100, and Tiny ImageNet. We do so both when the number of clusters is known and when it is unknown ahead of time. The code is available at: https://github.com/DROWCULA/DROWCULA.

[34] XBusNet: Text-Guided Breast Ultrasound Segmentation via Multimodal Vision-Language Learning

Raja Mallina,Bryar Shareef

Main category: cs.CV

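TL;DR: XBusNet is a dual-prompt, dual-branch model based on multimodal vision-language learning. It combines image features with clinical text prompts for precise breast ultrasound segmentation and improves results in particular for small or low-contrast lesions.

Details Motivation: Breast ultrasound (BUS) segmentation is essential for quantitative analysis and classification, but small or low-contrast lesions remain hard to segment. Conventional methods struggle with fuzzy margins and noise; text prompts can supply clinical context, but existing approaches tend to produce coarse segmentations.

Contribution: XBusNet introduces a dual-prompt, dual-branch multimodal model that combines global semantics with local boundary modeling and uses automatically generated text prompts to refine segmentation, with the largest gains on small lesions.

Method: The model has two pathways: a global pathway based on a CLIP Vision Transformer that encodes whole-image semantics, and a local pathway based on U-Net that focuses on precise boundaries. Text prompts are assembled automatically from structured metadata, with no manual input.

Result: On the BLU dataset, XBusNet reaches a mean Dice of 0.8765 and a mean IoU of 0.8149, outperforming six baselines, with the largest gains on small lesions.

Insight: Combining global semantics with local boundary modeling is the key ingredient, and text prompts further refine the segmentation, suggesting a new direction for segmenting small and low-contrast lesions.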

Abstract: Background: Precise breast ultrasound (BUS) segmentation supports reliable measurement, quantitative analysis, and downstream classification, yet remains difficult for small or low-contrast lesions with fuzzy margins and speckle noise. Text prompts can add clinical context, but directly applying weakly localized text-image cues (e.g., CAM/CLIP-derived signals) tends to produce coarse, blob-like responses that smear boundaries unless additional mechanisms recover fine edges. Methods: We propose XBusNet, a novel dual-prompt, dual-branch multimodal model that combines image features with clinically grounded text. A global pathway based on a CLIP Vision Transformer encodes whole-image semantics conditioned on lesion size and location, while a local U-Net pathway emphasizes precise boundaries and is modulated by prompts that describe shape, margin, and Breast Imaging Reporting and Data System (BI-RADS) terms. Prompts are assembled automatically from structured metadata, requiring no manual clicks. We evaluate on the Breast Lesions USG (BLU) dataset using five-fold cross-validation. Primary metrics are Dice and Intersection over Union (IoU); we also conduct size-stratified analyses and ablations to assess the roles of the global and local paths and the text-driven modulation. Results: XBusNet achieves state-of-the-art performance on BLU, with mean Dice of 0.8765 and IoU of 0.8149, outperforming six strong baselines. Small lesions show the largest gains, with fewer missed regions and fewer spurious activations. Ablation studies show complementary contributions of global context, local boundary modeling, and prompt-based modulation. Conclusions: A dual-prompt, dual-branch multimodal design that merges global semantics with local precision yields accurate BUS segmentation masks and improves robustness for small, low-contrast lesions.
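
The two primary metrics named in the abstract, Dice and Intersection over Union, can be computed for a single binary mask pair as in the sketch below. The function name `dice_and_iou` and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the abstract does not state how the authors handle that case.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two binary masks given as nested lists of 0/1."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    inter = sum(1 for a, b in zip(p, t) if a and b)
    p_sum, t_sum = sum(1 for a in p if a), sum(1 for b in t if b)
    union = p_sum + t_sum - inter
    if union == 0:
        # Both masks empty: treated as a perfect match (assumed convention).
        return 1.0, 1.0
    dice = 2.0 * inter / (p_sum + t_sum)
    iou = inter / union
    return dice, iou
```

For a single mask pair the two scores are tied by Dice = 2·IoU / (1 + IoU). That identity does not carry over to scores averaged across images, so a reported mean Dice and mean IoU need not satisfy it.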

[35] Reconstruction Alignment Improves Unified Multimodal Models

Ji Xie,Trevor Darrell,Luke Zettlemoyer,XuDong Wang

Main category: cs.CV

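TL;DR: Reconstruction Alignment (RecA) is a resource-efficient post-training method that uses embeddings from the visual understanding encoder as dense "text prompts", providing rich supervision without extra annotation and substantially improving the generation and editing performance of unified multimodal models (UMMs).

Details Motivation: Conventional training of unified multimodal models relies on sparse image-text pairs that lack supervision for fine-grained visual details, which limits generation and understanding. RecA aims to realign a model's generation and understanding abilities through a self-supervised reconstruction loss.

Contribution: Proposes RecA, a resource-efficient post-training alignment method; shows that it substantially improves generation and editing across several UMM architectures; surpasses larger open-source models on multiple benchmarks.

Method: RecA conditions the model on its own visual understanding encoder embeddings and optimizes it to reconstruct the input image in a self-supervised way, thereby aligning understanding and generation. It applies to autoregressive, masked-autoregressive, and diffusion-based UMMs.

Result: With only 27 GPU-hours, RecA substantially improves image generation (GenEval: 0.73→0.90; DPGBench: 80.93→88.15) and editing (ImgEdit: 3.38→3.75; GEdit: 6.94→7.25).

Insight: By introducing a dense self-supervised signal at the post-training stage, RecA improves multimodal models substantially without extra annotation, demonstrating that post-training alignment can be both efficient and general.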

Abstract: Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73$\rightarrow$0.90) and DPGBench (80.93$\rightarrow$88.15), while also boosting editing benchmarks (ImgEdit 3.38$\rightarrow$3.75, GEdit 6.94$\rightarrow$7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.

[36] DEPF: A UAV Multispectral Object Detector with Dual-Domain Enhancement and Priority-Guided Mamba Fusion

Shucong Li,Zhenyu Liu,Zijie Hong,Zhiheng Zhou,Xianghai Cao

Main category: cs.CV

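TL;DR: DEPF is a Mamba-based multispectral object detector with dual-domain enhancement and priority-guided fusion. It targets three problems: low-light images, redundant information that interferes with local small-target modeling, and high computational complexity.

Details Motivation: Multispectral remote sensing object detection on UAVs is hampered by low-light images, by redundancy that disturbs local small-target modeling, and by the quadratic computational complexity of transformers.

Contribution: Designs a Dual-Domain Enhancement module (DDE) and a Priority-Guided Mamba Fusion module (PGMF), which improve low-light image quality and local target modeling, respectively.

Method: DDE enhances images with cross-scale Mamba scanning of the low-frequency components and Fourier-based detail recovery; PGMF uses priority scanning to reduce interference from redundant information.

Result: Performs well against state-of-the-art methods on the DroneVehicle and VEDAI datasets.

Insight: Combining Mamba's linear complexity with a priority-guided strategy yields efficient and accurate fusion for multispectral object detection.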

Abstract: Multispectral remote sensing object detection is one of the important applications of unmanned aerial vehicles (UAVs). However, it faces three challenges. Firstly, low-light remote sensing images reduce the complementarity during multi-modality fusion. Secondly, local small-target modeling is easily interfered with by redundant information in the fusion stage. Thirdly, due to their quadratic computational complexity, transformer-based methods are hard to apply on the UAV platform. To address these limitations, motivated by Mamba with linear complexity, a UAV multispectral object detector with dual-domain enhancement and priority-guided Mamba fusion (DEPF) is proposed. Firstly, to enhance low-light remote sensing images, a Dual-Domain Enhancement Module (DDE) is designed, which contains a Cross-Scale Wavelet Mamba (CSWM) and a Fourier Details Recovery block (FDR). CSWM applies cross-scale Mamba scanning to the low-frequency components to enhance the global brightness of images, while FDR constructs a spectrum recovery network to enhance the frequency spectra features for recovering texture details. Secondly, to enhance local target modeling and reduce the impact of redundant information during fusion, a Priority-Guided Mamba Fusion Module (PGMF) is designed. PGMF introduces the concept of priority scanning, which starts from local target features according to the priority scores obtained from modality difference. Experiments on the DroneVehicle and VEDAI datasets show that DEPF performs well on object detection compared with state-of-the-art methods. Our code is available in the supplementary material.

[37] G3CN: Gaussian Topology Refinement Gated Graph Convolutional Network for Skeleton-Based Action Recognition

Haiqing Ren,Zhongkai Luo,Heng Fan,Xiaohui Yuan,Guanchen Wang,Libo Zhang

Main category: cs.CV

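TL;DR: This paper proposes the Gaussian Topology Refinement Gated Graph Convolutional Network (G³CN) to address the difficulty of distinguishing ambiguous actions in skeleton-based action recognition.

Details Motivation: Existing graph convolutional networks (GCNs) perform well on skeleton-based action recognition but are limited in telling ambiguous actions apart, because their learned topological and spatial feature representations are insufficient.

Contribution: G³CN introduces a Gaussian filter to refine the skeleton topology graph and integrates Gated Recurrent Units (GRUs) to strengthen information propagation between skeleton points, improving recognition of ambiguous actions.

Method: G³CN refines the topology graph with a Gaussian filter and integrates GRUs into the GCN framework to improve information propagation and feature extraction.

Result: On the NTU RGB+D, NTU RGB+D 120, and NW-UCLA benchmarks, G³CN improves action recognition, particularly for ambiguous samples.

Insight: Combining a Gaussian filter with GRUs strengthens the representational power of the topology graph and thereby helps resolve ambiguity in skeleton-based action recognition.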

Abstract: Graph Convolutional Networks (GCNs) have proven to be highly effective for skeleton-based action recognition, primarily due to their ability to leverage graph topology for feature aggregation, a key factor in extracting meaningful representations. However, despite their success, GCNs often struggle to effectively distinguish between ambiguous actions, revealing limitations in the representation of learned topological and spatial features. To address this challenge, we propose a novel approach, Gaussian Topology Refinement Gated Graph Convolution (G$^{3}$CN), for distinguishing ambiguous actions in skeleton-based action recognition. G$^{3}$CN incorporates a Gaussian filter to refine the skeleton topology graph, improving the representation of ambiguous actions. Additionally, Gated Recurrent Units (GRUs) are integrated into the GCN framework to enhance information propagation between skeleton points. Our method shows strong generalization across various GCN backbones. Extensive experiments on NTU RGB+D, NTU RGB+D 120, and NW-UCLA benchmarks demonstrate that G$^{3}$CN effectively improves action recognition, particularly for ambiguous samples.

[38] Parse Graph-Based Visual-Language Interaction for Human Pose Estimation

Shibang Liu,Xuemei Xie,Guangming Shi

Main category: cs.CV

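TL;DR: This paper proposes Parse Graph-based Visual-Language interaction (PGVL), whose novel Guided Module (GM) addresses the weak responses and localization failures of existing visual-language fusion methods in occluded scenes. PGVL combines local and global features and fuses information through recursive bidirectional cross-attention to improve pose estimation.

Details Motivation: Existing visual-language fusion methods for pose estimation usually integrate global features and neglect local responses in occluded regions, which causes localization and semantic alignment failures. The paper uses the hierarchical structure of parse graphs to integrate local and global features and improve pose estimation under occlusion.

Contribution: 1. Proposes the PGVL framework, which performs multimodal visual-language fusion through a parse graph structure and a Guided Module (GM); 2. GM lets high-semantic nodes guide the feature updates of low-semantic nodes, ensuring effective fusion of diverse information; 3. Designs a network based on PGVL and validates it on major pose estimation datasets.

Method: 1. Construct modality-specific parse graphs; 2. Let visual and language features interact through recursive bidirectional cross-attention, purified by GM; 3. Combine top-down decomposition with bottom-up composition to integrate local and global information step by step.

Result: PGVL and the accompanying network are validated on major pose estimation datasets.

Insight: The hierarchical structure of parse graphs, combined with multimodal interaction, captures local features in occluded regions while using the priors supplied by language to strengthen global inference.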

Abstract: Parse graphs boost human pose estimation (HPE) by integrating context and hierarchies, yet prior work mostly focuses on single-modality modeling, ignoring the potential of multimodal fusion. Notably, language offers rich HPE priors such as spatial relations for occluded scenes, but existing visual-language fusion via global feature integration weakens occluded region responses and causes alignment and location failures. To address this issue, we propose Parse Graph-based Visual-Language interaction (PGVL) with a novel core Guided Module (GM). In PGVL, low-level nodes focus on local features, maximizing the maintenance of responses in occluded areas, and high-level nodes integrate global features to infer occluded or invisible parts. GM enables high-semantic nodes to guide the feature update of low-semantic nodes that have undergone cross-attention, ensuring effective fusion of diverse information. PGVL includes top-down decomposition and bottom-up composition. In the first stage, modality-specific parse graphs are constructed. In the next stage, recursive bidirectional cross-attention is used, purified by GM. We also design a network based on PGVL. PGVL and our network are validated on major pose estimation datasets. We will release the code soon.

[39] In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting

Taiying Peng,Jiacheng Hua,Miao Liu,Feng Lu

Main category: cs.CV

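TL;DR: The paper introduces EgoGazeVQA, a gaze-based egocentric video question answering benchmark that integrates gaze data to improve how well multimodal large language models (MLLMs) understand user intent.

Details Motivation: Existing benchmarks overlook the importance of gaze as an indicator of user intent, which limits the personalization of MLLMs on egocentric video.

Contribution: Introduces the EgoGazeVQA benchmark and develops gaze-guided intent prompting methods that significantly improve MLLM intent understanding on egocentric video.

Method: Gaze-based QA pairs are generated by MLLMs and refined by human annotators; intent prompting is improved by integrating spatial, temporal, and intent-related cues.

Result: Experiments show that existing MLLMs struggle with intent understanding, whereas the gaze-guided methods significantly improve performance; the paper also analyzes how gaze estimation accuracy affects prompting effectiveness.

Insight: Gaze data is valuable for making AI assistants more personalized and effective in egocentric settings.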

Abstract: The emergence of advanced multimodal large language models (MLLMs) has significantly enhanced AI assistants' ability to process complex information across modalities. Recently, egocentric videos, by directly capturing user focus, actions, and context in a unified coordinate, offer an exciting opportunity to enable proactive and personalized AI user experiences with MLLMs. However, existing benchmarks overlook the crucial role of gaze as an indicator of user intent. To address this gap, we introduce EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that leverages gaze information to improve the understanding of longer daily-life videos. EgoGazeVQA consists of gaze-based QA pairs generated by MLLMs and refined by human annotators. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions. In contrast, our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues. We further conduct experiments on gaze-related fine-tuning and analyze how gaze estimation accuracy impacts prompting effectiveness. These results underscore the value of gaze for more personalized and effective AI assistants in egocentric settings.

[40] GLEAM: Learning to Match and Explain in Cross-View Geo-Localization

Xudong Lu,Zhi Zheng,Yi Wan,Yongxiang Yao,Annan Wang,Renrui Zhang,Panwang Xia,Qiong Wu,Qingyun Li,Weifeng Lin,Xiangyu Zhao,Xue Yang,Hongsheng Li

Main category: cs.CV

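TL;DR: GLEAM presents a unified cross-view geo-localization framework that combines multi-view and multi-modal alignment, and introduces an explainable reasoning task, GLEAM-X, supported by a bilingual benchmark generated with multimodal large language models.

Details Motivation: Existing cross-view geo-localization methods are usually restricted to a single view or modality and lack interpretability: they cannot explain why two images match.

Contribution: 1. Proposes GLEAM-C, a unified CVGL model that supports multi-view and multi-modal alignment; 2. Introduces the explainability task GLEAM-X, with a bilingual dataset generated using multimodal large language models.

Method: 1. GLEAM-C aligns the different views and modalities with satellite imagery using a two-phase training strategy; 2. GLEAM-X uses MLLMs to generate explanatory data, and the test set is built through human revision.

Result: GLEAM-C achieves accuracy comparable to prior modality-specific methods in cross-view matching; GLEAM-X provides a benchmark for systematic evaluation of explainability.

Insight: Multi-modal alignment and interpretability are important directions for geo-localization, and multimodal large language models can improve the transparency and practicality of such systems.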

Abstract: Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. However, existing CVGL approaches are typically restricted to a single view or modality, and their direct visual matching strategy lacks interpretability: they merely predict whether two images correspond, without explaining the rationale behind the match. In this paper, we present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities (including UAV imagery, street maps, panoramic views, and ground photographs) by aligning them exclusively with satellite imagery. Our framework enhances training efficiency through optimized implementation while achieving accuracy comparable to prior modality-specific CVGL models through a two-phase training strategy. Moreover, to address the lack of interpretability in traditional CVGL methods, we leverage the reasoning capabilities of multimodal large language models (MLLMs) to propose a new task, GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning. To support this task, we construct a bilingual benchmark using GPT-4o and Doubao-1.5-Thinking-Vision-Pro to generate training and testing data. The test set is further refined through detailed human revision, enabling systematic evaluation of explainable cross-view reasoning and advancing transparency and scalability in geo-localization. Together, GLEAM-C and GLEAM-X form a comprehensive CVGL pipeline that integrates multi-modal, multi-view alignment with interpretable correspondence analysis, unifying accurate cross-view matching with explainable reasoning and advancing Geo-Localization by enabling models to better Explain And Match. Code and datasets used in this work will be made publicly accessible at https://github.com/Lucky-Lance/GLEAM.

[41] XOCT: Enhancing OCT to OCTA Translation via Cross-Dimensional Supervised Multi-Scale Feature Learning

Pooya Khosravi,Kun Han,Anthony T. Wu,Arghavan Rezvani,Zexin Feng,Xiaohui Xie

Main category: cs.CV

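TL;DR: XOCT is a deep learning framework that combines Cross-Dimensional Supervision (CDS) with a Multi-Scale Feature Fusion (MSFF) network for layer-aware retinal vascular reconstruction, improving the quality of OCT-to-OCTA translation.

Details Motivation: High-quality OCTA images are difficult and costly to acquire, and existing deep learning methods overlook vascular differences across retinal layers and struggle to reconstruct dense vascular detail, which undermines diagnostic reliability.

Contribution: Proposes the CDS module and the MSFF network: the former uses layer-wise projections as supervisory signals, and the latter sharpens vessel delineation through multi-scale feature extraction.

Method: The CDS module uses layer-wise en-face projections as supervision so the network learns a distinct representation for each retinal layer; the MSFF module combines multi-scale features with a channel reweighting strategy to capture vascular detail.

Result: Experiments on the OCTA-500 dataset show improvements, especially for the en-face projections that matter for clinical evaluation.

Insight: Combining cross-dimensional supervision with multi-scale feature fusion improves the quality of retinal vascular reconstruction and could make OCTA more accessible.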

Abstract: Optical Coherence Tomography Angiography (OCTA) and its derived en-face projections provide high-resolution visualization of the retinal and choroidal vasculature, which is critical for the rapid and accurate diagnosis of retinal diseases. However, acquiring high-quality OCTA images is challenging due to motion sensitivity and the high costs associated with software modifications for conventional OCT devices. Moreover, current deep learning methods for OCT-to-OCTA translation often overlook the vascular differences across retinal layers and struggle to reconstruct the intricate, dense vascular details necessary for reliable diagnosis. To overcome these limitations, we propose XOCT, a novel deep learning framework that integrates Cross-Dimensional Supervision (CDS) with a Multi-Scale Feature Fusion (MSFF) network for layer-aware vascular reconstruction. Our CDS module leverages 2D layer-wise en-face projections, generated via segmentation-weighted z-axis averaging, as supervisory signals to compel the network to learn distinct representations for each retinal layer through fine-grained, targeted guidance. Meanwhile, the MSFF module enhances vessel delineation through multi-scale feature extraction combined with a channel reweighting strategy, effectively capturing vascular details at multiple spatial scales. Our experiments on the OCTA-500 dataset demonstrate XOCT’s improvements, especially for the en-face projections which are significant for clinical evaluation of retinal pathologies, underscoring its potential to enhance OCTA accessibility, reliability, and diagnostic value for ophthalmic disease detection and monitoring. The code is available at https://github.com/uci-cbcl/XOCT.

[42] Bias-Aware Machine Unlearning: Towards Fairer Vision Models via Controllable Forgetting

Sai Siddhartha Chary Aylapuram,Veeraraju Elluru,Shivang Agarwal

Main category: cs.CV

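TL;DR: The paper studies bias mitigation through machine unlearning: selectively removing biased samples or feature representations to make vision models fairer, with effectiveness shown on several datasets.

Details Motivation: Deep neural networks often rely on spurious correlations in training data, leading to biased predictions in safety-critical domains. Conventional mitigation usually requires retraining or redesigning data pipelines, so this work explores machine unlearning as an efficient alternative.

Contribution: The core contribution is the "Bias-Aware Machine Unlearning" paradigm, which builds on privacy-preserving unlearning techniques and uses methods such as Gradient Ascent and LoRA to selectively remove bias and improve fairness.

Method: Building on privacy-preserving unlearning techniques, the paper evaluates strategies including Gradient Ascent, LoRA, and Teacher-Student distillation to selectively remove biased features or samples.

Result: Demographic parity improves by up to 94.86% on CUB-200-2011, 30.28% on CIFAR-10, and 97.37% on CelebA, with minimal loss of accuracy.

Insight: Machine unlearning can serve as a practical tool for improving model fairness without full retraining and with only a small effect on accuracy.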

Abstract: Deep neural networks often rely on spurious correlations in training data, leading to biased or unfair predictions in safety-critical domains such as medicine and autonomous driving. While conventional bias mitigation typically requires retraining from scratch or redesigning data pipelines, recent advances in machine unlearning provide a promising alternative for post-hoc model correction. In this work, we investigate \textit{Bias-Aware Machine Unlearning}, a paradigm that selectively removes biased samples or feature representations to mitigate diverse forms of bias in vision models. Building on privacy-preserving unlearning techniques, we evaluate various strategies including Gradient Ascent, LoRA, and Teacher-Student distillation. Through empirical analysis on three benchmark datasets, CUB-200-2011 (pose bias), CIFAR-10 (synthetic patch bias), and CelebA (gender bias in smile detection), we demonstrate that post-hoc unlearning can substantially reduce subgroup disparities, with improvements in demographic parity of up to \textbf{94.86%} on CUB-200, \textbf{30.28%} on CIFAR-10, and \textbf{97.37%} on CelebA. These gains are achieved with minimal accuracy loss and with methods scoring an average of 0.62 across the 3 settings on the joint evaluation of utility, fairness, quality, and privacy. Our findings establish machine unlearning as a practical framework for enhancing fairness in deployed vision systems without necessitating full retraining.
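
To make the fairness numbers concrete, the sketch below computes a demographic parity gap (the spread in positive-prediction rates across groups) and its relative reduction after an intervention. The abstract does not give the exact formula behind the reported percentages, so reading "improvement" as a relative reduction of the gap is an assumption, as are the function names.

```python
def demographic_parity_gap(preds, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    by_group = {}
    for p, g in zip(preds, groups):
        by_group.setdefault(g, []).append(p)
    rates = [sum(v) / len(v) for v in by_group.values()]
    return max(rates) - min(rates)

def relative_improvement(gap_before, gap_after):
    # Percentage reduction of the gap (assumed reading of "improvement").
    return 100.0 * (gap_before - gap_after) / gap_before

# Toy predictions for two groups, before and after a hypothetical unlearning step.
groups = ["a"] * 4 + ["b"] * 4
before = [1, 1, 1, 0, 0, 0, 1, 0]
after = [1, 1, 0, 0, 0, 1, 0, 0]
gap_before = demographic_parity_gap(before, groups)
gap_after = demographic_parity_gap(after, groups)
improvement = relative_improvement(gap_before, gap_after)
```

In this toy case the gap falls from 0.5 to 0.25, a 50% relative improvement. A gap of zero means every group receives positive predictions at the same rate.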

[43] ANYPORTAL: Zero-Shot Consistent Video Background Replacement

Wenshuo Gao,Xicheng Lan,Shuai Yang

Main category: cs.CV

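TL;DR: ANYPORTAL is a zero-shot video background replacement framework built on pre-trained diffusion models; it combines the strengths of video and image diffusion models to achieve high-quality video editing.

Details Motivation: Existing video generation methods struggle to give fine-grained control over video details, which limits their practical use. ANYPORTAL aims to solve the consistency and relighting problems of video background replacement in a zero-shot way.

Contribution: 1. Proposes ANYPORTAL, a zero-shot framework that combines the capabilities of video and image diffusion models; 2. Designs a Refinement Projection Algorithm that preserves foreground consistency; 3. Requires no training and runs on consumer-grade GPUs.

Method: 1. Uses the temporal prior of video diffusion models together with the relighting capability of image diffusion models; 2. Applies the Refinement Projection Algorithm for pixel-level detail manipulation; 3. Operates in a zero-shot setting, avoiding training cost.

Result: ANYPORTAL achieves high-quality video background replacement on consumer-grade GPUs while addressing foreground consistency and temporally coherent relighting.

Insight: Combining the strengths of different diffusion models enables complex video editing without training, offering an efficient and practical solution for video content creation.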

Abstract: Despite the rapid advancements in video generation technology, creating high-quality videos that precisely align with user intentions remains a significant challenge. Existing methods often fail to achieve fine-grained control over video details, limiting their practical applicability. We introduce ANYPORTAL, a novel zero-shot framework for video background replacement that leverages pre-trained diffusion models. Our framework collaboratively integrates the temporal prior of video diffusion models with the relighting capabilities of image diffusion models in a zero-shot setting. To address the critical challenge of foreground consistency, we propose a Refinement Projection Algorithm, which enables pixel-level detail manipulation to ensure precise foreground preservation. ANYPORTAL is training-free and overcomes the challenges of achieving foreground consistency and temporally coherent relighting. Experimental results demonstrate that ANYPORTAL achieves high-quality results on consumer-grade GPUs, offering a practical and efficient solution for video content creation and editing.

[44] MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification

Patrick Wienholt,Christiane Kuhl,Jakob Nikolas Kather,Sven Nebelung,Daniel Truhn

Main category: cs.CV

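TL;DR: MedicalPatchNet is a patch-based, self-explainable AI architecture for chest X-ray classification. It splits each image into patches, classifies them independently, and aggregates the predictions, which improves interpretability and pathology localization accuracy.

Details Motivation: Deep neural networks perform well at radiological image classification, but poor interpretability limits their clinical adoption. A self-explainable architecture is therefore needed to make medical AI more trustworthy.

Contribution: Proposes MedicalPatchNet, a patch-based self-explainable architecture that matches the performance of an existing model while substantially improving interpretability and pathology localization accuracy.

Method: Chest X-ray images are split into non-overlapping patches, each patch is classified independently, and the predictions are aggregated, so each patch's diagnostic contribution can be visualized without post-hoc techniques.

Result: On the CheXpert dataset, MedicalPatchNet's classification performance (AUROC 0.907) is on par with EfficientNet-B0 (0.908), while its pathology localization accuracy on the CheXlocalize dataset is higher (mean hit-rate 0.485 vs. 0.376 for Grad-CAM).

Insight: The patch-level self-explainable design makes the model more transparent and mitigates the risks of shortcut learning, providing more reliable explanations for medical AI.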

Abstract: Deep neural networks excel in radiological image classification but frequently suffer from poor interpretability, limiting clinical acceptance. We present MedicalPatchNet, an inherently self-explainable architecture for chest X-ray classification that transparently attributes decisions to distinct image regions. MedicalPatchNet splits images into non-overlapping patches, independently classifies each patch, and aggregates predictions, enabling intuitive visualization of each patch's diagnostic contribution without post-hoc techniques. Trained on the CheXpert dataset (223,414 images), MedicalPatchNet matches the classification performance (AUROC 0.907 vs. 0.908) of EfficientNet-B0 while substantially improving interpretability, with higher pathology localization accuracy (mean hit-rate 0.485 vs. 0.376 with Grad-CAM) on the CheXlocalize dataset. By providing explicit, reliable explanations accessible even to non-AI experts, MedicalPatchNet mitigates risks associated with shortcut learning, thus improving clinical trust. Our model is publicly available with reproducible training and inference scripts and contributes to safer, explainable AI-assisted diagnostics across medical imaging domains. The code is available at https://github.com/TruhnLab/MedicalPatchNet
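
The split, classify, and aggregate pipeline described in the abstract can be sketched in a few lines. Everything specific below is an assumption for illustration: the stand-in classifier `toy_patch_logit` (mean intensity minus 0.5) replaces the trained network, and averaging the patch logits is one plausible aggregation rule, since the abstract only says that predictions are aggregated.

```python
def split_into_patches(image, patch):
    """Split a 2D image (nested lists) into non-overlapping patch x patch tiles."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tiles = {}
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            tiles[(r // patch, c // patch)] = [row[c:c + patch] for row in image[r:r + patch]]
    return tiles

def toy_patch_logit(tile):
    # Hypothetical stand-in for a trained patch classifier.
    flat = [v for row in tile for v in row]
    return sum(flat) / len(flat) - 0.5

def classify_image(image, patch, patch_logit=toy_patch_logit):
    """Return the image-level logit and the per-patch logits that produced it."""
    tiles = split_into_patches(image, patch)
    logits = {pos: patch_logit(tile) for pos, tile in tiles.items()}
    image_logit = sum(logits.values()) / len(logits)  # assumed: mean aggregation
    return image_logit, logits

# A 4x4 image whose only bright region is the top-left 2x2 patch.
image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
image_logit, patch_logits = classify_image(image, patch=2)
most_influential = max(patch_logits, key=patch_logits.get)
```

Because the image-level score is a plain function of the per-patch scores, the dictionary `patch_logits` is itself the explanation: no saliency method is needed to see which region drove the decision.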

[45] LINR Bridge: Vector Graphic Animation via Neural Implicits and Video Diffusion Priors

Wenshuo Gao,Xicheng Lan,Luyao Zhang,Shuai Yang

Main category: cs.CV

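TL;DR: The paper proposes LINR Bridge, a method that automates vector graphic animation by combining implicit neural representations with video diffusion priors.

Details Motivation: Vector graphics are valued for their scalability and user-friendliness, but animating them usually takes substantial manual effort. Existing methods are limited in flexibility and animation quality, so the authors seek a more efficient and more automated solution.

Contribution: The main contributions are: (1) reconstructing vector graphics with layered implicit neural representations that preserve infinite resolution and precise color and shape constraints; (2) optimizing the neural representations with video score distillation sampling, which draws on motion priors from text-to-video diffusion models; (3) warping the vector graphics smoothly to produce vivid animation.

Method: The method has three steps: (1) reconstruct the vector graphics with layered implicit neural representations; (2) optimize those representations with video score distillation sampling driven by a pretrained text-to-video diffusion model; (3) warp the vector graphics to match the optimized representations.

Result: Experiments show that the method generates natural, vivid vector graphic animations and significantly outperforms existing techniques.

Insight: Combining implicit neural representations with the motion priors of pretrained diffusion models bridges the domain gap between vector graphics and generative models, suggesting a new route to automated, high-quality animation.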

Abstract: Vector graphics, known for their scalability and user-friendliness, provide a unique approach to visual content compared to traditional pixel-based images. Animation of these graphics, driven by the motion of their elements, offers enhanced comprehensibility and controllability but often requires substantial manual effort. To automate this process, we propose a novel method that integrates implicit neural representations with text-to-video diffusion models for vector graphic animation. Our approach employs layered implicit neural representations to reconstruct vector graphics, preserving their inherent properties such as infinite resolution and precise color and shape constraints, which effectively bridges the large domain gap between vector graphics and diffusion models. The neural representations are then optimized using video score distillation sampling, which leverages motion priors from pretrained text-to-video diffusion models. Finally, the vector graphics are warped to match the representations resulting in smooth animation. Experimental results validate the effectiveness of our method in generating vivid and natural vector graphic animations, demonstrating significant improvement over existing techniques that suffer from limitations in flexibility and animation quality.

[46] Fine-Tuning Vision-Language Models for Visual Navigation Assistance

Xiao Li,Bharat Gandhi,Ming Zhan,Mohit Nehra,Zhicheng Zhang,Yuchen Sun,Meijia Song,Naisheng Zhang,Xi Wang

Main category: cs.CV

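TL;DR: This paper presents a vision-language-model approach to indoor navigation assistance: the BLIP-2 model is fine-tuned with LoRA to generate step-by-step navigation instructions that help visually impaired people reach a target location, and a refined evaluation metric is proposed.

Details Motivation: Traditional navigation systems work poorly indoors because precise location data is lacking. The paper combines vision and language models to improve indoor navigation for visually impaired people.

Contribution: 1. Fine-tunes BLIP-2 with LoRA to generate more accurate step-by-step navigation instructions; 2. Proposes a refined evaluation metric that emphasizes directional and sequential variables to measure navigation performance more comprehensively.

Method: 1. Collect a manually annotated indoor navigation dataset; 2. Fine-tune BLIP-2 with LoRA; 3. Design a new evaluation metric that refines the BERT F1 score.

Result: Experiments show that the fine-tuned model is significantly better at generating directional instructions, overcoming limitations of the original BLIP-2 model.

Insight: LoRA can effectively improve a vision-language model on a specific task such as navigation, and the refined metric reflects practical requirements more accurately.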

Abstract: We address vision-language-driven indoor navigation to assist visually impaired individuals in reaching a target location using images and natural language guidance. Traditional navigation systems are ineffective indoors due to the lack of precise location data. Our approach integrates vision and language models to generate step-by-step navigational instructions, enhancing accessibility and independence. We fine-tune the BLIP-2 model with Low Rank Adaptation (LoRA) on a manually annotated indoor navigation dataset. We propose an evaluation metric that refines the BERT F1 score by emphasizing directional and sequential variables, providing a more comprehensive measure of navigational performance. After applying LoRA, the model significantly improved in generating directional instructions, overcoming limitations in the original BLIP-2 model.

[47] DiGS: Accurate and Complete Surface Reconstruction from 3D Gaussians via Direct SDF Learning

Wenzhi Guo,Bing Wang

Main category: cs.CV

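TL;DR: DiGS embeds Signed Distance Field (SDF) learning directly into the 3D Gaussian Splatting (3DGS) pipeline. By associating each Gaussian with a learnable SDF value, it explicitly aligns primitives with the underlying geometry and improves cross-view consistency. Together with a geometry-guided grid growth strategy, this yields more accurate and complete surface reconstruction.

Details Motivation: 3DGS excels at view synthesis, but surface reconstruction remains challenging because the representation is unstructured and lacks explicit geometric supervision. DiGS fills this gap by learning an SDF directly.

Contribution: 1. Proposes DiGS, which embeds SDF learning directly into the 3DGS pipeline and strengthens geometric consistency; 2. Designs a geometry-guided, multi-scale grid growth strategy that improves the density and coherence of coverage.

Method: 1. Associate each Gaussian with a learnable SDF value to align it explicitly with the geometry; 2. Adaptively distribute Gaussians along geometry-consistent regions under a multi-scale hierarchy.

Result: On benchmarks including DTU, Mip-NeRF 360, and Tanks & Temples, DiGS consistently improves reconstruction accuracy and completeness while retaining high rendering fidelity.

Insight: Combining SDF with 3DGS allows geometry and appearance to be optimized together, suggesting a new approach to surface reconstruction in complex scenes.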

Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful paradigm for photorealistic view synthesis, representing scenes with spatially distributed Gaussian primitives. While highly effective for rendering, achieving accurate and complete surface reconstruction remains challenging due to the unstructured nature of the representation and the absence of explicit geometric supervision. In this work, we propose DiGS, a unified framework that embeds Signed Distance Field (SDF) learning directly into the 3DGS pipeline, thereby enforcing strong and interpretable surface priors. By associating each Gaussian with a learnable SDF value, DiGS explicitly aligns primitives with underlying geometry and improves cross-view consistency. To further ensure dense and coherent coverage, we design a geometry-guided grid growth strategy that adaptively distributes Gaussians along geometry-consistent regions under a multi-scale hierarchy. Extensive experiments on standard benchmarks, including DTU, Mip-NeRF 360, and Tanks & Temples, demonstrate that DiGS consistently improves reconstruction accuracy and completeness while retaining high rendering fidelity.

[48] Generating Transferrable Adversarial Examples via Local Mixing and Logits Optimization for Remote Sensing Object Recognition

Chun Liu,Hailong Wang,Bingqian Zhu,Panpan Ding,Zheng Zheng,Tao Xu,Zhigang Han,Jiayao Wang

Main category: cs.CV

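TL;DR: The paper proposes generating transferable adversarial examples for remote sensing object recognition through local mixing and logits optimization. By preserving global semantic information and improving the loss function, it raises the transferability and success rate of adversarial attacks.

Details Motivation: Existing mixing-based attacks either blend images globally or directly exchange an image region, which can destroy semantic information, and their reliance on cross-entropy loss causes gradients to diminish. The paper addresses both problems to make adversarial examples more transferable and effective.

Contribution: 1) A local mixing strategy that generates diverse yet semantically consistent inputs; 2) adaptation of the logit loss to non-targeted attacks to avoid vanishing gradients; 3) a perturbation smoothing loss that suppresses high-frequency noise.

Method: 1) Local mixing strategy (different from MixUp and MixCut); 2) logit loss optimization; 3) perturbation smoothing loss.

Result: Experiments on the FGSCR-42 and MTARSI datasets show that the method outperforms 12 existing methods; notably, with ResNet as the surrogate on MTARSI, the black-box attack success rate improves by 17.28% on average.

Insight: Local mixing that preserves semantic information and an improved loss design are key to transferability. In the non-targeted setting, the logit loss effectively avoids vanishing gradients.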

Abstract: Deep Neural Networks (DNNs) are vulnerable to adversarial attacks, posing significant security threats to their deployment in remote sensing applications. Research on adversarial attacks not only reveals model vulnerabilities but also provides critical insights for enhancing robustness. Although current mixing-based strategies have been proposed to increase the transferability of adversarial examples, they either perform global blending or directly exchange a region in the images, which may destroy global semantic features and mislead the optimization of adversarial examples. Furthermore, their reliance on cross-entropy loss for perturbation optimization leads to gradient diminishing during iterative updates, compromising adversarial example quality. To address these limitations, we focus on non-targeted attacks and propose a novel framework via local mixing and logits optimization. First, we present a local mixing strategy to generate diverse yet semantically consistent inputs. Different from MixUp, which globally blends two images, and MixCut, which stitches images together, our method merely blends local regions to preserve global semantic information. Second, we adapt the logit loss from targeted attacks to non-targeted scenarios, mitigating the gradient vanishing problem of cross-entropy loss. Third, a perturbation smoothing loss is applied to suppress high-frequency noise and enhance transferability. Extensive experiments on FGSCR-42 and MTARSI datasets demonstrate superior performance over 12 state-of-the-art methods across 6 surrogate models. Notably, with ResNet as the surrogate on MTARSI, our method achieves a 17.28% average improvement in black-box attack success rate.
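
The contrast the abstract draws between global blending and local mixing can be shown on toy single-channel images. This is only a sketch: the abstract does not say how the local region or the blend weight is chosen, so the rectangular region and the parameters `top`, `left`, `height`, `width`, and `lam` are hypothetical.

```python
def mixup(x, y, lam):
    """Global blend: every pixel becomes lam * x + (1 - lam) * y."""
    return [[lam * a + (1.0 - lam) * b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def local_mix(x, y, top, left, height, width, lam):
    """Blend y into x only inside one rectangle; all other pixels of x are kept."""
    mixed = [row[:] for row in x]  # copy so the input image is not modified
    for r in range(top, top + height):
        for c in range(left, left + width):
            mixed[r][c] = lam * x[r][c] + (1.0 - lam) * y[r][c]
    return mixed

# Toy 4x4 images: x is all ones, y is all zeros.
x = [[1.0] * 4 for _ in range(4)]
y = [[0.0] * 4 for _ in range(4)]
globally_mixed = mixup(x, y, lam=0.5)
locally_mixed = local_mix(x, y, top=1, left=1, height=2, width=2, lam=0.5)
```

Here `globally_mixed` changes all 16 pixels, while `locally_mixed` changes only the 4 pixels inside the rectangle, which is the sense in which local mixing leaves the global content of the image intact.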

[49] EHWGesture – A dataset for multimodal understanding of clinical gestures

Gianluca Amprimo,Alberto Ancilotto,Alessandro Savino,Fabio Quazzolo,Claudia Ferraris,Gabriella Olmo,Elisabetta Farella,Stefano Di Carlo

Main category: cs.CV

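TL;DR: This paper introduces EHWGesture, a dataset for multimodal understanding of clinical gestures, with rich gesture data and precise annotations that support gesture classification, gesture trigger detection, and action quality assessment.

Motivation: Existing dynamic gesture recognition datasets lack multimodal and multi-view diversity and precise ground-truth annotations, and they do not include an action quality assessment task.

Contribution: Presents the EHWGesture dataset, with over 1,100 recordings of clinically relevant gestures, multimodal input from RGB-Depth cameras and an event camera, and high-precision hand landmark annotations from a motion capture system.

Method: The dataset is captured with two high-resolution RGB-Depth cameras and an event camera, synchronized and spatially calibrated with a motion capture system. Recordings are organized into execution-speed classes to assess action quality.

Result: Baseline experiments demonstrate the dataset's potential for gesture classification, gesture trigger detection, and action quality assessment.

Insight: EHWGesture fills the gap in multimodal datasets for dynamic gesture understanding and embeds action quality assessment in the task, providing a new benchmark for clinical gesture research.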

Abstract: Hand gesture understanding is essential for several applications in human-computer interaction, including automatic clinical assessment of hand dexterity. While deep learning has advanced static gesture recognition, dynamic gesture understanding remains challenging due to complex spatiotemporal variations. Moreover, existing datasets often lack multimodal and multi-view diversity, precise ground-truth tracking, and an action quality component embedded within gestures. This paper introduces EHWGesture, a multimodal video dataset for gesture understanding featuring five clinically relevant gestures. It includes over 1,100 recordings (6 hours), captured from 25 healthy subjects using two high-resolution RGB-Depth cameras and an event camera. A motion capture system provides precise ground-truth hand landmark tracking, and all devices are spatially calibrated and synchronized to ensure cross-modal alignment. Moreover, to embed an action quality task within gesture understanding, collected recordings are organized in classes of execution speed that mirror clinical evaluations of hand dexterity. Baseline experiments highlight the dataset’s potential for gesture classification, gesture trigger detection, and action quality assessment. Thus, EHWGesture can serve as a comprehensive benchmark for advancing multimodal clinical gesture understanding.

[50] HU-based Foreground Masking for 3D Medical Masked Image Modeling

Jin Lee,Vu Dang,Gwang-Hyun Yu,Anh Le,Zahid Rahman,Jin-Ho Jang,Heonzoo Lee,Kun-Yung Kim,Jin-Sul Kim,Jin-Young Kim

Main category: cs.CV

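TL;DR: The paper proposes a Hounsfield Unit (HU)-based foreground masking strategy to improve masked image modeling (MIM) on 3D medical images, addressing the fact that random masking overlooks the density of anatomical objects.

Motivation: Random masking is of limited effectiveness on 3D medical images because it ignores the density and importance of anatomical objects. Non-tissue regions such as air and fluid lack diagnostically meaningful features.

Contribution: Proposes HU-based Foreground Masking, which uses HU measurements to separate tissue from non-tissue regions and consistently improves medical image segmentation performance.

Method: Designs the masking strategy around the HU intensity distribution, focusing on visceral organ regions and excluding uninformative non-tissue regions.

Result: On five public 3D medical imaging datasets (BTCV, Flare22, MM-WHS, Amos22, BraTS), both segmentation quality and Dice score improve consistently.

Insight: MIM for medical images needs domain-specific masking strategies, and HU values are a simple yet effective way to distinguish regions.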

Abstract: While Masked Image Modeling (MIM) has revolutionized fields of computer vision, its adoption in 3D medical image computing has been limited by the use of random masking, which overlooks the density of anatomical objects. To address this limitation, we enhance the pretext task with a simple yet effective masking strategy. Leveraging Hounsfield Unit (HU) measurements, we implement an HU-based Foreground Masking, which focuses on the intensity distribution of visceral organs and excludes non-tissue regions, such as air and fluid, that lack diagnostically meaningful features. Extensive experiments on five public 3D medical imaging datasets demonstrate that our masking consistently improves performance, both in quality of segmentation and Dice score (BTCV: 84.64%, Flare22: 92.43%, MM-WHS: 90.67%, Amos22: 88.64%, BraTS: 78.55%). These results underscore the importance of domain-centric MIM and suggest a promising direction for representation learning in medical image segmentation. Implementation is available at github.com/AISeedHub/SubFore/.
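
The idea of restricting mask candidates to tissue can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name, the HU window of -200 to 300, the patch size, and the 50% foreground threshold are all hypothetical choices. Only the reference points are standard (air is about -1000 HU and water is 0 HU).

```python
import numpy as np

def foreground_patch_mask(volume_hu, patch=4, hu_min=-200.0, hu_max=300.0,
                          min_fraction=0.5, mask_ratio=0.5, seed=0):
    """Pick patches to mask only among patches that are mostly tissue.

    volume_hu: 3D array of CT intensities in Hounsfield Units.
    Returns (mask, foreground), two boolean arrays over the patch grid.
    """
    d, h, w = volume_hu.shape
    gd, gh, gw = d // patch, h // patch, w // patch
    vol = volume_hu[:gd * patch, :gh * patch, :gw * patch]

    # Fraction of voxels per patch whose HU value falls in the assumed window.
    in_window = (vol >= hu_min) & (vol <= hu_max)
    blocks = in_window.reshape(gd, patch, gh, patch, gw, patch)
    fraction = blocks.mean(axis=(1, 3, 5))
    foreground = fraction >= min_fraction

    # Sample masked patches from foreground candidates only.
    rng = np.random.default_rng(seed)
    candidates = np.flatnonzero(foreground)
    n_mask = int(round(mask_ratio * candidates.size))
    chosen = rng.choice(candidates, size=n_mask, replace=False)
    mask = np.zeros(foreground.size, dtype=bool)
    mask[chosen] = True
    return mask.reshape(foreground.shape), foreground

# Toy volume: air everywhere except a soft-tissue slab (about 40 HU).
volume = np.full((8, 8, 8), -1000.0)
volume[:, :, :4] = 40.0
mask, foreground = foreground_patch_mask(volume)
```

With random masking, half of the masked patches in this toy volume would on average fall in air; here every masked patch lies in the tissue slab.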

[51] TextlessRAG: End-to-End Visual Document RAG by Speech Without Text

Peijin Xie,Shun Qian,Bingquan Liu,Dexin Wang,Lin Sun,Xiangzheng Zhang

Main category: cs.CV

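TL;DR: TextlessRAG is an end-to-end framework for question answering over visual documents that answers spoken queries directly, eliminating the conventional ASR, TTS, and OCR steps, and adds a layout-aware reranking mechanism to improve retrieval.

Motivation: Document images contain rich knowledge, and spoken queries are flexible and portable, but no existing method combines the two. TextlessRAG aims to enable document question answering driven directly by speech, avoiding the complexity of intermediate text conversion.

Contribution: 1. The first end-to-end speech-driven framework for visual document question answering; 2. elimination of the ASR, TTS, and OCR steps; 3. a layout-aware reranking mechanism; 4. release of the first bilingual (Chinese and English) speech-document RAG dataset.

Method: A fully textless pipeline interprets speech directly, retrieves relevant visual knowledge, and generates answers, with layout-aware reranking refining the retrieval.

Result: Experiments show substantial improvements in both efficiency and accuracy.

Insight: An end-to-end speech-driven visual document QA system can bypass conventional text conversion steps and reduce information loss, and layout information is valuable for document retrieval.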

Abstract: Document images encapsulate a wealth of knowledge, while the portability of spoken queries enables broader and flexible application scenarios. Yet, no prior work has explored knowledge base question answering over visual document images with queries provided directly in speech. We propose TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. Unlike prior methods, TextlessRAG eliminates ASR, TTS and OCR, directly interpreting speech, retrieving relevant visual knowledge, and generating answers in a fully textless pipeline. To further boost performance, we integrate a layout-aware reranking mechanism to refine retrieval. Experiments demonstrate substantial improvements in both efficiency and accuracy. To advance research in this direction, we also release the first bilingual speech–document RAG dataset, featuring Chinese and English voice queries paired with multimodal document content. Both the dataset and our pipeline will be made available at https://github.com/xiepeijinhit-hue/textlessrag

[52] PanoLAM: Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image

Peng Li,Yisheng He,Yingdong Hu,Yuan Dong,Weihao Yuan,Yuan Liu,Zilong Dong,Yike Guo

Main category: cs.CV

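TL;DR: PanoLAM is a fast, single-forward-pass framework for Gaussian full-head synthesis that needs no GAN inversion or test-time optimization. It is trained on synthetic data and reconstructs efficiently through a coarse-to-fine Gaussian generation pipeline and a dual-branch structure.

Motivation: Current methods for full-head reconstruction from a single unposed image rely on time-consuming GAN inversion and optimization, and a fast, efficient reconstruction framework is lacking.

Contribution: 1. A fast single-forward-pass framework for Gaussian full-head synthesis; 2. synthetic data to address the scarcity of real 3D head data; 3. a coarse-to-fine Gaussian generation pipeline and a dual-branch structure that improve reconstruction efficiency and quality.

Method: 1. Train on a synthetic dataset; 2. obtain a coarse reconstruction by letting sparse FLAME points interact with image features; 3. densify the points for high-fidelity reconstruction; 4. combine spherical triplane features and point-based features in a dual-branch structure.

Result: Experiments show that the framework outperforms existing methods in efficiency and reconstruction quality.

Insight: Synthetic data from pretrained 3D GANs and dual-branch feature aggregation can significantly improve single-pass reconstruction.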

Abstract: We present a feed-forward framework for Gaussian full-head synthesis from a single unposed image. Unlike previous work that relies on time-consuming GAN inversion and test-time optimization, our framework can reconstruct the Gaussian full-head model given a single unposed image in a single forward pass. This enables fast reconstruction and rendering during inference. To mitigate the lack of large-scale 3D head assets, we propose a large-scale synthetic dataset from trained 3D GANs and train our framework using only synthetic data. For efficient high-fidelity generation, we introduce a coarse-to-fine Gaussian head generation pipeline, where sparse points from the FLAME model interact with the image features by transformer blocks for feature extraction and coarse shape reconstruction, which are then densified for high-fidelity reconstruction. To fully leverage the prior knowledge residing in pretrained 3D GANs for effective reconstruction, we propose a dual-branch framework that effectively aggregates the structured spherical triplane feature and unstructured point-based features for more effective Gaussian head reconstruction. Experimental results show the effectiveness of our framework compared with existing work.

[53] Attention Maps in 3D Shape Classification for Dental Stage Estimation with Class Node Graph Attention Networks

Barkin Buyukcakir,Rocharles Cavalcante Fontenele,Reinhilde Jacobs,Jannick De Tobel,Patrick Thevissen,Dirk Vandermeulen,Peter Claes

Main category: cs.CV

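TL;DR: This paper proposes a new architecture, the Class Node Graph Attention Network (CGAT), for 3D shape classification, specifically dental stage estimation. The model uses graph attention convolutions and visualized attention to improve transparency and classification performance.

Motivation: Deep learning lacks transparency in high-stakes applications such as medicine, which hinders its adoption. CGAT aims to address this with an interpretable attention mechanism.

Contribution: 1. Proposes the CGAT architecture, combining graph attention with a global CLS node; 2. evaluates the effect of node features (local mean curvature and distance to centroid) on performance and interpretability; 3. demonstrates CGAT on dental data and discusses its broader applicability.

Method: Uses graph attention convolutions and attention rollout, and studies the effect of different node features (local mean curvature, distance to centroid) on the model. Introduces a global CLS node to improve the interpretability of the attention maps.

Result: Combining local mean curvature and distance to centroid as node features yields a slight performance increase (weighted F1 score of 0.76) and more comprehensive attention visualizations.

Insight: Attention maps produced via the global CLS node are more intuitive and help improve model transparency and trust in high-stakes applications. CGAT can be extended to other graph-based classification tasks.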

Abstract: Deep learning offers a promising avenue for automating many recognition tasks in fields such as medicine and forensics. However, the black-box nature of these models hinders their adoption in high-stakes applications where trust and accountability are required. For 3D shape recognition tasks in particular, this paper introduces the Class Node Graph Attention Network (CGAT) architecture to address this need. Applied to 3D meshes of third molars derived from CBCT images, for Demirjian stage allocation, CGAT utilizes graph attention convolutions and an inherent attention mechanism, visualized via attention rollout, to explain its decision-making process. We evaluated the local mean curvature and distance to centroid node features, both individually and in combination, as well as model depth, finding that models incorporating directed edges to a global CLS node produced more intuitive attention maps, while also yielding desirable classification performance. We analyzed the attention-based explanations of the models, and their predictive performances to propose optimal settings for the CGAT. The combination of local mean curvature and distance to centroid as node features yielded a slight performance increase with 0.76 weighted F1 score, and more comprehensive attention visualizations. The CGAT architecture’s ability to generate human-understandable attention maps can enhance trust and facilitate expert validation of model decisions. While demonstrated on dental data, CGAT is broadly applicable to graph-based classification and regression tasks, promoting wider adoption of transparent and competitive deep learning models in high-stakes environments.
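
For readers unfamiliar with attention rollout, the generic form for dense attention matrices can be sketched as follows. This is the standard recipe (mix each layer's attention with the identity to account for residual connections, renormalize rows, then multiply across layers); how CGAT adapts it to graph attention with a CLS node is not described in the abstract, so the dense-matrix setting here is an assumption.

```python
import numpy as np

def attention_rollout(attentions):
    """Combine per-layer attention matrices into one input-to-output map.

    attentions: list of (n, n) row-stochastic matrices, first layer first.
    """
    n = attentions[0].shape[0]
    rollout = np.eye(n)
    for attn in attentions:
        # Add the identity to model the residual connection, then renormalize.
        a = 0.5 * attn + 0.5 * np.eye(n)
        a = a / a.sum(axis=-1, keepdims=True)
        # Later layers are applied on the left.
        rollout = a @ rollout
    return rollout

# Two layers of uniform attention over three nodes.
uniform = np.full((3, 3), 1.0 / 3.0)
rollout = attention_rollout([uniform, uniform])
```

Each row of the result still sums to one, so a row can be read as how much every input node contributes to that output node after all layers.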

[54] Temporal Image Forensics: A Review and Critical Evaluation

Robert Jöchl,Andreas Uhl

Main category: cs.CV

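TL;DR: This review surveys and critically evaluates temporal image forensics techniques based on time-dependent traces, proposes a new forensic setting, verifies the main properties of in-field sensor defects, shows that a method may exploit content bias rather than genuine age traces, and discusses the importance of explainable AI.

Motivation: Temporal image forensics aims to estimate the age of a digital image, but existing methods may rely on content bias rather than genuine age traces, so more reliable forensic techniques and explainability methods are needed.

Contribution: 1. Proposes a more realistic forensic setting; 2. verifies the growth rate and spatial distribution of in-field sensor defects; 3. shows that one method actually relies on content bias rather than genuine traces; 4. investigates the features learned by a neural network; 5. shows how easily a neural network can be distracted.

Method: Reviews existing work, re-implements experiments where required, verifies sensor defect properties, analyzes learned features, and discusses explainable AI methods.

Result: Finds that some methods actually rely on content bias, that neural networks are easily distracted, and that explainable AI helps verify reliability.

Insight: The reliability of temporal image forensics techniques must be verified carefully to rule out content bias, and explainable AI is a key tool for doing so.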

Abstract: Temporal image forensics is the science of estimating the age of a digital image. Usually, time-dependent traces (age traces) introduced by the image acquisition pipeline are exploited for this purpose. In this review, a comprehensive overview of the field of temporal image forensics based on time-dependent traces from the image acquisition pipeline is given. This includes a detailed insight into the properties of known age traces (i.e., in-field sensor defects and sensor dust) and temporal image forensics techniques. Another key aspect of this work is to highlight the problem of content bias and to illustrate how important eXplainable Artificial Intelligence methods are to verify the reliability of temporal image forensics techniques. Apart from reviewing material presented in previous works, in this review: (i) a new (probably more realistic) forensic setting is proposed; (ii) the main properties (growth rate and spatial distribution) of in-field sensor defects are verified; (iii) it is shown that a method proposed to utilize in-field sensor defects for image age approximation actually exploits other traces (most likely content bias); (iv) the features learned by a neural network dating palmprint images are further investigated; (v) it is shown how easily a neural network can be distracted from learning age traces. For this purpose, previous work is analyzed, re-implemented if required and experiments are conducted.

[55] Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation

Yusuke Hirota,Ryo Hachiuma,Boyi Li,Ximing Lu,Michael Ross Boone,Boris Ivanovic,Yejin Choi,Marco Pavone,Yu-Chiang Frank Wang,Noa Garcia,Yuta Nakashima,Chao-Han Huck Yang

Main category: cs.CV

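TL;DR: The paper points out that gender bias evaluations of vision-language foundation models (VLMs) may be distorted because benchmark data contain spurious correlations between gender and non-gender features such as objects and backgrounds. By systematically perturbing non-gender features, it finds that small changes can substantially alter bias scores, which calls the reliability of current evaluations into question, and it recommends reporting feature-sensitivity measurements alongside bias metrics.

Motivation: Current gender bias benchmarks contain spurious feature correlations that may make evaluation results unreliable, so the effect of these spurious features on bias evaluation needs to be verified.

Contribution: Reveals the substantial effect of spurious features on gender bias evaluation, quantifies it through perturbation experiments, and offers recommendations for more reliable evaluation.

Method: Systematically perturbs non-gender features (e.g., object masking and background blurring) in four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and analyzes the resulting changes in bias scores.

Result: Even masking 10% of objects or weakly blurring backgrounds can change bias scores dramatically (by up to 175% in generative VLMs and 43% in CLIP variants), suggesting that current bias scores often reflect responses to spurious features rather than gender bias.

Insight: Because benchmarks entirely free of spurious features are hard to build, the paper recommends measuring feature sensitivity whenever bias metrics are reported, to make evaluation more reliable.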

Abstract: Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do spurious features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias evaluation. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to spurious features rather than gender bias, undermining their reliability. Since creating spurious feature-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside feature-sensitivity measurements to enable a more reliable bias assessment.
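
The reported shifts are relative changes in a bias score. The arithmetic can be sketched as below; the scores and perturbation names are hypothetical values chosen only to illustrate how a figure such as 175% arises, and the abstract does not state that the paper computes sensitivity in exactly this way.

```python
def relative_shift_percent(original_score, perturbed_score):
    """Relative change of a bias score after a perturbation, in percent."""
    if original_score == 0:
        raise ValueError("relative shift is undefined for a zero baseline")
    return abs(perturbed_score - original_score) / abs(original_score) * 100.0

# Hypothetical bias scores before and after two perturbations.
baseline = 0.08
perturbed = {"mask_10pct_objects": 0.22, "weak_background_blur": 0.10}

sensitivity = {name: relative_shift_percent(baseline, score)
               for name, score in perturbed.items()}
worst_case = max(sensitivity.values())  # roughly 175 for these toy numbers
```

Reporting `worst_case` next to the baseline score is one simple way to follow the recommendation of pairing a bias metric with a feature-sensitivity measurement.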

[56] Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer’s Disease

Fangqi Cheng,Surajit Ray,Xiaochen Yang

Main category: cs.CV

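TL;DR: The paper proposes a data-efficient fine-tuning method that adapts 3D CT-based medical vision-language models (Med-VLMs) to 3D MRI and applies it to Alzheimer's disease (AD) diagnosis. By converting structured metadata into synthetic reports and adding an auxiliary token that predicts the MMSE score, it reaches SOTA performance with less data.

Motivation: Existing Med-VLMs perform poorly on 3D medical imaging and underutilize patient metadata and clinical diagnostic knowledge. Large-scale training or fine-tuning also requires extensive computational resources, and the absence of structural information further limits performance on 3D imaging.

Contribution: 1. A data-efficient fine-tuning pipeline for AD diagnosis from 3D MRI; 2. conversion of structured metadata into synthetic reports to improve image-text alignment; 3. an auxiliary token that predicts the MMSE score, providing additional supervision for fine-tuning.

Method: 1. Convert structured metadata into synthetic reports to enrich the text input; 2. add an auxiliary token that predicts the MMSE score; 3. apply lightweight prompt tuning to both the image and text modalities.

Result: Using only 1,500 training images, the method achieves SOTA performance on two AD datasets, outperforming existing methods fine-tuned on 10,000 images.

Insight: Combining synthetic reports with supervision from an auxiliary task can substantially improve Med-VLMs on 3D imaging diagnosis when data are limited.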

Abstract: Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering, but they still face several limitations. Most notably, they underutilize patient metadata and lack integration of clinical diagnostic knowledge. Moreover, most existing models are typically trained from scratch or fine-tuned on large-scale 2D image-text pairs, requiring extensive computational resources, and their effectiveness on 3D medical imaging is often limited due to the absence of structural information. To address these gaps, we propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs for 3D MRI and demonstrate its application in Alzheimer’s disease (AD) diagnosis. Our system introduces two key innovations. First, we convert structured metadata into synthetic reports, enriching textual input for improved image-text alignment. Second, we add an auxiliary token trained to predict the mini-mental state examination (MMSE) score, a widely used clinical measure of cognitive function that correlates with AD severity. This provides additional supervision for fine-tuning. Applying lightweight prompt tuning to both image and text modalities, our approach achieves state-of-the-art performance on two AD datasets using 1,500 training images, outperforming existing methods fine-tuned on 10,000 images. Code will be released upon publication.

[57] Self-Supervised Cross-Encoder for Neurodegenerative Disease Diagnosis

Fangqi Cheng,Yingying Zhao,Xiaochen Yang

Main category: cs.CV

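TL;DR: The paper proposes a self-supervised cross-encoder framework for diagnosing neurodegenerative diseases that uses the temporal continuity of longitudinal MRI scans as supervision, achieving high classification accuracy and interpretability.

Motivation: Existing methods rely on large amounts of labeled data and lack interpretability. The paper addresses both problems with self-supervised learning on temporal data.

Contribution: 1. A self-supervised cross-encoder framework; 2. disentanglement of learned representations into static and dynamic components; 3. strong generalization demonstrated on the ADNI, OASIS, and PPMI datasets.

Method: Constrains the static representation with contrastive learning, guides the dynamic representation with input-gradient regularization, and uses the temporal information in longitudinal MRI for self-supervised learning.

Result: Superior classification accuracy on ADNI, zero-shot generalization on OASIS, and cross-task generalization on PPMI.

Insight: Self-supervised learning and temporal continuity can improve both performance and interpretability in neurodegenerative disease diagnosis.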

Abstract: Deep learning has shown significant potential in diagnosing neurodegenerative diseases from MRI data. However, most existing methods rely heavily on large volumes of labeled data and often yield representations that lack interpretability. To address both challenges, we propose a novel self-supervised cross-encoder framework that leverages the temporal continuity in longitudinal MRI scans for supervision. This framework disentangles learned representations into two components: a static representation, constrained by contrastive learning, which captures stable anatomical features; and a dynamic representation, guided by input-gradient regularization, which reflects temporal changes and can be effectively fine-tuned for downstream classification tasks. Experimental results on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset demonstrate that our method achieves superior classification accuracy and improved interpretability. Furthermore, the learned representations exhibit strong zero-shot generalization on the Open Access Series of Imaging Studies (OASIS) dataset and cross-task generalization on the Parkinson Progression Marker Initiative (PPMI) dataset. The code for the proposed method will be made publicly available.

[58] Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity

Sung Ju Lee,Nam Ik Cho

Main category: cs.CV

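TL;DR: The paper proposes Hermitian Symmetric Fourier Watermarking (SFW), which preserves frequency integrity by enforcing Hermitian symmetry and thereby addresses the detection degradation that semantic watermarking suffers when frequency integrity is lost. It also introduces a center-aware embedding strategy to improve robustness against cropping attacks. Experiments show strong performance across a range of attack scenarios.

Motivation: Semantic watermarking for latent diffusion models (LDMs) is robust against regeneration attacks, but the loss of frequency integrity degrades detection performance. The authors propose a new embedding method to address this.

Contribution: 1. The SFW method, which preserves frequency integrity by enforcing Hermitian symmetry; 2. a center-aware embedding strategy that improves robustness to cropping attacks; 3. experimental validation of the approach.

Method: 1. SFW: preserve frequency integrity through Hermitian symmetry; 2. center-aware embedding: ensure robust information retention under cropping attacks; 3. apply both techniques to existing semantic watermarking schemes.

Result: Experiments show that the method achieves the highest detection accuracy across attack scenarios while maintaining superior image fidelity (as evidenced by FID and CLIP scores). The center-aware embedding strategy is effective against cropping attacks.

Insight: The method shows how to balance robustness and image fidelity in semantic watermarking, providing an effective framework for this inherent trade-off.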

Abstract: Semantic watermarking techniques for latent diffusion models (LDMs) are robust against regeneration attacks, but often suffer from detection performance degradation due to the loss of frequency integrity. To tackle this problem, we propose a novel embedding method called Hermitian Symmetric Fourier Watermarking (SFW), which maintains frequency integrity by enforcing Hermitian symmetry. Additionally, we introduce a center-aware embedding strategy that reduces the vulnerability of semantic watermarking due to cropping attacks by ensuring robust information retention. To validate our approach, we apply these techniques to existing semantic watermarking schemes, enhancing their frequency-domain structures for better robustness and retrieval accuracy. Extensive experiments demonstrate that our methods achieve state-of-the-art verification and identification performance, surpassing previous approaches across various attack scenarios. Ablation studies confirm the impact of SFW on detection capabilities, the effectiveness of the center-aware embedding against cropping, and how message capacity influences identification accuracy. Notably, our method achieves the highest detection accuracy while maintaining superior image fidelity, as evidenced by FID and CLIP scores. Conclusively, our proposed SFW is shown to be an effective framework for balancing robustness and image fidelity, addressing the inherent trade-offs in semantic watermarking. Code available at https://github.com/thomas11809/SFWMark
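
Why Hermitian symmetry matters can be checked numerically: a spectrum satisfying F[-k] = conj(F[k]) has a purely real inverse FFT, so nothing is lost when the signal is treated as real-valued. The sketch below is a generic NumPy illustration of that property; the helper name is hypothetical and it does not reproduce the paper's embedding in the latent space of a diffusion model.

```python
import numpy as np

def enforce_hermitian_symmetry(spectrum):
    """Project a 2D complex spectrum onto the set with F[-k] == conj(F[k])."""
    # reflected[i, j] == spectrum[-i mod H, -j mod W]
    reflected = np.roll(np.flip(spectrum, axis=(0, 1)), shift=1, axis=(0, 1))
    return 0.5 * (spectrum + np.conj(reflected))

rng = np.random.default_rng(0)
raw = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))
sym = enforce_hermitian_symmetry(raw)

# Largest imaginary component of the spatial-domain signal in each case.
imag_raw = np.abs(np.fft.ifft2(raw).imag).max()
imag_sym = np.abs(np.fft.ifft2(sym).imag).max()
```

For the unconstrained spectrum the spatial signal has a sizable imaginary part that would be discarded when only the real part is kept, while the symmetrized spectrum yields an imaginary part at floating-point noise level.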

[59] Beyond Motion Cues and Structural Sparsity: Revisiting Small Moving Target Detection

Guoyi Zhang,Siyang Chen,Guangsheng Xu,Zhihua Shen,Han Wang,Xiaohu Zhang

Main category: cs.CV

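TL;DR: The paper proposes TenRPCANet, a deep learning framework that treats small target detection as tensor low-rank and sparse decomposition, implicitly enforces low-rank priors through self-attention, and achieves state-of-the-art performance on infrared and space object detection.

Motivation: Small target detection is highly challenging because of low signal-to-noise ratios and cluttered backgrounds. Existing methods rely on target-specific features or motion cues and lack robustness in complex environments. The paper couples background and target modeling and uses the low-rank structure of the background as a stable prior to improve detection.

Contribution: 1. The TenRPCANet framework, which reformulates small target detection as a tensor low-rank and sparse decomposition task; 2. a self-attention-based tokenization strategy that implicitly enforces multi-order tensor low-rank priors; 3. a feature refinement module that enhances target saliency.

Method: 1. Theoretical analysis of the tensor structure of the background, target, and noise; 2. self-attention models the low-rank background without explicit iterative optimization; 3. a feature refinement module inspired by the sparse component update in tensor RPCA.

Result: Achieves SOTA performance on multi-frame infrared small target detection and space object detection, demonstrating effectiveness and generalizability.

Insight: Low-rank structure in cluttered backgrounds can serve as a stable prior and be modeled jointly with target detection. Self-attention captures both local and non-local self-similarity and can replace conventional low-rank optimization.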

Abstract: Small moving target detection is crucial for many defense applications but remains highly challenging due to low signal-to-noise ratios, ambiguous visual cues, and cluttered backgrounds. In this work, we propose a novel deep learning framework that differs fundamentally from existing approaches, which often rely on target-specific features or motion cues and tend to lack robustness in complex environments. Our key insight is that small target detection and background discrimination are inherently coupled: even cluttered video backgrounds often exhibit strong low-rank structures that can serve as stable priors for detection. We reformulate the task as a tensor-based low-rank and sparse decomposition problem and conduct a theoretical analysis of the background, target, and noise components to guide model design. Building on these insights, we introduce TenRPCANet, a deep neural network that requires minimal assumptions about target characteristics. Specifically, we propose a tokenization strategy that implicitly enforces multi-order tensor low-rank priors through a self-attention mechanism. This mechanism captures both local and non-local self-similarity to model the low-rank background without relying on explicit iterative optimization. In addition, inspired by the sparse component update in tensor RPCA, we design a feature refinement module to enhance target saliency. The proposed method achieves state-of-the-art performance on two highly distinct and challenging tasks: multi-frame infrared small target detection and space object detection. These results demonstrate both the effectiveness and the generalizability of our approach.
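
The low-rank plus sparse view of a video can be illustrated with a deliberately simplified matrix version: a truncated SVD estimates the static background, and soft-thresholding the residual keeps only the strong, sparse deviations. This is a one-step toy, not tensor RPCA and not TenRPCANet; the rank, the threshold `tau`, and the synthetic scene are assumptions.

```python
import numpy as np

def soft_threshold(x, tau):
    """Shrink values toward zero; entries with |x| <= tau become exactly 0."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def lowrank_sparse_split(frames, rank=1, tau=1.0):
    """Split (T, H, W) frames into a low-rank background and a sparse part."""
    t, h, w = frames.shape
    m = frames.reshape(t, h * w).T               # one column per frame
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    low = (u[:, :rank] * s[:rank]) @ vt[:rank]   # best rank-r approximation
    sparse = soft_threshold(m - low, tau)        # strong residuals only
    return low.T.reshape(t, h, w), sparse.T.reshape(t, h, w)

# Synthetic scene: a static textured background plus one bright pixel that
# moves to a new location in every frame.
rng = np.random.default_rng(0)
background = 10.0 + rng.random((8, 8))
frames = np.repeat(background[None], 10, axis=0)
for k in range(10):
    frames[k, k // 8, k % 8] += 5.0

low, sparse = lowrank_sparse_split(frames)
detected = [int(np.argmax(np.abs(sparse[k]))) for k in range(10)]
```

In this toy scene the largest sparse response in frame `k` sits at flat pixel index `k`, the location of the injected target, even though no target-specific feature or motion cue was used.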

[60] EDFFDNet: Towards Accurate and Efficient Unsupervised Multi-Grid Image Registration

Haokai Zhu,Bo Qu,Si-Yuan Cao,Runmin Zhang,Shujie Chen,Bailin Yang,Hui-Liang Shen

Main category: cs.CV

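TL;DR: EDFFDNet is a free-form deformation network with an exponential-decay basis function for accurate and efficient unsupervised multi-grid image registration. With an Adaptive Sparse Motion Aggregator and a progressive correlation refinement strategy, it reduces parameters and computational cost while performing well in scenes with depth disparities.

Motivation: Existing deep image registration methods have inherent limitations on real scenes containing depth disparities, which leads to low efficiency and insufficient accuracy.

Contribution: 1) EDFFDNet, a free-form deformation network based on an exponential-decay basis function; 2) the Adaptive Sparse Motion Aggregator (ASMA); 3) a progressive correlation refinement strategy that improves efficiency and accuracy.

Method: Uses EDFFDNet with ASMA in place of the conventional MLP motion aggregator, and refines motion estimation coarse-to-fine with progressive correlation refinement.

Result: Parameters, memory, and total runtime are reduced by 70.5%, 32.6%, and 33.7%, respectively, with a 0.5 dB PSNR gain. With an additional local refinement stage, EDFFDNet-2 improves PSNR by a further 1.06 dB.

Insight: The exponential-decay basis function and sparse motion aggregation give stronger locality and efficiency in scenes with depth disparities while maintaining high accuracy.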

Abstract: Previous deep image registration methods that employ single homography, multi-grid homography, or thin-plate spline often struggle with real scenes containing depth disparities due to their inherent limitations. To address this, we propose an Exponential-Decay Free-Form Deformation Network (EDFFDNet), which employs free-form deformation with an exponential-decay basis function. This design achieves higher efficiency and performs well in scenes with depth disparities, benefiting from its inherent locality. We also introduce an Adaptive Sparse Motion Aggregator (ASMA), which replaces the MLP motion aggregator used in previous methods. By transforming dense interactions into sparse ones, ASMA reduces parameters and improves accuracy. Additionally, we propose a progressive correlation refinement strategy that leverages global-local correlation patterns for coarse-to-fine motion estimation, further enhancing efficiency and accuracy. Experiments demonstrate that EDFFDNet reduces parameters, memory, and total runtime by 70.5%, 32.6%, and 33.7%, respectively, while achieving a 0.5 dB PSNR gain over the state-of-the-art method. With an additional local refinement stage, EDFFDNet-2 further improves PSNR by 1.06 dB while maintaining lower computational costs. Our method also demonstrates strong generalization ability across datasets, outperforming previous deep learning methods.

[61] Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images

Boammani Aser Lompo,Marc Haraoui

Main category: cs.CV

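TL;DR: The paper presents Visual-TableQA, a large-scale, open-domain multimodal benchmark for reasoning over table images, with 6k reasoning-intensive QA pairs generated through multi-model collaboration.

Motivation: Existing benchmarks are limited in scale, diversity, and reasoning depth, especially for table images, and this gap needs to be filled.

Contribution: 1) The Visual-TableQA dataset; 2) a modular, scalable generation pipeline; 3) evidence that models fine-tuned on the dataset generalize well.

Method: A multi-model collaborative pipeline (generation, validation, inspiration) uses cross-model prompting and LLM-jury filtering to improve diversity and quality.

Result: Fine-tuned models perform well on external benchmarks, outperforming several proprietary models.

Insight: Multi-model collaborative generation improves dataset quality, and synthetic data can still yield strong generalization.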

Abstract: Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting (‘inspiration’) and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset’s synthetic nature. The full pipeline and resources are publicly available at https://github.com/AI-4-Everyone/Visual-TableQA.

[62] Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Xin Lai,Junyi Li,Wei Li,Tao Liu,Tianjian Li,Hengshuang Zhao

Main category: cs.CV

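TL;DR: Mini-o3 is a system that tackles challenging visual search tasks by scaling up tool-based interaction and deep multi-turn reasoning. Its core is diverse reasoning patterns combined with an optimized training strategy.

Motivation: Existing open-source methods show monotonous reasoning patterns and allow only a limited number of interaction turns on hard visual tasks, which limits performance. Mini-o3 addresses this.

Contribution: 1. Constructs the Visual Probe Dataset; 2. develops an iterative data collection pipeline that supports diverse reasoning patterns; 3. proposes an over-turn masking strategy that balances training-time efficiency with test-time scalability.

Method: Combines a diverse dataset, iteratively collected cold-start trajectories, and over-turn masking during reinforcement learning to achieve deep reasoning and accurate visual search.

Result: Mini-o3 achieves SOTA performance on challenging visual search tasks. Reasoning extends to tens of turns, and accuracy improves as the number of turns increases.

Insight: Multi-turn interaction and diverse reasoning patterns are essential for hard tasks, and a turn limit imposed during training can be exceeded at inference time with a suitable training strategy.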

Abstract: Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning – spanning tens of steps – and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.
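
One way to read over-turn masking is that trajectories which hit the turn limit are excluded from the policy update instead of being scored as failures. The sketch below shows that reading with a simple mean-baseline advantage. The function name, the baseline choice, and the toy rewards are assumptions, since the abstract only states that over-turn responses are not penalized.

```python
import numpy as np

def masked_advantages(rewards, hit_turn_limit):
    """Advantages for a group of rollouts, ignoring over-turn trajectories.

    rewards: per-trajectory scalar rewards.
    hit_turn_limit: True where the rollout stopped at the maximum turn count.
    Over-turn rollouts get advantage 0, so they neither push the policy
    toward nor away from long trajectories.
    """
    rewards = np.asarray(rewards, dtype=float)
    keep = ~np.asarray(hit_turn_limit, dtype=bool)
    advantages = np.zeros_like(rewards)
    if keep.any():
        # The baseline is computed over the kept trajectories only.
        advantages[keep] = rewards[keep] - rewards[keep].mean()
    return advantages, keep

# Four rollouts; the third ran out of turns before producing an answer.
adv, keep = masked_advantages([1.0, 0.0, 0.0, 1.0],
                              [False, False, True, False])
```

Without the mask, the third rollout would receive a negative advantage just for being long, which discourages the multi-turn exploration that the model is expected to use at inference time.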

[63] CAViAR: Critic-Augmented Video Agentic Reasoning

Sachit Menon,Ahmet Iscen,Arsha Nagrani,Tobias Weyand,Carl Vondrick,Cordelia Schmid

Main category: cs.CV

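TL;DR: This paper proposes CAViAR (Critic-Augmented Video Agentic Reasoning), which pairs a large language model agent with video modules used as subagents or tools to solve complex video reasoning tasks. A critic that distinguishes successful from unsuccessful step sequences improves performance on complex video reasoning.

Motivation: Video understanding has made significant progress on perception from short clips, but performance still falls short on reasoning tasks with complex queries and longer videos. The paper asks how existing perception capabilities can be leveraged for more complex video reasoning.

Contribution: 1. The CAViAR framework, combining an LLM agent with video modules as subagents or tools; 2. a critic that evaluates the agent's step sequences; 3. strong performance on LVBench, Neptune, and ActivityNet-RTL.

Method: 1. An LLM acts as the main agent and calls video modules as subagents or tools; 2. unlike fixed-procedure approaches such as Visual Programming, ViperGPT, and MoReVQA, the agent decides subsequent steps based on each module's results; 3. a critic judges whether a step sequence is successful.

Result: Strong performance on complex video reasoning datasets including LVBench, Neptune, and ActivityNet-RTL.

Insight: Dynamic module calls combined with a critic make effective use of existing perception capabilities for complex video reasoning and avoid the limitations of fixed procedures.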

Abstract: Video understanding has seen significant progress in recent years, with models’ performance on perception from short clips continuing to rise. Yet, multiple recent benchmarks, such as LVBench, Neptune, and ActivityNet-RTL, show performance wanes for tasks requiring complex reasoning on videos as queries grow more complex and videos grow longer. In this work, we ask: can existing perception capabilities be leveraged to successfully perform more complex video reasoning? In particular, we develop a large language model agent given access to video modules as subagents or tools. Rather than following a fixed procedure to solve queries as in previous work such as Visual Programming, ViperGPT, and MoReVQA, the agent uses the results of each call to a module to determine subsequent steps. Inspired by work in the textual reasoning domain, we introduce a critic to distinguish between instances of successful and unsuccessful sequences from the agent. We show that the combination of our agent and critic achieves strong performance on the previously-mentioned datasets.

[64] XSRD-Net: EXplainable Stroke Relapse Detection

Christian Gapp,Elias Tappeiner,Martin Welk,Karl Fritscher,Stephanie Mangesius,Constantin Eisenschink,Philipp Deisl,Michael Knoflach,Astrid E. Grams,Elke R. Gizewski,Rainer Schubert

Main category: cs.CV

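TL;DR: XSRD-Net is an explainable deep learning model that uses multimodal data (imaging and tabular) to detect patients at risk of stroke relapse early and to predict relapse-free survival time. It performs well on binary relapse detection (AUC 0.84 with tabular data) and relapse time prediction (c-index 0.68), and it highlights a link between heart diseases, carotid arteries, and stroke relapse.

Motivation: Stroke is the second most frequent cause of death worldwide. Recurrence rates are 5-25% in the first year, and mortality for relapses is as high as 40%. Early detection of patients at risk of relapse is essential to reducing recurrence.

Contribution: 1) XSRD-Net, a multimodal deep learning model for stroke relapse detection and relapse time prediction; 2) an interpretability analysis linking heart diseases and carotid arteries to stroke relapse.

Method: 1) Collect 3D intracranial CTA images and tabular data (concomitant heart diseases, age, gender); 2) train single- and multimodal deep learning models for binary relapse detection and for relapse-free survival time prediction; 3) quantify modality contributions (vision:tabular = 0.68:0.32).

Result: 1) AUC of 0.84 on binary classification; 2) c-index of 0.68 and AUC of 0.71 for relapse time prediction; 3) the interpretability analysis highlights heart diseases and carotid arteries as important predictors.

Insight: Combining imaging and tabular data supports relapse time prediction, and the link to heart diseases and carotid arteries points to features worth examining in clinical follow-up.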

Abstract: Stroke is the second most frequent cause of death worldwide, with an annual mortality of around 5.5 million. Recurrence rates of stroke are between 5 and 25% in the first year. As mortality rates for relapses are extraordinarily high (40%), it is of utmost importance to reduce the recurrence rates. We address this issue by detecting patients at risk of stroke recurrence at an early stage in order to enable appropriate therapy planning. To this end, we collected 3D intracranial CTA image data and recorded concomitant heart diseases, the age, and the gender of stroke patients between 2010 and 2024. We trained single- and multimodal deep learning based neural networks for binary relapse detection (Task 1) and for relapse-free survival (RFS) time prediction together with a subsequent classification (Task 2). The separation of relapse from non-relapse patients (Task 1) could be solved with tabular data (AUC on test dataset: 0.84). However, for the main task, the regression (Task 2), our multimodal XSRD-Net processed the modalities vision:tabular with 0.68:0.32 according to modality contribution measures. The c-index with respect to relapses for the multimodal model reached 0.68, and the AUC is 0.71 for the test dataset. Finally, deeper interpretability analysis results highlighted a link between both heart diseases (tabular) and carotid arteries (vision) for the detection of relapses and the prediction of the RFS time. This is a central outcome that we strive to strengthen with ongoing data collection and model retraining.
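
The c-index reported for Task 2 measures how often a model orders pairs of patients correctly. A minimal pure-Python sketch of Harrell's concordance index follows. It assumes the model outputs a risk score where higher means earlier relapse (a predicted RFS time could be negated to obtain one), and the four patients are made-up values.

```python
def concordance_index(times, events, risk):
    """Harrell's c-index for right-censored data.

    times: observed follow-up time per patient.
    events: 1 if a relapse was observed at that time, 0 if censored.
    risk: predicted risk score; higher should mean earlier relapse.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when the earlier time is an observed event.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: the third patient is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.3, 0.5, 0.1]
c_index = concordance_index(times, events, risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which puts the reported 0.68 in context.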

[65] HairGS: Hair Strand Reconstruction based on 3D Gaussian Splatting

Yimin Pan,Matthias Nießner,Tobias Kirschstein

Main category: cs.CV

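TL;DR: The paper proposes HairGS, a strand-level hair geometry reconstruction method based on 3D Gaussian Splatting (3DGS). A multi-stage pipeline recovers hair strands from multi-view images, and a new evaluation metric serves as a proxy for topological accuracy.

Details Motivation: Human hair reconstruction is a challenging computer vision problem with growing importance for virtual reality and digital human modeling. Existing methods typically neglect the connectivity and topology of hair strands, while the efficient, explicit representation of 3DGS aligns naturally with strand structure.

Contribution: 1) Extends the 3DGS framework to strand-level hair geometry reconstruction; 2) proposes a multi-stage pipeline (reconstruct, merge, refine) that models strand topology; 3) introduces a new evaluation metric as a proxy for topological accuracy.

Method: Three stages: 1) reconstruct detailed hair geometry with a differentiable Gaussian rasterizer; 2) merge individual Gaussian segments into coherent strands through a novel merging scheme; 3) refine and grow the strands under photometric supervision.

Result: Experiments on synthetic and real-world datasets show that the method robustly handles a wide range of hairstyles and typically completes reconstruction within one hour.

Insight: The explicit representation of 3DGS suits strand modeling, and evaluating topology, not just geometry, is important for judging hair reconstruction quality.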

Abstract: Human hair reconstruction is a challenging problem in computer vision, with growing importance for applications in virtual reality and digital human modeling. Recent advances in 3D Gaussian Splatting (3DGS) provide efficient and explicit scene representations that naturally align with the structure of hair strands. In this work, we extend the 3DGS framework to enable strand-level hair geometry reconstruction from multi-view images. Our multi-stage pipeline first reconstructs detailed hair geometry using a differentiable Gaussian rasterizer, then merges individual Gaussian segments into coherent strands through a novel merging scheme, and finally refines and grows the strands under photometric supervision. While existing methods typically evaluate reconstruction quality at the geometric level, they often neglect the connectivity and topology of hair strands. To address this, we propose a new evaluation metric that serves as a proxy for assessing topological accuracy in strand reconstruction. Extensive experiments on both synthetic and real-world datasets demonstrate that our method robustly handles a wide range of hairstyles and achieves efficient reconstruction, typically completing within one hour. The project page can be found at: https://yimin-pan.github.io/hair-gs/

[66] RayGaussX: Accelerating Gaussian-Based Ray Marching for Real-Time and High-Quality Novel View Synthesis

Hugo Blanc,Jean-Emmanuel Deschaud,Alexis Paljic

Main category: cs.CV

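TL;DR: RayGaussX accelerates Gaussian-based volume ray marching with empty-space skipping, adaptive sampling, improved ray coherence, and scale regularization. It reaches real-time rendering while improving quality, with substantially faster training and inference.

Details Motivation: RayGauss achieves state-of-the-art novel-view synthesis quality on synthetic and indoor scenes, but its computational cost prevents real-time rendering on real-world scenes. RayGaussX targets this bottleneck.

Contribution: 1. Incorporates volumetric rendering acceleration strategies such as empty-space skipping and adaptive sampling; 2. enhances ray coherence and introduces scale regularization to reduce false-positive intersections; 3. proposes a new densification criterion that improves density distribution in distant regions.

Method: Builds on RayGauss, which renders via volume ray casting over a Bounding Volume Hierarchy (BVH), and adds empty-space skipping, adaptive sampling, ray-coherence improvements, and scale regularization to the rendering pipeline.

Result: On real-world datasets, training is 5x to 12x faster, rendering speed (FPS) is 50x to 80x higher, and PSNR improves by up to +0.56 dB.

Insight: Acceleration techniques combined with better density distribution allow real-time rendering without sacrificing quality, including on larger scenes.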

Abstract: RayGauss has achieved state-of-the-art rendering quality for novel-view synthesis on synthetic and indoor scenes by representing radiance and density fields with irregularly distributed elliptical basis functions, rendered via volume ray casting using a Bounding Volume Hierarchy (BVH). However, its computational cost prevents real-time rendering on real-world scenes. Our approach, RayGaussX, builds on RayGauss by introducing key contributions that accelerate both training and inference. Specifically, we incorporate volumetric rendering acceleration strategies such as empty-space skipping and adaptive sampling, enhance ray coherence, and introduce scale regularization to reduce false-positive intersections. Additionally, we propose a new densification criterion that improves density distribution in distant regions, leading to enhanced graphical quality on larger scenes. As a result, RayGaussX achieves 5x to 12x faster training and 50x to 80x higher rendering speeds (FPS) on real-world datasets while improving visual quality by up to +0.56 dB in PSNR. Project page with videos and code: https://raygaussx.github.io/.
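
To illustrate the idea behind empty-space skipping, here is a toy sketch that marches one ray through a 1D array of density samples and jumps over blocks known to be empty. It is an illustration under simplifying assumptions (precomputed per-sample densities, a fixed block size, a hypothetical `march` function) and does not reproduce the RayGaussX implementation or its BVH traversal.

```python
import math


def march(density, step, block=4, skip_empty=True):
    """Accumulate opacity along one ray through a 1D list of densities.

    density : density (sigma) at each sample position along the ray
    step    : distance between consecutive samples
    block   : number of samples covered by one coarse occupancy cell
    Returns (opacity, number_of_density_evaluations).
    """
    # Coarse occupancy: a block is empty if all of its samples are zero.
    # In a real renderer this structure is built once, not once per ray.
    occupied = [any(d > 0.0 for d in density[s:s + block])
                for s in range(0, len(density), block)]
    transmittance = 1.0
    evaluations = 0
    i = 0
    while i < len(density):
        if skip_empty and i % block == 0 and not occupied[i // block]:
            i += block  # jump over the whole empty block
            continue
        evaluations += 1
        transmittance *= math.exp(-density[i] * step)
        i += 1
    return 1.0 - transmittance, evaluations
```

Skipping changes only the number of evaluations, not the accumulated opacity, because empty samples contribute a factor of exactly 1 to the transmittance.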

[67] SplatFill: 3D Scene Inpainting via Depth-Guided Gaussian Splatting

Mahtab Dahaghin,Milind G. Padalkar,Matteo Toso,Alessio Del Bue

Main category: cs.CV

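TL;DR: SplatFill is a depth-guided Gaussian Splatting approach to 3D scene inpainting. It combines depth-based and object-based supervision with a consistency-aware refinement scheme to achieve higher visual fidelity and better training efficiency.

Details Motivation: 3D Gaussian Splatting (3DGS) produces highly realistic 3D scene representations, but inpainting missing regions (due to occlusion or scene editing) still leads to blurry details, artifacts, and inconsistent geometry.

Contribution: 1. Proposes SplatFill, a depth-guided method for 3DGS scene inpainting; 2. combines depth-based and object-based supervision so that inpainted Gaussians are placed accurately; 3. proposes a consistency-aware refinement scheme that corrects inconsistent regions.

Method: 1. Joint depth-based and object-based supervision places inpainted Gaussians accurately in 3D space and aligns them with the surrounding geometry; 2. consistency-aware refinement selectively identifies and corrects inconsistent regions without disrupting the rest of the scene.

Result: On the SPIn-NeRF dataset, SplatFill surpasses existing NeRF-based and 3DGS-based inpainting methods in visual fidelity and reduces training time by 24.5%, with sharper details, fewer artifacts, and greater coherence across viewpoints.

Insight: Combining depth and object supervision with consistency-aware refinement improves 3D scene inpainting quality while also improving efficiency.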

Abstract: 3D Gaussian Splatting (3DGS) has enabled the creation of highly realistic 3D scene representations from sets of multi-view images. However, inpainting missing regions, whether due to occlusion or scene editing, remains a challenging task, often leading to blurry details, artifacts, and inconsistent geometry. In this work, we introduce SplatFill, a novel depth-guided approach for 3DGS scene inpainting that achieves state-of-the-art perceptual quality and improved efficiency. Our method combines two key ideas: (1) joint depth-based and object-based supervision to ensure inpainted Gaussians are accurately placed in 3D space and aligned with surrounding geometry, and (2) a consistency-aware refinement scheme that selectively identifies and corrects inconsistent regions without disrupting the rest of the scene. Evaluations on the SPIn-NeRF dataset demonstrate that SplatFill not only surpasses existing NeRF-based and 3DGS-based inpainting methods in visual fidelity but also reduces training time by 24.5%. Qualitative results show our method delivers sharper details, fewer artifacts, and greater coherence across challenging viewpoints.

[68] Point Linguist Model: Segment Any Object via Bridged Large 3D-Language Model

Zhuoxu Huang,Mingqi Gao,Jungong Han

Main category: cs.CV

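TL;DR: The paper proposes the Point Linguist Model (PLM), a general framework that bridges the representation gap between large language models (LLMs) and dense 3D point clouds. With Object-centric Discriminative Representation (OcDR) and a Geometric Reactivation Decoder (GRD), PLM improves both semantic reasoning and preservation of geometric detail in 3D object segmentation.

Details Motivation: Existing approaches to LLM-based 3D segmentation suffer from representation misalignment: LLMs process high-level semantic tokens, whereas 3D point clouds convey only dense geometric structures. This limits both input and output, requiring heavy pre-alignment and costing fine-grained segmentation accuracy.

Contribution: 1. Proposes the PLM framework, which addresses the misalignment between LLMs and 3D point clouds without large-scale pre-alignment. 2. Introduces OcDR to strengthen object-level semantics and scene relations. 3. Designs the GRD decoder, which combines semantic reasoning with geometric features to improve segmentation accuracy.

Method: 1. OcDR learns object-centric tokens that capture target semantics and scene relations under a hard negative-aware training objective. 2. GRD predicts masks by combining OcDR tokens with the corresponding dense features, preserving geometric detail.

Result: PLM improves 3D referring segmentation by +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer, with consistent gains across 7 benchmarks spanning 4 tasks.

Insight: Combining object-centric reasoning with dense geometric features yields more robust 3D understanding, showing the value of letting semantic and geometric cues work together.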

Abstract: 3D object segmentation with Large Language Models (LLMs) has become a prevailing paradigm due to its broad semantics, task flexibility, and strong generalization. However, this paradigm is hindered by representation misalignment: LLMs process high-level semantic tokens, whereas 3D point clouds convey only dense geometric structures. In prior methods, misalignment limits both input and output. At the input stage, dense point patches require heavy pre-alignment, weakening object-level semantics and confusing similar distractors. At the output stage, predictions depend only on dense features without explicit geometric cues, leading to a loss of fine-grained accuracy. To address these limitations, we present the Point Linguist Model (PLM), a general framework that bridges the representation gap between LLMs and dense 3D point clouds without requiring large-scale pre-alignment between 3D-text or 3D-images. Specifically, we introduce Object-centric Discriminative Representation (OcDR), which learns object-centric tokens that capture target semantics and scene relations under a hard negative-aware training objective. This mitigates the misalignment between LLM tokens and 3D points, enhances resilience to distractors, and facilitates semantic-level reasoning within LLMs. For accurate segmentation, we introduce the Geometric Reactivation Decoder (GRD), which predicts masks by combining OcDR tokens carrying LLM-inferred geometry with corresponding dense features, preserving comprehensive dense features throughout the pipeline. Extensive experiments show that PLM achieves significant improvements of +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer for 3D referring segmentation, with consistent gains across 7 benchmarks spanning 4 different tasks, demonstrating the effectiveness of comprehensive object-centric reasoning for robust 3D understanding.

[69] D-LEAF: Localizing and Correcting Hallucinations in Multimodal LLMs via Layer-to-head Attention Diagnostics

Tiancheng Yang,Lin Zhang,Jiaye Lin,Guimin Hu,Di Wang,Lijie Hu

Main category: cs.CV

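TL;DR: The paper proposes D-LEAF, which uses layer-to-head attention diagnostics to localize and correct hallucinations in multimodal large language models (MLLMs), with clear gains on image captioning and visual question answering.

Details Motivation: MLLMs perform well on tasks such as image captioning and visual question answering but remain prone to hallucinations that conflict with the visual input. Existing attention-based methods apply uniform adjustments across layers and heads and fail to pinpoint where errors originate, so a finer-grained diagnosis and correction method is needed.

Contribution: Introduces two diagnostics, Layer Image Attention Entropy (LIAE) and Image Attention Focus (IAF), and Dynamic Layer-wise Entropy and Attention Fusion (D-LEAF), a method that localizes and corrects hallucination-related errors in MLLMs.

Method: 1. LIAE flags anomalous layers; 2. IAF scores the attention heads within those layers; 3. guided by these signals, D-LEAF dynamically applies corrections during inference.

Result: D-LEAF delivers a 53% relative improvement on standard captioning benchmarks and improves both accuracy and F1-score on VQA by approximately 4%, with negligible overhead.

Insight: Fine-grained attention diagnostics combined with dynamic correction can suppress hallucinations while preserving efficiency.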

Abstract: Multimodal Large Language Models (MLLMs) achieve strong performance on tasks like image captioning and visual question answering, but remain prone to hallucinations, where generated text conflicts with the visual input. Prior work links this partly to insufficient visual attention, but existing attention-based detectors and mitigation typically apply uniform adjustments across layers and heads, obscuring where errors originate. In this paper, we first show these methods fail to accurately localize problematic layers. Then, we introduce two diagnostics: Layer Image Attention Entropy (LIAE) which flags anomalous layers, and Image Attention Focus (IAF) which scores attention heads within those layers. Analysis shows that LIAE pinpoints faulty layers and IAF reliably ranks heads that warrant correction. Guided by these signals, we propose Dynamic Layer-wise Entropy and Attention Fusion (D-LEAF), a task-agnostic, attention-guided method that dynamically localizes and corrects errors during inference with negligible overhead. Results show our D-LEAF delivers a 53% relative improvement on standard captioning benchmarks, and on VQA both accuracy and F1-score improve by approximately 4%, substantially suppressing hallucinations while preserving efficiency.
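
The abstract does not give the formula for LIAE, so the following is only a plausible sketch of an entropy-based layer diagnostic: it computes the Shannon entropy of each head's attention over image tokens, averages per layer, and flags layers whose entropy is a statistical outlier. The function names, the renormalization over image tokens, and the z-score threshold are all assumptions made for illustration.

```python
import math


def image_attention_entropy(attn_row, image_idx):
    """Shannon entropy (nats) of one head's attention, renormalized over image tokens."""
    mass = [attn_row[k] for k in image_idx]
    total = sum(mass)
    if total <= 0.0:
        return 0.0
    probs = [m / total for m in mass]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def layer_entropy(heads, image_idx):
    """Mean image-attention entropy over all heads of one layer."""
    return sum(image_attention_entropy(h, image_idx) for h in heads) / len(heads)


def flag_layers(layers, image_idx, z_thresh=1.5):
    """Indices of layers whose entropy deviates from the mean by more than z_thresh std."""
    ent = [layer_entropy(heads, image_idx) for heads in layers]
    mean = sum(ent) / len(ent)
    std = math.sqrt(sum((e - mean) ** 2 for e in ent) / len(ent))
    if std == 0.0:
        return []
    return [i for i, e in enumerate(ent) if abs(e - mean) / std > z_thresh]
```

Here `layers` is a list of layers, each a list of per-head attention rows for the current query token, and `image_idx` lists the positions of image tokens in the sequence.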

[70] Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning

Daniel DeAlcala,Aythami Morales,Julian Fierrez,Gonzalo Mancera,Ruben Tolosana,Javier Ortega-Garcia

Main category: cs.CV

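TL;DR: Active MINT uses multi-task learning to make models more auditable. It trains the original model jointly with an auxiliary MINT model that detects whether given data were used for training, which substantially improves detection accuracy.

Details Motivation: Current AI models lack transparency about the data used to train them, which can raise privacy, security, and copyright concerns. A method is needed to detect whether given data were used in training, so that models become auditable.

Contribution: Proposes Active MINT, which combines multi-task learning with intermediate activation maps as inputs, achieves over 80% accuracy in detecting training data, and is validated across multiple network architectures and public benchmarks.

Method: The Audited Model and the MINT Model are trained simultaneously through multi-task learning. The MINT layers take intermediate activation maps as input and are trained to detect training data.

Result: Across 5 public benchmarks, Active MINT achieves over 80% detection accuracy, significantly outperforming previous approaches.

Insight: Making auditability an optimization objective during training increases transparency about data use and supports stronger privacy, security, and copyright safeguards in AI deployments.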

Abstract: Active Membership Inference Test (aMINT) is a method designed to detect whether given data were used during the training of machine learning models. In Active MINT, we propose a novel multi-task learning process that involves training two models simultaneously: the original or Audited Model, and a secondary model, referred to as the MINT Model, responsible for identifying the data used for training the Audited Model. This novel multi-task learning approach has been designed to incorporate the auditability of the model as an optimization objective during the training process of neural networks. The proposed approach incorporates intermediate activation maps as inputs to the MINT layers, which are trained to enhance the detection of training data. We present results using a wide range of neural networks, from lighter architectures such as MobileNet to more complex ones such as Vision Transformers, evaluated on 5 public benchmarks. Our proposed Active MINT achieves over 80% accuracy in detecting whether given data were used for training, significantly outperforming previous approaches in the literature. Our aMINT and related methodological developments contribute to increasing transparency in AI models, facilitating stronger safeguards in AI deployments to achieve proper security, privacy, and copyright protection.

[71] Object-level Correlation for Few-Shot Segmentation

Chunlin Wen,Yu Zhang,Jie Fan,Hongyuan Zhu,Xiu-Shen Wei,Yijun Wang,Zhiqiang Kou,Shuzhou Sun

Main category: cs.CV

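TL;DR: The paper proposes an Object-level Correlation Network (OCNet) for few-shot semantic segmentation (FSS). By establishing correlation between the support target object and general objects in the query image, it suppresses background noise and improves segmentation performance.

Details Motivation: Existing methods mainly build image-level correlation, which contains background noise that is hard to suppress and leads to overfitting to the background. Inspired by the biological vision process, the authors build correlation on object-level information to identify novel objects more effectively.

Contribution: 1. Proposes the Object-level Correlation Network (OCNet); 2. designs the General Object Mining Module (GOMM) and the Correlation Construction Module (CCM); 3. achieves state-of-the-art performance on PASCAL-$5^i$ and COCO-$20^i$.

Method: 1. GOMM constructs the query general object feature by learning saliency and high-level similarity cues; 2. CCM establishes object-level correlation by allocating target prototypes to match the general object feature; 3. the object-level correlation is used to mine the query target feature and suppress background noise.

Result: Experiments on PASCAL-$5^i$ and COCO-$20^i$ show that OCNet achieves state-of-the-art performance.

Insight: Object-level correlation suits few-shot segmentation better than image-level correlation because it suppresses background noise more effectively and improves target identification.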

Abstract: Few-shot semantic segmentation (FSS) aims to segment objects of novel categories in the query images given only a few annotated support samples. Existing methods primarily build the image-level correlation between the support target object and the entire query image. However, this correlation contains hard pixel noise, i.e., irrelevant background objects, that is intractable to trace and suppress, leading to overfitting to the background. To address the limitation of this correlation, we imitate the biological vision process to identify novel objects in the object-level information. Target identification in the general objects is more valid than in the entire image, especially in the low-data regime. Inspired by this, we design an Object-level Correlation Network (OCNet) by establishing the object-level correlation between the support target object and query general objects, which is mainly composed of the General Object Mining Module (GOMM) and Correlation Construction Module (CCM). Specifically, GOMM constructs the query general object feature by learning saliency and high-level similarity cues, where the general objects include the irrelevant background objects and the target foreground object. Then, CCM establishes the object-level correlation by allocating the target prototypes to match the general object feature. The generated object-level correlation can mine the query target feature and suppress the hard pixel noise for the final prediction. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ show that our model achieves state-of-the-art performance.

[72] Multimodal Contrastive Pretraining of CBCT and IOS for Enhanced Tooth Segmentation

Moo Hyun Son,Juyoung Bae,Zelin Qiu,Jiale Peng,Kai Xin Li,Yifan Lin,Hao Chen

Main category: cs.CV

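TL;DR: The paper proposes ToothMCL, a multimodal contrastive learning framework that pretrains on paired CBCT and IOS data for tooth segmentation and yields substantial performance gains.

Details Motivation: Digital dentistry depends on accurate tooth segmentation, but existing methods frequently lack rigorous validation and show limited performance and clinical applicability. The paper introduces what the authors describe as the first multimodal pretraining framework for this task.

Contribution: 1. Proposes ToothMCL, which combines CBCT and IOS modalities through contrastive learning; 2. curates CBCT-IOS3.8K, the largest paired CBCT and IOS dataset to date, comprising 3,867 patients; 3. validates the method and its generalizability on multiple independent datasets.

Method: Multimodal contrastive learning captures modality-invariant representations from volumetric (CBCT) and surface-based (IOS) data, enabling precise multi-class segmentation and identification of Fédération Dentaire Internationale (FDI) tooth numbering.

Result: ToothMCL achieves state-of-the-art performance in both internal and external testing, increasing the Dice Similarity Coefficient (DSC) by 12% for CBCT segmentation and 8% for IOS segmentation, and remains robust across imaging conditions and clinical scenarios.

Insight: Multimodal contrastive pretraining integrates complementary modalities effectively and improves segmentation, particularly across varied clinical scenarios.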

Abstract: Digital dentistry represents a transformative shift in modern dental practice. The foundational step in this transformation is the accurate digital representation of the patient’s dentition, which is obtained from segmented Cone-Beam Computed Tomography (CBCT) and Intraoral Scans (IOS). Despite the growing interest in digital dental technologies, existing segmentation methodologies frequently lack rigorous validation and demonstrate limited performance and clinical applicability. To the best of our knowledge, this is the first work to introduce a multimodal pretraining framework for tooth segmentation. We present ToothMCL, a Tooth Multimodal Contrastive Learning framework for pretraining that integrates volumetric (CBCT) and surface-based (IOS) modalities. By capturing modality-invariant representations through multimodal contrastive learning, our approach effectively models fine-grained anatomical features, enabling precise multi-class segmentation and accurate identification of Fédération Dentaire Internationale (FDI) tooth numbering. Along with the framework, we curated CBCT-IOS3.8K, the largest paired CBCT and IOS dataset to date, comprising 3,867 patients. We then evaluated ToothMCL on a comprehensive collection of independent datasets, representing the largest and most diverse evaluation to date. Our method achieves state-of-the-art performance in both internal and external testing, with an increase of 12% for CBCT segmentation and 8% for IOS segmentation in the Dice Similarity Coefficient (DSC). Furthermore, ToothMCL consistently surpasses existing approaches in tooth groups and demonstrates robust generalizability across varying imaging conditions and clinical scenarios.
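
For readers unfamiliar with the reported metric, the sketch below computes the Dice Similarity Coefficient per tooth label on flattened label maps. It is a generic illustration: the function names, the convention for labels absent from both masks, and the toy labels (FDI numbers 11 and 12) are assumptions, not the paper's evaluation code.

```python
def dice_per_label(pred, target, label):
    """Dice = 2 * |P and T| / (|P| + |T|) for one class over flat label lists."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    p = sum(1 for v in pred if v == label)
    t = sum(1 for v in target if v == label)
    inter = sum(1 for a, b in zip(pred, target) if a == label and b == label)
    if p + t == 0:
        return 1.0  # assumed convention: label absent from both masks is a perfect match
    return 2.0 * inter / (p + t)


def mean_dice(pred, target, labels):
    """Average Dice over the given class labels (e.g. FDI tooth numbers)."""
    return sum(dice_per_label(pred, target, lab) for lab in labels) / len(labels)
```

For example, `mean_dice(pred, target, [11, 12])` averages the overlap scores for the two upper-right incisors while ignoring background voxels labeled 0.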

[73] Dynamic Scene 3D Reconstruction of an Uncooperative Resident Space Object

Bala Prenith Reddy Gopu,Timothy Jacob Huber,George M. Nehma,Patrick Quinn,Madhur Tiwari,Matt Ueckermann,David Hinckley,Christopher McKenna

Main category: cs.CV

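TL;DR: The paper studies 3D reconstruction of uncooperative Resident Space Objects (RSOs) in dynamic scenes, evaluates existing algorithms, and builds a simulation-based evaluation setup.

Details Motivation: On-Orbit Servicing (OOS) and Active Debris Removal (ADR) missions require accurate characterization of the geometry and motion of uncooperative space objects. Reconstructing tumbling targets in 3D is a key challenge.

Contribution: 1. Evaluates state-of-the-art 3D reconstruction algorithms for dynamic scenes; 2. develops an Isaac Sim simulation environment that generates physics-accurate 2D image sequences; 3. demonstrates high-quality static-scene reconstruction with Neuralangelo.

Method: 1. Generate 2D image sequences of a tumbling satellite under realistic orbital lighting in Isaac Sim; 2. reconstruct 3D models with algorithms such as Neuralangelo; 3. compare the reconstructed meshes against the original CAD models using Cloud Compare (CC).

Result: For static scenes, the reconstructed 3D meshes closely match the original CAD models with minimal errors and artifacts and capture the fine details needed for mission planning.

Insight: The simulation environment and the static-scene results provide a baseline for the ongoing evaluation of dynamic scene reconstruction and indicate that Neuralangelo is promising for this task.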

Abstract: Characterization of uncooperative Resident Space Objects (RSO) plays a crucial role in On-Orbit Servicing (OOS) and Active Debris Removal (ADR) missions to assess the geometry and motion properties. To address the challenges of reconstructing tumbling uncooperative targets, this study evaluates the performance of existing state-of-the-art 3D reconstruction algorithms for dynamic scenes, focusing on their ability to generate geometrically accurate, high-fidelity models. To support our evaluation, we developed a simulation environment using Isaac Sim to generate physics-accurate 2D image sequences of a tumbling satellite under realistic orbital lighting conditions. Our preliminary results on static scenes using Neuralangelo demonstrate promising reconstruction quality. The generated 3D meshes closely match the original CAD models with minimal errors and artifacts when compared using Cloud Compare (CC). The reconstructed models were able to capture critical fine details for mission planning. This provides a baseline for our ongoing evaluation of dynamic scene reconstruction.

[74] Feature Space Analysis by Guided Diffusion Model

Kimiaki Shirahama,Miki Yanobu,Kaduki Yamashita,Miho Ohsaki

Main category: cs.CV

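TL;DR: The paper proposes a decoder based on a guided diffusion model for analysing the feature space of a DNN. It generates images whose features lie close to a user-specified feature, revealing which image attributes the DNN encodes.

Details Motivation: The internal feature extraction process of DNNs is usually treated as a black box. The paper aims to expose what is encoded in a DNN's feature space by generating images that match a chosen feature.

Contribution: Proposes a guided-diffusion decoder that needs no additional training, generates images whose features closely match a user-specified feature, can analyse the feature spaces of different DNNs, and runs on a single COTS GPU.

Method: A pre-trained diffusion model is guided during reverse image generation to minimise the Euclidean distance between the feature of the clean image estimated at each step and the user-specified feature.

Result: Experiments on CLIP's image encoder, ResNet-50, and a vision transformer show that generated images have features remarkably similar to the specified ones and reveal valuable insights into these DNNs' feature spaces.

Insight: The method offers a new tool for understanding DNN feature spaces and shows the potential of diffusion models for feature interpretation, with modest hardware requirements.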

Abstract: One of the key issues in Deep Neural Networks (DNNs) is the black-box nature of their internal feature extraction process. Targeting vision-related domains, this paper focuses on analysing the feature space of a DNN by proposing a decoder that can generate images whose features are guaranteed to closely match a user-specified feature. Owing to this guarantee that is missed in past studies, our decoder allows us to evidence which of various attributes in an image are encoded into a feature by the DNN, by generating images whose features are in proximity to that feature. Our decoder is implemented as a guided diffusion model that guides the reverse image generation of a pre-trained diffusion model to minimise the Euclidean distance between the feature of a clean image estimated at each step and the user-specified feature. One practical advantage of our decoder is that it can analyse feature spaces of different DNNs with no additional training and run on a single COTS GPU. The experimental results targeting CLIP’s image encoder, ResNet-50 and vision transformer demonstrate that images generated by our decoder have features remarkably similar to the user-specified ones and reveal valuable insights into these DNNs’ feature spaces.
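
The guidance idea, nudging the current image estimate so that its feature moves toward a target feature, can be sketched with a toy gradient step. In the sketch below a linear map `W` stands in for the DNN feature extractor, which is a strong simplifying assumption; the function names and the step size are hypothetical, and the actual decoder applies this kind of update inside the reverse process of a pre-trained diffusion model.

```python
def features(W, x):
    """Toy feature extractor: a linear map standing in for a DNN encoder."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def guidance_step(W, x, target, scale):
    """One gradient step on ||W x - target||^2 with respect to x.

    The gradient is 2 * W^T (W x - target); `scale` plays the role of a
    guidance strength (an assumed hyperparameter).
    """
    residual = [f - t for f, t in zip(features(W, x), target)]
    grad = [2.0 * sum(W[k][j] * residual[k] for k in range(len(W)))
            for j in range(len(x))]
    return [xj - scale * g for xj, g in zip(x, grad)]
```

Repeating the step drives the feature distance toward zero in this toy setting; with a real nonlinear encoder the gradient would come from automatic differentiation instead of the closed form used here.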

[75] Visual Representation Alignment for Multimodal Large Language Models

Heeji Yoon,Jaewoo Jung,Junwan Kim,Hyungyu Choi,Heeseong Shin,Sangbeom Lim,Honggyu An,Chaehyun Kim,Jisang Han,Donghyun Kim,Chanho Eom,Sunghwan Hong,Seungryong Kim

Main category: cs.CV

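TL;DR: The paper proposes VIRAL (VIsual Representation ALignment), which aligns the internal visual representations of multimodal large language models (MLLMs) with those of pre-trained vision foundation models (VFMs) to address MLLM weaknesses on vision-centric tasks.

Details Motivation: Existing MLLMs remain limited on vision-centric tasks such as object counting and spatial reasoning. The authors attribute this to text-only supervision, which guides the visual pathway only indirectly and leads the model to discard fine-grained visual details.

Contribution: Proposes VIRAL, a visual representation alignment method that helps MLLMs retain visual details and reason over complex visual inputs.

Method: As a regularization strategy, the internal visual representations of the MLLM are aligned with the representations of pre-trained vision foundation models.

Result: Experiments show consistent improvements across all tasks on widely adopted multimodal benchmarks.

Insight: Explicitly aligning visual representations compensates for MLLM deficiencies on visual tasks and points to a direction for integrating visual information more effectively during training.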

Abstract: Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
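
The abstract describes VIRAL as a regularizer that aligns MLLM visual representations with VFM features but does not state the loss. One common way to express such an alignment term is a mean cosine distance added to the task loss, sketched below. The loss form, the `weight` hyperparameter, and the omission of a learned projection between the two feature spaces are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(mllm_feats, vfm_feats):
    """Mean (1 - cosine similarity) over paired per-patch feature vectors."""
    pairs = list(zip(mllm_feats, vfm_feats))
    return sum(1.0 - cosine(h, v) for h, v in pairs) / len(pairs)


def total_loss(task_loss, mllm_feats, vfm_feats, weight=0.5):
    """Task loss plus the alignment regularizer (weight is an assumed hyperparameter)."""
    return task_loss + weight * alignment_loss(mllm_feats, vfm_feats)
```

The regularizer is zero when every MLLM patch feature points in the same direction as its VFM counterpart and grows as the two representations diverge.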

cs.RO [Back]

[76] DepthVision: Robust Vision-Language Understanding through GAN-Based LiDAR-to-RGB Synthesis

Sven Kirchner,Nils Purschke,Ross Greer,Alois C. Knoll

Main category: cs.RO

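TL;DR: DepthVision is a multimodal scene understanding framework that synthesizes RGB images from sparse LiDAR point clouds with a GAN and blends them with real RGB data according to lighting conditions, addressing degraded visual input.

Details Motivation: Ensuring reliable robot operation when visual input is degraded or insufficient is a central challenge in robotics. DepthVision compensates for sensor degradation with a multimodal approach.

Contribution: Proposes the DepthVision framework, which generates RGB images from LiDAR point clouds and fuses them with real RGB data through Luminance-Aware Modality Adaptation (LAMA), without fine-tuning downstream vision-language models.

Method: A conditional GAN with an integrated refiner network synthesizes RGB images from LiDAR point clouds. LAMA then blends real and synthetic data dynamically based on ambient lighting conditions.

Result: Performance improves in low-light conditions, with substantial gains over RGB-only baselines, while remaining compatible with frozen vision-language models.

Insight: LiDAR-guided RGB synthesis is a promising route to robust robot operation in real-world environments.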

Abstract: Ensuring reliable robot operation when visual input is degraded or insufficient remains a central challenge in robotics. This letter introduces DepthVision, a framework for multimodal scene understanding designed to address this problem. Unlike existing Vision-Language Models (VLMs), which use only camera-based visual input alongside language, DepthVision synthesizes RGB images from sparse LiDAR point clouds using a conditional generative adversarial network (GAN) with an integrated refiner network. These synthetic views are then combined with real RGB data using a Luminance-Aware Modality Adaptation (LAMA), which blends the two types of data dynamically based on ambient lighting conditions. This approach compensates for sensor degradation, such as darkness or motion blur, without requiring any fine-tuning of downstream vision-language models. We evaluate DepthVision on real and simulated datasets across various models and tasks, with particular attention to safety-critical tasks. The results demonstrate that our approach improves performance in low-light conditions, achieving substantial gains over RGB-only baselines while preserving compatibility with frozen VLMs. This work highlights the potential of LiDAR-guided RGB synthesis for achieving robust robot operation in real-world environments.
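
The abstract says LAMA blends real and synthetic views based on ambient lighting but gives no formula. The sketch below shows one simple way such a blend could work: compute the mean luma of the camera image and interpolate linearly between the synthetic and real images. The thresholds `dark` and `bright`, the linear ramp, and all function names are hypothetical; only the Rec. 601 luma weights are standard.

```python
def mean_luminance(image):
    """Mean Rec. 601 luma of an image given as rows of (r, g, b) values in [0, 1]."""
    total = 0.0
    count = 0
    for row in image:
        for r, g, b in row:
            total += 0.299 * r + 0.587 * g + 0.114 * b
            count += 1
    return total / count


def blend_weight(lum, dark=0.1, bright=0.4):
    """Weight on the real image: 0 in darkness, 1 in good light, linear in between."""
    return min(1.0, max(0.0, (lum - dark) / (bright - dark)))


def blend(real, synthetic, dark=0.1, bright=0.4):
    """Per-pixel convex combination of the real and LiDAR-synthesized images."""
    a = blend_weight(mean_luminance(real), dark, bright)
    return [[tuple(a * rc + (1.0 - a) * sc for rc, sc in zip(rp, sp))
             for rp, sp in zip(rrow, srow)]
            for rrow, srow in zip(real, synthetic)]
```

With these assumed thresholds, a nearly black camera frame is replaced entirely by the synthetic view, while a well-lit frame passes through unchanged, so a frozen downstream model always receives a usable RGB image.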

[77] Can SSD-Mamba2 Unlock Reinforcement Learning for End-to-End Motion Control?

Gavin Tao,Yinuo Wang,Jinzhao Zhou

Main category: cs.RO

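TL;DR: The paper proposes a vision-driven cross-modal reinforcement learning framework built on SSD-Mamba2 for end-to-end motion control, addressing the unfavorable compute-memory trade-offs and weak long-horizon dependency modeling of existing approaches.

Details Motivation: End-to-end RL promises unified perception-action policies for motion control, yet most deployed controllers are either blind (proprioception-only) or rely on inefficient fusion backbones. Recurrent controllers struggle with long-horizon credit assignment, and Transformer-based fusion incurs quadratic cost in token length.

Contribution: A cross-modal RL framework built on SSD-Mamba2, a selective state-space backbone that applies state-space duality (SSD) to model long-range dependencies efficiently with hardware-aware streaming and near-linear scaling.

Method: Proprioceptive states and exteroceptive observations (e.g., depth tokens) are encoded into compact tokens and fused by stacked SSD-Mamba2 layers, whose selective state-space updates retain long-range dependencies at lower latency and memory cost. Policies are trained end-to-end under curricula that randomize terrain and appearance, with a compact, state-centric reward.

Result: Across diverse motion-control scenarios, the approach surpasses strong baselines in return, safety (collisions and falls), and sample efficiency, while converging faster at the same compute budget.

Insight: SSD-Mamba2 is a practical fusion backbone for scalable, foresightful, and efficient motion control, and its selective state-space updates need markedly less latency and memory than quadratic self-attention.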

Abstract: End-to-end reinforcement learning for motion control promises unified perception-action policies that scale across embodiments and tasks, yet most deployed controllers are either blind (proprioception-only) or rely on fusion backbones with unfavorable compute-memory trade-offs. Recurrent controllers struggle with long-horizon credit assignment, and Transformer-based fusion incurs quadratic cost in token length, limiting temporal and spatial context. We present a vision-driven cross-modal RL framework built on SSD-Mamba2, a selective state-space backbone that applies state-space duality (SSD) to enable both recurrent and convolutional scanning with hardware-aware streaming and near-linear scaling. Proprioceptive states and exteroceptive observations (e.g., depth tokens) are encoded into compact tokens and fused by stacked SSD-Mamba2 layers. The selective state-space updates retain long-range dependencies with markedly lower latency and memory use than quadratic self-attention, enabling longer look-ahead, higher token resolution, and stable training under limited compute. Policies are trained end-to-end under curricula that randomize terrain and appearance and progressively increase scene complexity. A compact, state-centric reward balances task progress, energy efficiency, and safety. Across diverse motion-control scenarios, our approach consistently surpasses strong state-of-the-art baselines in return, safety (collisions and falls), and sample efficiency, while converging faster at the same compute budget. These results suggest that SSD-Mamba2 provides a practical fusion backbone for scalable, foresightful, and efficient end-to-end motion control.

cs.SD [Back]

[78] Adversarial Attacks on Audio Deepfake Detection: A Benchmark and Comparative Study

Kutub Uddin,Muhammad Umar Farooq,Awais Khan,Khalid Mahmood Malik

Main category: cs.SD

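TL;DR: The paper studies adversarial attacks on audio deepfake detection (ADD) methods, compares a range of attack techniques on five deepfake benchmark datasets, exposes vulnerabilities of ADD methods, and outlines directions for more robust detectors.

Details Motivation: Highly realistic deepfake audio produced by generative AI threatens voice biometric applications, and existing ADD methods are severely undermined by anti-forensic (AF) attacks that conceal generative signatures.

Contribution: 1. A comprehensive evaluation of state-of-the-art (SoTA) ADD methods under adversarial attack; 2. a comparative analysis of raw-audio and spectrogram-based approaches; 3. guidance for designing robust detectors and adaptive defense strategies.

Method: Evaluates the robustness of ADD methods on five deepfake benchmark datasets against two families of AF techniques: statistical modifications (e.g., pitch shifting, filtering, noise addition, quantization) and optimization-based attacks (e.g., FGSM, PGD, C&W, DeepFool).

Result: The evaluation exposes vulnerabilities of existing ADD methods under adversarial conditions across both raw-audio and spectrogram-based approaches.

Insight: ADD methods need further improvement to withstand diverse AF attacks, and adaptive defense strategies are a key direction for future research.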

Abstract: The widespread use of generative AI has shown remarkable success in producing highly realistic deepfakes, posing a serious threat to various voice biometric applications, including speaker verification, voice biometrics, audio conferencing, and criminal investigations. To counteract this, several state-of-the-art (SoTA) audio deepfake detection (ADD) methods have been proposed to identify generative AI signatures to distinguish between real and deepfake audio. However, the effectiveness of these methods is severely undermined by anti-forensic (AF) attacks that conceal generative signatures. These AF attacks span a wide range of techniques, including statistical modifications (e.g., pitch shifting, filtering, noise addition, and quantization) and optimization-based attacks (e.g., FGSM, PGD, C&W, and DeepFool). In this paper, we investigate the SoTA ADD methods and provide a comparative analysis to highlight their effectiveness in exposing deepfake signatures, as well as their vulnerabilities under adversarial conditions. We conducted an extensive evaluation of ADD methods on five deepfake benchmark datasets using two categories: raw and spectrogram-based approaches. This comparative analysis enables a deeper understanding of the strengths and limitations of SoTA ADD methods against diverse AF attacks. It not only highlights vulnerabilities of ADD methods but also informs the design of more robust and generalized detectors for real-world voice biometrics. It will further guide future research in developing adaptive defense strategies that can effectively counter evolving AF techniques.
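
Of the optimization-based attacks listed, FGSM is the simplest: it perturbs the input by a small step in the direction of the sign of the loss gradient. The sketch below applies it to a toy logistic-regression "detector" over a feature vector. The detector, its weights, and the feature representation are stand-ins chosen for illustration; the paper's actual ADD models and attack settings are not reproduced here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Toy detector: probability that feature vector x is a deepfake (label 1)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm(x, y, w, b, eps):
    """Fast Gradient Sign Method for the logistic model above.

    For binary cross-entropy, d(loss)/dx = (p - y) * w, so the attack moves
    each feature by eps in the direction that increases the loss.
    """
    p = predict(x, w, b)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]
```

In this toy setting, attacking a sample with true label 1 (fake) lowers the detector's fake probability, which mirrors how an anti-forensic perturbation hides generative signatures from a detector.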

cs.HC [Back]

[79] Enhancing Online Learning by Integrating Biosensors and Multimodal Learning Analytics for Detecting and Predicting Student Behavior: A Review

Alvaro Becerra,Ruth Cobos,Charles Lang

Main category: cs.HC

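TL;DR: This systematic review examines how biosensors and Multimodal Learning Analytics (MmLA) can detect and predict student behavior in online learning, emphasizing the deeper insight that physiological signals give into cognitive states and engagement.

Details Motivation: Understanding and predicting student behavior in online learning is crucial for enhancing engagement and optimizing educational outcomes. Combining biosensors with multimodal data offers a new way to approach the problem.

Contribution: Synthesizes findings from 54 key studies, covering analyses that combine physiological signals (such as heart rate, brain activity, and eye-tracking) with traditional interaction data and self-reports, and identifies the potential of multimodal data for personalized learning and real-time feedback.

Method: Analyzes commonly used methodologies, including advanced machine learning algorithms and multimodal data pre-processing, and examines key challenges in emotion and attention detection, behavioral analysis, experimental design, and demographic considerations in data collection.

Result: The findings suggest that integrating multimodal data can facilitate personalized learning experiences, real-time feedback, and intelligent educational interventions, advancing toward more adaptive online learning.

Insight: Biosensor-driven adaptive learning systems have transformative potential, but current research still faces limitations, notably in data collection and experimental design.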

Abstract: In modern online learning, understanding and predicting student behavior is crucial for enhancing engagement and optimizing educational outcomes. This systematic review explores the integration of biosensors and Multimodal Learning Analytics (MmLA) to analyze and predict student behavior during computer-based learning sessions. We examine key challenges, including emotion and attention detection, behavioral analysis, experimental design, and demographic considerations in data collection. Our study highlights the growing role of physiological signals, such as heart rate, brain activity, and eye-tracking, combined with traditional interaction data and self-reports to gain deeper insights into cognitive states and engagement levels. We synthesize findings from 54 key studies, analyzing commonly used methodologies such as advanced machine learning algorithms and multimodal data pre-processing techniques. The review identifies current research trends, limitations, and emerging directions in the field, emphasizing the transformative potential of biosensor-driven adaptive learning systems. Our findings suggest that integrating multimodal data can facilitate personalized learning experiences, real-time feedback, and intelligent educational interventions, ultimately advancing toward a more customized and adaptive online learning experience.

cs.LG [Back]

[80] Measuring Uncertainty in Transformer Circuits with Effective Information Consistency

Anatoly A. Krasnovsky

Main category: cs.LG

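TL;DR: The paper proposes the Effective-Information Consistency Score (EICS), a metric for quantifying whether an active Transformer Circuit (TC) is behaving coherently and is therefore likely trustworthy. EICS combines a sheaf inconsistency term computed from local Jacobians and activations with a circuit-level causal emergence measure, and is checked on a toy example.

Details Motivation: Research on Transformer Circuits lacks a formal, single-pass way to quantify the coherence and trustworthiness of a circuit's behavior. The paper aims to fill this gap with a white-box method.

Contribution: 1. Proposes EICS, which combines local inconsistency with a circuit-level causal emergence measure; 2. provides a white-box, single-pass evaluation method with explicit units so that the score is dimensionless; 3. gives practical guidance on score interpretation and computational overhead (fast and exact modes).

Method: 1. Compute a normalized sheaf inconsistency from local Jacobians and activations; 2. derive a Gaussian EI proxy for circuit-level causal emergence from the same forward state; 3. combine the two into EICS.

Result: The paper provides a toy sanity-check analysis. Empirical validation on LLM tasks is deferred.

Insight: EICS offers a new tool for assessing Transformer Circuit behavior and may help improve model transparency and trustworthiness once validated empirically.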

Abstract: Mechanistic interpretability has identified functional subgraphs within large language models (LLMs), known as Transformer Circuits (TCs), that appear to implement specific algorithms. Yet we lack a formal, single-pass way to quantify when an active circuit is behaving coherently and thus likely trustworthy. Building on prior systems-theoretic proposals, we specialize a sheaf/cohomology and causal emergence perspective to TCs and introduce the Effective-Information Consistency Score (EICS). EICS combines (i) a normalized sheaf inconsistency computed from local Jacobians and activations, with (ii) a Gaussian EI proxy for circuit-level causal emergence derived from the same forward state. The construction is white-box, single-pass, and makes units explicit so that the score is dimensionless. We further provide practical guidance on score interpretation, computational overhead (with fast and exact modes), and a toy sanity-check analysis. Empirical validation on LLM tasks is deferred.

[81] Benchmarking Vision Transformers and CNNs for Thermal Photovoltaic Fault Detection with Explainable AI Validation

Serra Aksoy

Main category: cs.LG

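TL;DR: This paper systematically compares CNNs and vision transformers for thermal photovoltaic (PV) fault detection and uses XRAI saliency analysis to check whether model decisions align with thermal physics principles. Swin Transformer performs best.

Details Motivation: Deep learning achieves high accuracy in thermal PV fault detection, but there is little validation that model decisions align with thermal physics. This lack of transparency hinders the adoption of AI in energy infrastructure.

Contribution: Presents what the authors describe as the first systematic comparison of CNNs and vision transformers for thermal PV fault detection, with interpretability validated against thermal physics principles.

Method: ResNet-18, EfficientNet-B0, ViT-Tiny, and Swin-Tiny are evaluated on 20,000 infrared images, and XRAI saliency analysis is used to validate model decisions.

Result: Swin Transformer achieves the highest performance (94% binary accuracy; 73% multiclass accuracy), and the features the models learn are consistent with thermal physics principles.

Insight: Thermal physics-guided interpretability offers a new approach to validating AI in energy monitoring, but detecting environmental factors such as soiling remains challenging.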

Abstract: Artificial intelligence deployment for automated photovoltaic (PV) monitoring faces interpretability barriers that limit adoption in energy infrastructure applications. While deep learning achieves high accuracy in thermal fault detection, validation that model decisions align with thermal physics principles remains lacking, creating deployment hesitancy where understanding model reasoning is critical. This study provides a systematic comparison of convolutional neural networks (ResNet-18, EfficientNet-B0) and vision transformers (ViT-Tiny, Swin-Tiny) for thermal PV fault detection, using XRAI saliency analysis to assess alignment with thermal physics principles. This represents the first systematic comparison of CNNs and vision transformers for thermal PV fault detection with physics-validated interpretability. Evaluation on 20,000 infrared images spanning normal operation and 11 fault categories shows that Swin Transformer achieves the highest performance (94% binary accuracy; 73% multiclass accuracy) compared to CNN approaches. XRAI analysis reveals that models learn physically meaningful features, such as localized hotspots for cell defects, linear thermal paths for diode failures, and thermal boundaries for vegetation shading, consistent with expected thermal signatures. However, performance varies significantly across fault types: electrical faults achieve strong detection (F1-scores >0.90) while environmental factors like soiling remain challenging (F1-scores 0.20-0.33), indicating limitations imposed by thermal imaging resolution. The thermal physics-guided interpretability approach provides methodology for validating AI decision-making in energy monitoring applications, addressing deployment barriers in renewable energy infrastructure.

[82] GCond: Gradient Conflict Resolution via Accumulation-based Stabilization for Large-Scale Multi-Task Learning

Evgeny Alves Limarenko,Anastasiia Alexandrovna Studenikina

Main category: cs.LG

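TL;DR: GCond is an efficient method for resolving gradient conflicts in multi-task learning. It combines gradient accumulation with an adaptive arbitration mechanism, improving computational efficiency while maintaining optimization quality.

Details Motivation: Gradient conflict is a significant problem in multi-task learning, and existing methods such as PCGrad are computationally expensive, which limits their use in modern large models.

Contribution: Proposes GCond, which combines PCGrad principles with gradient accumulation to improve computational efficiency while preserving optimization quality, and which applies to models of different sizes.

Method: GCond resolves gradient conflicts through gradient accumulation and an adaptive arbitration mechanism, and also offers a stochastic mode to speed up computation.

Result: On MobileNetV3-Small and ConvNeXt architectures, the stochastic mode of GCond achieved a two-fold computational speedup while maintaining optimization quality, and GCond achieved lower L1 and SSIM losses than the other methods on both datasets.

Insight: GCond scales well, applies to both compact and large models, and is compatible with several modern optimizers, offering an efficient solution to gradient conflicts.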

Abstract: In multi-task learning (MTL), gradient conflict poses a significant challenge. Effective methods for addressing this problem, including PCGrad, CAGrad, and GradNorm, in their original implementations are computationally demanding, which significantly limits their application in modern large models and transformers. We propose Gradient Conductor (GCond), a method that builds upon PCGrad principles by combining them with gradient accumulation and an adaptive arbitration mechanism. We evaluated GCond on self-supervised learning tasks using MobileNetV3-Small and ConvNeXt architectures on the ImageNet 1K dataset and a combined head and neck CT scan dataset, comparing the proposed method against baseline linear combinations and state-of-the-art gradient conflict resolution methods. The stochastic mode of GCond achieved a two-fold computational speedup while maintaining optimization quality, and demonstrated superior performance across all evaluated metrics, achieving lower L1 and SSIM losses compared to other methods on both datasets. GCond exhibited high scalability, being successfully applied to both compact models (MobileNetV3-Small) and large architectures (ConvNeXt-tiny and ConvNeXt-Base). It also showed compatibility with modern optimizers such as AdamW and Lion/LARS. Therefore, GCond offers a scalable and efficient solution to the problem of gradient conflicts in multi-task learning.
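The abstract describes GCond as building on PCGrad principles combined with gradient accumulation. The sketch below illustrates only that combination in plain Python: per-task gradients are averaged over micro-batches, then each task gradient has its conflicting component projected out before the results are summed. It is not the authors' implementation. The adaptive arbitration mechanism and the stochastic mode are not modeled, the fixed projection order is a simplification (PCGrad shuffles the order), and the function names `accumulate`, `project_conflicts`, and `combined_update` are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def accumulate(micro_batch_grads):
    """Average one task's gradients over its micro-batches."""
    n = len(micro_batch_grads)
    return [sum(col) / n for col in zip(*micro_batch_grads)]


def project_conflicts(task_grads):
    """PCGrad-style step: for each task gradient g, remove the component
    along any other task gradient whose dot product with g is negative."""
    projected = []
    for i, g in enumerate(task_grads):
        g = list(g)
        for j, other in enumerate(task_grads):
            if i == j:
                continue
            d = dot(g, other)
            if d < 0:  # conflict: the gradients point in opposing directions
                scale = d / dot(other, other)
                g = [x - scale * y for x, y in zip(g, other)]
        projected.append(g)
    return projected


def combined_update(per_task_micro_grads):
    """Accumulate per task, resolve conflicts, then sum into one update."""
    accumulated = [accumulate(m) for m in per_task_micro_grads]
    projected = project_conflicts(accumulated)
    return [sum(col) for col in zip(*projected)]


# Two tasks, two micro-batches each, with conflicting directions.
task_a = [[1.0, 0.0], [1.0, 0.0]]
task_b = [[-1.0, 1.0], [-1.0, 1.0]]
update = combined_update([task_a, task_b])
```

Resolving conflicts once per accumulated gradient, rather than once per micro-batch, is one plausible reading of how accumulation reduces the cost of projection; the abstract does not spell out the exact schedule.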

[83] EfficientNet in Digital Twin-based Cardiac Arrest Prediction and Analysis

Qasim Zia,Avais Jan,Zafar Iqbal,Muhammad Mumtaz Ali,Mukarram Ali,Murray Patterson

Main category: cs.LG

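TL;DR: The paper proposes a new framework that combines EfficientNet with digital twin technology for early prediction and analysis of cardiac arrest, pairing feature learning on cardiovascular images with individualized patient modeling.

Details Motivation: Cardiac arrest is one of the biggest global health problems, and early identification and management are key to improving patient prognosis. Conventional methods struggle to deliver personalized, continuous assessment, which motivates combining deep learning with digital twin technology.

Contribution: Proposes a novel framework that combines an EfficientNet-based model with a digital twin system for early prediction and analysis of cardiac arrest.

Method: EfficientNet with compound scaling learns features from cardiovascular images, while the digital twin builds an individualized model of the patient's cardiovascular system from IoT device data.

Result: According to the authors' experiments, the system is both highly accurate in its predictions and efficient.

Insight: Combining deep learning with digital twin technology opens a path toward active, individualized prediction of cardiac disease.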

Abstract: Cardiac arrest is one of the biggest global health problems, and early identification and management are key to enhancing the patient’s prognosis. In this paper, we propose a novel framework that combines an EfficientNet-based deep learning model with a digital twin system to improve the early detection and analysis of cardiac arrest. We use compound scaling and EfficientNet to learn the features of cardiovascular images. In parallel, the digital twin creates a realistic and individualized cardiovascular system model of the patient based on data received from the Internet of Things (IoT) devices attached to the patient, which can help in the constant assessment of the patient and the impact of possible treatment plans. As shown by our experiments, the proposed system is highly accurate in its prediction abilities and, at the same time, efficient. Combining highly advanced techniques such as deep learning and digital twin (DT) technology presents the possibility of using an active and individual approach to predicting cardiac disease.

cs.AI

[84] From Eigenmodes to Proofs: Integrating Graph Spectral Operators with Symbolic Interpretable Reasoning

Andrew Kiruluta,Priscilla Burity

Main category: cs.AI

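TL;DR: Spectral NSR is a neuro-symbolic reasoning framework that combines the strengths of graph signal processing and symbolic reasoning, using the Laplacian eigenstructure of knowledge graphs to achieve both strong performance and interpretability.

Details Motivation: To unify the interpretability of symbolic reasoning with the adaptability and scalability of spectral learning in a single reasoning framework that improves both performance and transparency.

Contribution: Proposes the Spectral NSR framework, which embeds logical rules as spectral templates and performs inference directly in the graph spectral domain, and adds a number of extensions (such as dynamic graph learning and uncertainty quantification).

Method: Uses graph signal processing to design frequency-selective filters grounded in the graph Laplacian eigenstructure, combined with extensions such as a mixture of spectral experts and spectral curriculum learning.

Result: On benchmarks such as ProofWriter and CLUTRR, Spectral NSR outperforms baselines (including transformers and neuro-symbolic logic programming systems) in accuracy, inference speed, robustness, and interpretability.

Insight: Reasoning in the graph spectral domain is both efficient and interpretable, and co-spectral alignment enables effective transfer across domains, providing a transparent and robust foundation for next-generation reasoning systems.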

Abstract: We introduce Spectral NSR, a fully spectral neuro-symbolic reasoning framework that embeds logical rules as spectral templates and performs inference directly in the graph spectral domain. By leveraging graph signal processing (GSP) and frequency-selective filters grounded in the Laplacian eigenstructure of knowledge graphs, the architecture unifies the interpretability of symbolic reasoning with the scalability and adaptability of spectral learning. Beyond the core formulation, we incorporate a comprehensive set of extensions, including dynamic graph and basis learning, rational and diffusion filters for sharper spectral selectivity, mixture-of-spectral-experts for modular specialization, proof-guided training with spectral curricula, and uncertainty quantification for calibrated confidence. Additional enhancements such as large language model coupling, co-spectral transfer alignment, adversarial robustness, efficient GPU kernels, generalized Laplacians, and causal interventions further expand the versatility of the framework. Empirical evaluation on state-of-the-art reasoning benchmarks such as ProofWriter and CLUTRR demonstrates that Spectral NSR achieves superior accuracy, faster inference, improved robustness to adversarial perturbations, and higher interpretability compared to leading baselines including transformers, message-passing neural networks, and neuro-symbolic logic programming systems. Spectral attribution and proof-band agreement analyses confirm that model decisions align closely with symbolic proof structures, while transfer experiments validate effective domain adaptation through co-spectral alignment. These results establish Spectral NSR as a scalable and principled foundation for the next generation of reasoning systems, offering transparency, robustness, and generalization beyond conventional approaches.

[85] Instruction Agent: Enhancing Agent with Expert Demonstration

Yinheng Li,Hailey Hultquist,Justin Wagle,Kazuhito Koishida

Main category: cs.AI

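TL;DR: Instruction Agent uses expert demonstrations to address the difficulties GUI agents face on complex tasks. It extracts step-by-step instructions and executes them by strictly following the trajectory the user intended, which substantially raises task success rates.

Details Motivation: Current GUI agents perform poorly on complex tasks, especially those involving novel UI elements, long-horizon actions, and personalized trajectories. Instruction Agent aims to solve these problems with expert demonstrations.

Contribution: Proposes Instruction Agent, which extracts step-by-step instructions from an expert demonstration and adds verifier and backtracker modules, improving the robustness and task success rate of GUI agents.

Method: Instructions are extracted from a single demonstration and executed by strictly following the user's intended trajectory. Verifier and backtracker modules handle unexpected events during execution (such as pop-up windows).

Result: On a set of OSWorld tasks that all top-ranked agents failed to complete, Instruction Agent achieves a 60% success rate.

Insight: Combining expert demonstrations with robust execution modules (such as a verifier and a backtracker) can markedly improve GUI agent performance and provides an extensible framework for complex tasks.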

Abstract: Graphical user interface (GUI) agents have advanced rapidly but still struggle with complex tasks involving novel UI elements, long-horizon actions, and personalized trajectories. In this work, we introduce Instruction Agent, a GUI agent that leverages expert demonstrations to solve such tasks, enabling completion of otherwise difficult workflows. Given a single demonstration, the agent extracts step-by-step instructions and executes them by strictly following the trajectory intended by the user, which avoids making mistakes during execution. The agent further leverages verifier and backtracker modules to improve robustness. Both modules are critical for understanding the outcome of each action and for handling unexpected interruptions (such as pop-up windows) during execution. Our experiments show that Instruction Agent achieves a 60% success rate on a set of tasks in OSWorld that all top-ranked agents failed to complete. Instruction Agent offers a practical and extensible framework, bridging the gap between current GUI agents and reliable real-world GUI task automation.

[86] Neuro-Symbolic Frameworks: Conceptual Characterization and Empirical Comparative Analysis

Sania Sinha,Tanawan Premsri,Danial Kamali,Parisa Kordjamshidi

Main category: cs.AI

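TL;DR: This paper characterizes the technical facets of neuro-symbolic (NeSy) frameworks, showcases three representative frameworks (DeepProbLog, Scallop, DomiKnowS), and analyzes the challenges they face in solving complex problems.

Details Motivation: Neuro-symbolic computing combines the flexibility of neural networks with the interpretability of symbolic systems and can solve complex problems reliably and data-efficiently. However, most existing research focuses on algorithms rather than generic frameworks, and user-friendly tools are lacking.

Contribution: 1. Systematically characterizes the technical facets of existing NeSy frameworks; 2. Showcases three generic frameworks; 3. Identifies the expressivity and challenges of each framework.

Method: Compares the strengths and weaknesses of DeepProbLog, Scallop, and DomiKnowS by analyzing their symbolic representation languages, their integration with neural models, and their underlying algorithms.

Result: Reveals the limitations of existing NeSy frameworks in generality, expressivity, and ease of use, and points out directions for future research.

Insight: Future research should place more emphasis on user-friendly, generic frameworks to make neuro-symbolic computing more practical and scalable.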

Abstract: Neurosymbolic (NeSy) frameworks combine neural representations and learning with symbolic representations and reasoning. Combining the reasoning capacities, explainability, and interpretability of symbolic processing with the flexibility and power of neural computing allows us to solve complex problems with more reliability while being data-efficient. However, this recently growing topic poses a challenge to developers with its learning curve, lack of user-friendly tools, libraries, and unifying frameworks. In this paper, we characterize the technical facets of existing NeSy frameworks, such as the symbolic representation language, integration with neural models, and the underlying algorithms. A majority of the NeSy research focuses on algorithms instead of providing generic frameworks for declarative problem specification to leverage problem solving. To highlight the key aspects of Neurosymbolic modeling, we showcase three generic NeSy frameworks: DeepProbLog, Scallop, and DomiKnowS. We identify the challenges within each facet that lay the foundation for identifying the expressivity of each framework in solving a variety of problems. Building on this foundation, we aim to spark transformative action and encourage the community to rethink this problem in novel ways.

[87] Language Self-Play For Data-Free Training

Jakub Grudzien Kuba,Mengting Gu,Qi Ma,Yuandong Tian,Vijai Mohan

Main category: cs.AI

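TL;DR: This paper proposes Language Self-Play (LSP), a reinforcement learning method in which a self-play framework lets a model improve without additional data. On instruction-following tasks it outperforms data-driven baselines.

Details Motivation: Progress in large language models (LLMs) is constrained by their dependence on large amounts of high-quality data. This paper proposes a training method that needs no additional data in order to remove that bottleneck.

Contribution: 1. Proposes Language Self-Play (LSP), which lets a model improve itself through self-play; 2. Shows experimentally that LSP outperforms data-driven baselines on instruction-following tasks.

Method: Adopts a game-theoretic self-play framework in which the model's capabilities are cast as performance in a competitive game, and stronger policies emerge from the model playing against itself.

Result: Experiments with Llama-3.2-3B-Instruct show that LSP improves performance on instruction-following benchmarks, and does so more effectively than data-driven baselines.

Insight: Self-play offers a new approach to data-free training of language models, demonstrates the potential for models to improve themselves, and may reduce reliance on ever more training data.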

Abstract: Large language models (LLMs) have advanced rapidly in recent years, driven by scale, abundant high-quality training data, and reinforcement learning. Yet this progress faces a fundamental bottleneck: the need for ever more data from which models can continue to learn. In this work, we propose a reinforcement learning approach that removes this dependency by enabling models to improve without additional data. Our method leverages a game-theoretic framework of self-play, where a model’s capabilities are cast as performance in a competitive game and stronger policies emerge by having the model play against itself - a process we call Language Self-Play (LSP). Experiments with Llama-3.2-3B-Instruct on instruction-following benchmarks show that pretrained models can not only enhance their performance on challenging tasks through self-play alone, but can also do so more effectively than data-driven baselines.

eess.IV

[88] Enhanced SegNet with Integrated Grad-CAM for Interpretable Retinal Layer Segmentation in OCT Images

S M Asiful Islam Saky,Ugyen Tshering

Main category: eess.IV

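TL;DR: The paper proposes an improved SegNet framework for automated, interpretable segmentation of retinal layers in OCT images. It integrates Grad-CAM for interpretability and achieves strong performance on the Duke OCT dataset.

Details Motivation: Manual segmentation of retinal layers is time-consuming and inconsistent, while conventional deep learning models lack interpretability. A method is needed that delivers both accurate segmentation and explanations that clinicians can verify.

Contribution: 1. Proposes an improved SegNet architecture with stronger feature extraction; 2. Designs a hybrid loss function (cross-entropy plus Dice loss) that improves performance on thin and imbalanced layers; 3. Integrates Grad-CAM to provide visual explanations.

Method: 1. Modify SegNet's pooling strategy; 2. Train with the hybrid loss function; 3. Integrate Grad-CAM to generate visual explanations.

Result: On the Duke OCT dataset, the framework reaches 95.77% validation accuracy, a Dice coefficient of 0.9446, and an IoU of 0.8951. Grad-CAM visualizations highlight anatomically relevant regions.

Insight: The study shows how architectural improvements, loss function design, and interpretability techniques can be combined to address the tension between accuracy and interpretability in medical image segmentation, with potential for clinical use.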

Abstract: Optical Coherence Tomography (OCT) is essential for diagnosing conditions such as glaucoma, diabetic retinopathy, and age-related macular degeneration. Accurate retinal layer segmentation enables quantitative biomarkers critical for clinical decision-making, but manual segmentation is time-consuming and variable, while conventional deep learning models often lack interpretability. This work proposes an improved SegNet-based deep learning framework for automated and interpretable retinal layer segmentation. Architectural innovations, including modified pooling strategies, enhance feature extraction from noisy OCT images, while a hybrid loss function combining categorical cross-entropy and Dice loss improves performance for thin and imbalanced retinal layers. Gradient-weighted Class Activation Mapping (Grad-CAM) is integrated to provide visual explanations, allowing clinical validation of model decisions. Trained and validated on the Duke OCT dataset, the framework achieved 95.77% validation accuracy, a Dice coefficient of 0.9446, and a Jaccard Index (IoU) of 0.8951. Class-wise results confirmed robust performance across most layers, with challenges remaining for thinner boundaries. Grad-CAM visualizations highlighted anatomically relevant regions, aligning segmentation with clinical biomarkers and improving transparency. By combining architectural improvements, a customized hybrid loss, and explainable AI, this study delivers a high-performing SegNet-based framework that bridges the gap between accuracy and interpretability. The approach offers strong potential for standardizing OCT analysis, enhancing diagnostic efficiency, and fostering clinical trust in AI-driven ophthalmic tools.
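The hybrid loss mentioned in the abstract, categorical cross-entropy combined with Dice loss, can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the mixing weight `alpha`, the smoothing constant, and the flat per-pixel data layout are assumptions, and the function names are hypothetical. The helper `iou_from_dice` uses the standard identity J = D / (2 - D), which holds exactly for a single pair of masks and only approximately for averaged scores.

```python
import math


def cross_entropy(probs, labels):
    """Mean categorical cross-entropy.
    probs[i] is the list of class probabilities for pixel i,
    labels[i] is the integer class index of pixel i."""
    eps = 1e-12  # avoids log(0)
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)


def dice_loss(probs, labels, num_classes, smooth=1.0):
    """One minus the soft Dice coefficient averaged over classes."""
    total = 0.0
    for c in range(num_classes):
        intersection = sum(p[c] for p, y in zip(probs, labels) if y == c)
        pred_sum = sum(p[c] for p in probs)
        true_sum = sum(1 for y in labels if y == c)
        total += (2 * intersection + smooth) / (pred_sum + true_sum + smooth)
    return 1.0 - total / num_classes


def hybrid_loss(probs, labels, num_classes, alpha=0.5):
    """Weighted sum of cross-entropy and Dice loss (alpha is an assumed weight)."""
    ce = cross_entropy(probs, labels)
    dl = dice_loss(probs, labels, num_classes)
    return alpha * ce + (1 - alpha) * dl


def iou_from_dice(dice):
    """Jaccard index implied by a Dice coefficient for one pair of masks."""
    return dice / (2.0 - dice)


# Two pixels, two classes: a confident correct prediction and an uncertain one.
labels = [0, 1]
good = hybrid_loss([[1.0, 0.0], [0.0, 1.0]], labels, num_classes=2)
uncertain = hybrid_loss([[0.5, 0.5], [0.5, 0.5]], labels, num_classes=2)
```

The Dice term is normalized per class, so a thin layer with few pixels contributes as much as a thick one, which is the usual motivation for pairing it with cross-entropy on imbalanced segmentation targets.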

physics.ao-ph

[89] Understanding Ice Crystal Habit Diversity with Self-Supervised Learning

Joseph Ko,Hariprasath Govindarajan,Fredrik Lindsten,Vanessa Przybylo,Kara Sulia,Marcus van Lier-Walqui,Kara Lamb

Main category: physics.ao-ph

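TL;DR: The paper uses self-supervised learning (SSL) to learn latent representations from ice crystal imagery, with the goal of improving the characterization of ice crystal morphology and its use in climate modeling.

Details Motivation: The diversity of ice crystal habits (shapes) makes ice-containing clouds hard to model, and conventional approaches struggle to capture that diversity. The paper applies SSL to address this.

Contribution: The main contributions are validating that SSL can learn meaningful representations of ice crystal morphology and demonstrating an application that uses these representations to quantify ice crystal diversity.

Method: A vision transformer is pre-trained with SSL on a large number of cloud particle images to learn robust representations of crystal morphology.

Result: The results indicate that SSL-driven representations improve the characterization of ice crystals, which in turn helps constrain their role in Earth's climate system.

Insight: SSL offers a new tool for handling complex natural phenomena such as ice crystal diversity and shows promise for science-driven tasks.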

Abstract: Ice-containing clouds strongly impact climate, but they are hard to model due to ice crystal habit (i.e., shape) diversity. We use self-supervised learning (SSL) to learn latent representations of crystals from ice crystal imagery. By pre-training a vision transformer with many cloud particle images, we learn robust representations of crystal morphology, which can be used for various science-driven tasks. Our key contributions include (1) validating that our SSL approach can be used to learn meaningful representations, and (2) presenting a relevant application where we quantify ice crystal diversity with these latent representations. Our results demonstrate the power of SSL-driven representations to improve the characterization of ice crystals and subsequently constrain their role in Earth’s climate system.

cs.IR

[90] Beyond Sequential Reranking: Reranker-Guided Search Improves Reasoning Intensive Retrieval

Haike Xu,Tong Chen

Main category: cs.IR

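TL;DR: Reranker-Guided-Search (RGS) is a new retrieval method that bypasses the limits of the traditional retrieve-and-rerank pipeline. It retrieves documents directly according to reranker preferences, using a greedy search over a proximity graph built by approximate nearest neighbor algorithms.

Details Motivation: The traditional retrieve-and-rerank pipeline is constrained by the quality of the initial retrieval and by the computational cost of LLM-based rerankers, which limits how many documents can be processed.

Contribution: Proposes Reranker-Guided-Search (RGS), which uses a greedy search on a proximity graph to choose which documents to rerank, substantially improving retrieval accuracy.

Method: A proximity graph is built with approximate nearest neighbor algorithms, and a greedy search prioritizes promising documents for reranking based on document similarity.

Result: Performance improves substantially on several benchmarks: 3.5 points on BRIGHT, 2.9 on FollowIR, and 5.1 on M-BEIR, all within a reranker budget of 100 documents.

Insight: For a fixed pair of embedding and reranker models, strategically selecting which documents to rerank can significantly improve retrieval accuracy under a limited budget.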

Abstract: The widely used retrieve-and-rerank pipeline faces two critical limitations: it is constrained by the initial retrieval quality of the top-k documents, and the growing computational demands of LLM-based rerankers restrict the number of documents that can be effectively processed. We introduce Reranker-Guided-Search (RGS), a novel approach that bypasses these limitations by directly retrieving documents according to reranker preferences rather than following the traditional sequential reranking method. Our method uses a greedy search on proximity graphs generated by approximate nearest neighbor algorithms, strategically prioritizing promising documents for reranking based on document similarity. Experimental results demonstrate substantial performance improvements across multiple benchmarks: 3.5 points on BRIGHT, 2.9 on FollowIR, and 5.1 on M-BEIR, all within a constrained reranker budget of 100 documents. Our analysis suggests that, given a fixed pair of embedding and reranker models, strategically selecting documents to rerank can significantly improve retrieval accuracy under a limited reranker budget.
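The idea of a greedy search over a proximity graph, steered by reranker scores and capped by a reranker budget, can be sketched as below. This is a minimal best-first search written under stated assumptions, not the RGS algorithm as published: the graph is a plain adjacency dictionary, `rerank_score` is a hypothetical stand-in for an expensive reranker call, and the expansion rule (always expand the best-scored unexpanded document) is one simple choice among several.

```python
import heapq


def reranker_guided_search(graph, seeds, rerank_score, budget):
    """Best-first search over a proximity graph.

    graph: dict mapping doc id -> list of neighboring doc ids
    seeds: starting doc ids, e.g. from a first-stage retriever
    rerank_score: callable doc id -> float, higher is better;
                  each call consumes one unit of budget
    budget: maximum number of reranker calls
    Returns the scored doc ids, best first.
    """
    scores = {}
    frontier = []  # max-heap emulated with negated scores

    def score(doc):
        # Skip documents already scored, and stop spending once the budget is used.
        if doc in scores or len(scores) >= budget:
            return
        scores[doc] = rerank_score(doc)
        heapq.heappush(frontier, (-scores[doc], doc))

    for doc in seeds:
        score(doc)

    while frontier and len(scores) < budget:
        _, best = heapq.heappop(frontier)
        # Neighbors of a well-scored document are the most promising candidates.
        for neighbor in graph.get(best, []):
            score(neighbor)

    return sorted(scores, key=scores.get, reverse=True)


# Toy graph: the most relevant document (3) is three hops from the seed (0).
toy_graph = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3], 5: [0, 6], 6: [5]}
toy_relevance = {0: 0.1, 1: 0.2, 2: 0.5, 3: 0.9, 4: 0.3, 5: 0.0, 6: 0.05}
ranking = reranker_guided_search(toy_graph, [0], toy_relevance.get, budget=5)
```

In the toy run the search reaches document 3 with five reranker calls even though it is not a direct neighbor of the seed, which illustrates how following reranker preferences can surface documents that the initial retrieval would have missed.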

eess.SY

[91] A smart fridge with AI-enabled food computing

Khue Nong Thuc,Khoa Tran Nguyen Anh,Tai Nguyen Huy,Du Nguyen Hao Hong,Khanh Dinh Ba

Main category: eess.SY

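TL;DR: This paper presents an AI-enabled smart fridge system that uses IoT and computer vision for real-time food detection, inventory tracking, and temperature monitoring. It addresses object detection under high-density inventory conditions and improves detection reliability with a modified loss function.

Details Motivation: IoT and AI are increasingly important in home automation, particularly for food management. Object detection under high-density inventory and in cluttered scenes needs to be solved in order to reduce waste and optimize consumption.

Contribution: 1) Proposes a smart fridge system that integrates IoT and computer vision; 2) Designs a focal loss variant that mitigates over-confidence and under-confidence in multi-category classification; 3) Improves detection reliability with techniques such as temperature scaling.

Method: The system has three modules: data pre-processing, object detection and management, and web-based visualization. To address poor model calibration in multi-category classification, it uses a focal loss variant together with adaptive, class-wise temperature scaling.

Result: The experiments indicate that calibration significantly improves detection reliability under varying lighting conditions and scalability challenges. The authors also present the system as a practical route to reduced food waste and more informed consumption.

Insight: By combining IoT and computer vision, a smart fridge can make household food management more efficient and support more sustainable living.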

Abstract: The Internet of Things (IoT) plays a crucial role in enabling seamless connectivity and intelligent home automation, particularly in food management. By integrating IoT with computer vision, the smart fridge employs an ESP32-CAM to establish a monitoring subsystem that enhances food management efficiency through real-time food detection, inventory tracking, and temperature monitoring. This benefits waste reduction, grocery planning improvement, and household consumption optimization. In high-density inventory conditions, capturing partial or layered images complicates object detection, as overlapping items and occluded views hinder accurate identification and counting. Besides, varied angles and obscured details in multi-layered setups reduce algorithm reliability, often resulting in miscounts or misclassifications. Our proposed system is structured into three core modules: data pre-processing, object detection and management, and a web-based visualization. To address the challenge of poor model calibration caused by overconfident predictions, we implement a variant of focal loss that mitigates over-confidence and under-confidence in multi-category classification. This approach incorporates adaptive, class-wise error calibration via temperature scaling and evaluates the distribution of predicted probabilities across methods. Our results demonstrate that robust functional calibration significantly improves detection reliability under varying lighting conditions and scalability challenges. Further analysis demonstrates a practical, user-focused approach to modern food management, advancing sustainable living goals through reduced waste and more informed consumption.
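Temperature scaling, which the abstract names as part of the calibration step, divides the logits by a scalar T before the softmax so that over-confident probabilities are softened without changing the predicted class. The sketch below fits a single global temperature by grid search on held-out data. It is an assumption-laden illustration, not the paper's method: the paper describes an adaptive, class-wise variant alongside a focal loss variant, neither of which is reproduced here, and the candidate grid and function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; T > 1 softens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [
        -math.log(softmax(row, temperature)[label])
        for row, label in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest held-out NLL."""
    return min(candidates, key=lambda t: negative_log_likelihood(logit_rows, labels, t))


# Held-out logits from an over-confident classifier: three correct, one wrong.
val_logits = [[10.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 0.0]]
val_labels = [0, 0, 1, 1]
best_t = fit_temperature(val_logits, val_labels, candidates=[0.5, 1.0, 2.0, 5.0, 10.0])
calibrated = softmax(val_logits[0], best_t)
```

Because dividing by a positive temperature preserves the ordering of the logits, the detector's predicted class is unchanged; only the confidence attached to it moves closer to the observed accuracy.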