cs.CL

[1] Frame Semantic Patterns for Identifying Underreporting of Notifiable Events in Healthcare: The Case of Gender-Based Violence

Lívia Dutra,Arthur Lorenzi,Laís Berno,Franciany Campos,Karoline Biscardi,Kenneth Brown,Marcelo Viridiano,Frederico Belcavello,Ely Matos,Olívia Guaranha,Erik Santos,Sofia Reinach,Tiago Timponi Torrent

Main category: cs.CL

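TL;DR: This paper proposes a semantic-frame-based method for identifying notifiable events in healthcare, such as gender-based violence (GBV). Applied to unstructured e-medical records in Brazilian Portuguese, it reaches a precision of 0.726.

Motivation: Notifiable events such as GBV are underreported in health systems. Current approaches rely on manual review, which is inefficient and hard to scale; the paper aims to automate the process with natural language processing (NLP).

Contribution: A transparent, efficient, and language-agnostic method, built on semantic frames, for identifying notifiable events in unstructured e-medical records, validated experimentally.

Method: Eight semantic-frame patterns are defined and searched for in a corpus of 21 million Brazilian Portuguese sentences. Linguists manually evaluate the results, and the precision of each pattern is measured.

Result: The method identifies reports of violence with a precision of 0.726, supporting its effectiveness for detecting GBV.

Insight: The approach is not tied to this study's setting and can be adapted to other public health surveillance contexts, illustrating a transparent and ethical use of NLP in public health systems.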

Abstract: We introduce a methodology for the identification of notifiable events in the domain of healthcare. The methodology harnesses semantic frames to define fine-grained patterns and search them in unstructured data, namely, open-text fields in e-medical records. We apply the methodology to the problem of underreporting of gender-based violence (GBV) in e-medical records produced during patients’ visits to primary care units. A total of eight patterns are defined and searched on a corpus of 21 million sentences in Brazilian Portuguese extracted from e-SUS APS. The results are manually evaluated by linguists and the precision of each pattern measured. Our findings reveal that the methodology effectively identifies reports of violence with a precision of 0.726, confirming its robustness. Designed as a transparent, efficient, low-carbon, and language-agnostic pipeline, the approach can be easily adapted to other health surveillance contexts, contributing to the broader, ethical, and explainable use of NLP in public health systems.

[2] Semantically-Aware LLM Agent to Enhance Privacy in Conversational AI Services

Jayden Serenari,Stephen Lee

Main category: cs.CL

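TL;DR: The paper proposes LOPSIDED, a semantically aware privacy agent that dynamically replaces sensitive personally identifiable information (PII) with semantically consistent pseudonyms when users interact with remote large language models (LLMs), protecting privacy while preserving the contextual integrity of the conversation.

Motivation: As conversational AI spreads, users may expose sensitive PII when interacting with LLMs, creating a risk of privacy leaks. Existing methods often degrade response quality, so an approach is needed that protects privacy without breaking semantics.

Contribution: The LOPSIDED framework, which balances privacy protection and semantic integrity by dynamically replacing PII and later restoring it, substantially reducing semantic errors.

Method: Sensitive PII in the user's prompt is replaced with semantically consistent pseudonyms; after the LLM generates its response, the pseudonyms are mapped back to the original PII.

Result: On real conversations from ShareGPT, LOPSIDED reduces semantic utility errors by a factor of 5 compared to baseline techniques while also enhancing privacy.

Insight: Dynamic replacement and restoration of PII is an effective way to protect user privacy without sacrificing conversation quality.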

Abstract: With the increasing use of conversational AI systems, there is growing concern over privacy leaks, especially when users share sensitive personal data in interactions with Large Language Models (LLMs). Conversations shared with these models may contain Personally Identifiable Information (PII), which, if exposed, could lead to security breaches or identity theft. To address this challenge, we present the Local Optimizations for Pseudonymization with Semantic Integrity Directed Entity Detection (LOPSIDED) framework, a semantically-aware privacy agent designed to safeguard sensitive PII data when using remote LLMs. Unlike prior work that often degrades response quality, our approach dynamically replaces sensitive PII entities in user prompts with semantically consistent pseudonyms, preserving the contextual integrity of conversations. Once the model generates its response, the pseudonyms are automatically depseudonymized, ensuring the user receives an accurate, privacy-preserving output. We evaluate our approach using real-world conversations sourced from ShareGPT, which we further augment and annotate to assess whether named entities are contextually relevant to the model’s response. Our results show that LOPSIDED reduces semantic utility errors by a factor of 5 compared to baseline techniques, all while enhancing privacy.
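
The replace-then-restore round trip described in the abstract can be illustrated with a minimal sketch. This is not the LOPSIDED implementation: the entity map, the `fake_remote_llm` stand-in, and both helper functions are hypothetical, and the entity detection step that would produce the map is omitted.

```python
import re

def pseudonymize(prompt, mapping):
    """Replace each sensitive entity with its pseudonym (whole-word match)."""
    for real, fake in mapping.items():
        prompt = re.sub(rf"\b{re.escape(real)}\b", fake, prompt)
    return prompt

def depseudonymize(response, mapping):
    """Map pseudonyms in the model response back to the original entities."""
    for real, fake in mapping.items():
        response = re.sub(rf"\b{re.escape(fake)}\b", real, response)
    return response

# Hypothetical entity map; a real system would build it with an entity detector.
mapping = {"Alice Moreau": "Claire Dubois", "Lyon": "Nantes"}

def fake_remote_llm(prompt):
    """Stand-in for a remote LLM call (assumption: it simply echoes the prompt)."""
    return "Reply to: " + prompt

prompt = "Write a note from Alice Moreau about moving to Lyon."
safe_prompt = pseudonymize(prompt, mapping)          # what leaves the device
restored = depseudonymize(fake_remote_llm(safe_prompt), mapping)
print(safe_prompt)
print(restored)
```

Choosing pseudonyms of the same entity type (a person for a person, a city for a city) is what keeps the prompt semantically consistent for the remote model.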

[3] Dataset Creation and Baseline Models for Sexism Detection in Hausa

Fatima Adam Muhammad,Shamsuddeen Muhammad Hassan,Isa Inuwa-Dutse

Main category: cs.CL

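TL;DR: The paper introduces the first sexism detection dataset for Hausa, developed through community engagement, qualitative coding, and data augmentation. It examines how sexism is expressed in Hausa and compares traditional machine learning classifiers with pre-trained multilingual models.

Motivation: Sexism is widespread on online platforms, yet detection methods for low-resource languages such as Hausa remain understudied, mainly because of limited linguistic resources and cultural differences.

Contribution: 1) The first Hausa sexism detection dataset; 2) a user study that captures cultural nuances and linguistic expression; 3) an evaluation of traditional machine learning and pre-trained models on Hausa sexism detection.

Method: 1) Dataset construction through community engagement and qualitative coding; 2) a two-stage user study (n=66) with native speakers to understand the cultural context; 3) experiments with traditional machine learning classifiers and pre-trained multilingual models, including few-shot learning.

Result: Capturing cultural nuance is difficult: clarification-seeking and idiomatic expressions in particular tend to produce many false positives.

Insight: 1) Cultural context is central to how sexism is expressed and detected; 2) sexism detection in low-resource languages needs more locally grounded research.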

Abstract: Sexism reinforces gender inequality and social exclusion by perpetuating stereotypes, bias, and discriminatory norms. Noting how online platforms enable various forms of sexism to thrive, there is a growing need for effective sexism detection and mitigation strategies. While computational approaches to sexism detection are widespread in high-resource languages, progress remains limited in low-resource languages where limited linguistic resources and cultural differences affect how sexism is expressed and perceived. This study introduces the first Hausa sexism detection dataset, developed through community engagement, qualitative coding, and data augmentation. For cultural nuances and linguistic representation, we conducted a two-stage user study (n=66) involving native speakers to explore how sexism is defined and articulated in everyday discourse. We further experiment with both traditional machine learning classifiers and pre-trained multilingual language models, and evaluate the effectiveness of few-shot learning in detecting sexism in Hausa. Our findings highlight challenges in capturing cultural nuance, particularly with clarification-seeking and idiomatic expressions, and reveal a tendency for many false positives in such cases.

[4] Quantitative Intertextuality from the Digital Humanities Perspective: A Survey

Siyu Duan

Main category: cs.CL

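TL;DR: This survey reviews the state and trends of quantitative intertextuality research, summarizing progress in data, methods, and applications, and looks ahead to its potential in interdisciplinary work.

Motivation: Intertextuality is a key concept in literary theory. With advances in natural language processing, quantitative intertextuality studies have emerged, offering new tools and methods for digital humanities and interdisciplinary research.

Contribution: 1) A summary of the data sources and methodological evolution of quantitative intertextuality research; 2) a review of its applications in humanities and social sciences research; 3) an outlook on technology-driven future directions.

Method: The survey analyzes methods ranging from statistics to deep learning, drawing on data from multiple languages and topics, and discusses how quantitative intertextuality studies are carried out and how the techniques have progressed.

Result: Driven by computing advances, quantitative intertextuality research is becoming more precise, diverse, and large-scale, with broad prospects in interdisciplinary research.

Insight: Intertextuality research offers a new way to connect AI and the humanities and may play a role in a wider range of fields.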

Abstract: The connection between texts is referred to as intertextuality in literary theory, which served as an important theoretical basis in many digital humanities studies. Over the past decade, advancements in natural language processing have ushered intertextuality studies into the quantitative age. Large-scale intertextuality research based on cutting-edge methods has continuously emerged. This paper provides a roadmap for quantitative intertextuality studies, summarizing their data, methods, and applications. Drawing on data from multiple languages and topics, this survey reviews methods from statistics to deep learning. It also summarizes their applications in humanities and social sciences research and the associated platform tools. Driven by advances in computer technology, more precise, diverse, and large-scale intertext studies can be anticipated. Intertextuality holds promise for broader application in interdisciplinary research bridging AI and the humanities.

[5] MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models

Zixin Chen,Hongzhan Lin,Kaixin Li,Ziyang Luo,Yayue Deng,Jing Ma

Main category: cs.CL

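TL;DR: The paper proposes MemeArena, a framework for unbiased evaluation of how well multimodal large language models (mLLMs) understand harmful content. It simulates diverse interpretive contexts and uses a consensus mechanism for fair comparison; experiments show its judgments align closely with human preferences.

Motivation: Existing evaluations focus mostly on binary classification and fail to reflect how deeply mLLMs understand harmfulness across diverse contexts, so a more comprehensive evaluation framework is needed.

Contribution: MemeArena, an agent-based, arena-style evaluation framework that supports context-aware and unbiased assessment of mLLMs' understanding of multimodal harmfulness.

Method: Simulate diverse interpretive contexts to form evaluation tasks, integrate the resulting viewpoints, and have evaluators reach consensus to enable fair comparison.

Result: Experiments show that MemeArena effectively reduces the evaluation biases of judge agents, with results that align closely with human preferences.

Insight: Consensus among evaluators and the integration of multiple viewpoints are key to reliable mLLM evaluation, providing a new way to assess multimodal harmfulness understanding.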

Abstract: The proliferation of memes on social media necessitates the capabilities of multimodal Large Language Models (mLLMs) to effectively understand multimodal harmfulness. Existing evaluation approaches predominantly focus on mLLMs’ detection accuracy for binary classification tasks, which often fail to reflect the in-depth interpretive nuance of harmfulness across diverse contexts. In this paper, we propose MemeArena, an agent-based arena-style evaluation framework that provides a context-aware and unbiased assessment for mLLMs’ understanding of multimodal harmfulness. Specifically, MemeArena simulates diverse interpretive contexts to formulate evaluation tasks that elicit perspective-specific analyses from mLLMs. By integrating varied viewpoints and reaching consensus among evaluators, it enables fair and unbiased comparisons of mLLMs’ abilities to interpret multimodal harmfulness. Extensive experiments demonstrate that our framework effectively reduces the evaluation biases of judge agents, with judgment results closely aligning with human preferences, offering valuable insights into reliable and comprehensive mLLM evaluations in multimodal harmfulness understanding. Our code and data are publicly available at https://github.com/Lbotirx/MemeArena.

[6] Identifying the Periodicity of Information in Natural Language

Yulin Ou,Yu Wang,Yang Xu,Hendrik Buschmeier

Main category: cs.CL

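TL;DR: This paper proposes AutoPeriod of Surprisal (APS), a method for detecting periodic patterns in how natural language encodes information. Many texts show significant periodicity, and the detected periods do not fully coincide with the distribution of typical structural units in text.

Motivation: To investigate whether natural language exhibits periodic patterns in its encoded information, and whether that periodicity stems from text structure or from other driving factors.

Contribution: The APS method, which systematically detects periodicity of information in natural language and finds new periods that typical structural units of text do not explain.

Method: A canonical periodicity detection algorithm is applied to the surprisal sequence of a single document, and newly found periods are confirmed via harmonic regression modeling.

Result: A considerable proportion of texts show strong periodicity; the periods do not fully align with typical structural units such as sentence boundaries; periodicity is the joint outcome of structured factors and factors acting at longer distances.

Insight: The periodicity of information in language is not only a product of text structure and may reflect other, longer-range regularities; it also suggests a new angle for detecting LLM-generated text.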

Abstract: Recent theoretical advances on information density in natural language have raised the following question: To what degree does natural language exhibit periodic patterns in its encoded information? We address this question by introducing a new method called AutoPeriod of Surprisal (APS). APS adopts a canonical periodicity detection algorithm and is able to identify any significant periods that exist in the surprisal sequence of a single document. By applying the algorithm to a set of corpora, we have obtained the following interesting results: First, a considerable proportion of human language demonstrates a strong pattern of periodicity in information; second, new periods that are outside the distributions of typical structural units in text (e.g., sentence boundaries, elementary discourse units, etc.) are found and further confirmed via harmonic regression modeling. We conclude that the periodicity of information in language is a joint outcome of both structured factors and other driving factors that take effect at longer distances. The advantages of our periodicity detection method and its potential in LLM-generation detection are further discussed.
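
To make the idea of a period in a surprisal sequence concrete, here is a minimal sketch that picks the lag with the highest autocorrelation. It is a simplified stand-in, not the APS algorithm, whose exact procedure the abstract does not spell out. The surprisal values are synthetic (a sine wave with period 8), and the function names are hypothetical.

```python
import math

def autocorrelation(x, lag):
    """Biased sample autocorrelation of sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

def dominant_period(x, min_lag=2, max_lag=None):
    """Return the lag in [min_lag, max_lag] with the highest autocorrelation."""
    if max_lag is None:
        max_lag = len(x) // 2
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(x, lag))

# Synthetic per-token "surprisal" with a built-in period of 8 tokens.
surprisal = [5.0 + 2.0 * math.sin(2 * math.pi * t / 8) for t in range(128)]
print(dominant_period(surprisal))  # 8
```

On real surprisal sequences a significance test would be needed before accepting a candidate period; this sketch skips that step.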

[7] Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli,Alireza Salemi,Carrie Ye,Mohamed Abdalla,Hamed Zamani,J Ross Mitchell

Main category: cs.CL

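TL;DR: The paper introduces the BEAM benchmark and the LIGHT framework for evaluating and improving long-term memory in large language models (LLMs). BEAM is built from coherent conversations of up to 10M tokens with diverse probing questions, while LIGHT improves performance through three complementary memory systems.

Motivation: Existing benchmarks lack narrative coherence, cover narrow domains, and test only simple recall-oriented tasks, which hampers the evaluation of LLMs on tasks requiring long-term memory.

Contribution: 1. BEAM, a benchmark of 100 conversations and 2,000 validated questions; 2. LIGHT, a framework that combines a long-term episodic memory, a short-term working memory, and a scratchpad.

Method: 1. Automatically generate long (up to 10M tokens), coherent, topically diverse conversations with probing questions; 2. augment LLMs with the three memory components of LIGHT.

Result: Even LLMs with 1M-token context windows struggle as dialogues lengthen. LIGHT consistently improves performance, with average gains of 3.5%-12.69% over the strongest baselines depending on the backbone LLM, and an ablation study confirms that each memory component contributes.

Insight: Equipping LLMs with multiple complementary memory systems, inspired by human cognition, is an effective route to better long-term memory.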

Abstract: Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M token context windows (with and without retrieval-augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5%-12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.

[8] Languages are Modalities: Cross-Lingual Alignment via Encoder Injection

Rajan Agarwal,Aarush Gupta

Main category: cs.CL

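TL;DR: The paper proposes LLINK, a compute-efficient cross-lingual alignment method that treats a low-resource language as a modality. A lightweight contrastive projector aligns sentence embeddings from a multilingual encoder to the decoder's latent embedding space, substantially improving bilingual retrieval and Q&A performance.

Motivation: Instruction-tuned large language models (LLMs) underperform on low-resource, non-Latin-script languages, mainly because of tokenizer fragmentation and weak cross-lingual coupling.

Contribution: LLINK, which achieves cross-lingual alignment with a contrastive projector and minimal adapters, without changing the tokenizer or retraining the decoder.

Method: 1. Align sentence embeddings from a frozen multilingual encoder to the decoder's latent embedding space via a contrastive projector; 2. expand the vector into K soft slots and train minimal adapters so that the frozen decoder can consume the signal.

Result: Bilingual retrieval improves substantially; in LLM-judged Q&A evaluations, LLINK is preferred over the base model 81.3% of the time and over direct fine-tuning 63.6% of the time.

Insight: Treating low-resource languages as a modality is an effective route to stronger cross-lingual alignment, although the model retains weaknesses in numeric fidelity.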

Abstract: Instruction-tuned Large Language Models (LLMs) underperform on low-resource, non-Latin scripts due to tokenizer fragmentation and weak cross-lingual coupling. We present LLINK (Latent Language Injection for Non-English Knowledge), a compute-efficient language-as-modality method that conditions an instruction-tuned decoder without changing the tokenizer or retraining the decoder. First, we align sentence embeddings from a frozen multilingual encoder to the decoder’s latent embedding space at a reserved position via a lightweight contrastive projector. Second, the vector is expanded into K soft slots and trained with minimal adapters so the frozen decoder consumes the signal. LLINK substantially improves bilingual retrieval and achieves 81.3% preference over the base model and 63.6% over direct fine-tuning in LLM-judged Q&A evaluations. We further find that improvements can be attributed to reduced tokenization inflation and a stronger cross-lingual alignment, despite the model having residual weaknesses in numeric fidelity. Treating low-resource languages as a modality offers a practical path to stronger cross-lingual alignment in lightweight LLMs.

[9] MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models

Kangkun Mao,Jinru Ding,Jiayuan Chen,Mouxiao Bian,Ruiyao Chen,Xinwei Peng,Sijie Ren,Linyang Li,Jie Xu

Main category: cs.CL

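TL;DR: The paper presents MedCalc-Eval and MedCalc-Env: the former is a large benchmark for medical calculation ability, the latter a reinforcement learning environment for improving large language models (LLMs) on medical calculation tasks.

Motivation: Existing medical benchmarks focus on question answering or descriptive reasoning and overlook the quantitative reasoning that is critical to clinical decision-making. Existing datasets such as MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios.

Contribution: 1. MedCalc-Eval, the largest benchmark for medical calculation ability, with 700+ tasks spanning multiple medical specialties. 2. MedCalc-Env, a reinforcement learning environment for improving LLM performance on medical calculation.

Method: 1. MedCalc-Eval contains two task types, equation-based and rule-based scoring systems, covering diverse medical scenarios. 2. MedCalc-Env is built on the InternBootcamp framework and supports multi-step clinical reasoning and planning; a Qwen2.5-32B model is fine-tuned within it.

Result: The fine-tuned Qwen2.5-32B model achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness.

Insight: Unit conversion, multi-condition logic, and contextual understanding remain open challenges for medical calculation. MedCalc-Env points to a new direction for making LLMs practical in medicine.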

Abstract: As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs’ medical calculation abilities, comprising 700+ tasks across two types: equation-based (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this environment achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding. Code and datasets are available at https://github.com/maokangkun/MedCalc-Eval.
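
For readers unfamiliar with the two task types, the sketch below implements two equation-based calculators (BMI and the Cockcroft-Gault creatinine clearance estimate) and one rule-based score (Glasgow Coma Scale), using their standard textbook definitions. The function names and input values are illustrative and are not taken from the MedCalc-Eval code or data.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2

def cockcroft_gault(age_years, weight_kg, serum_creatinine_mg_dl, female):
    """Estimated creatinine clearance in mL/min (Cockcroft-Gault equation)."""
    crcl = (140 - age_years) * weight_kg / (72 * serum_creatinine_mg_dl)
    return crcl * 0.85 if female else crcl

def glasgow_coma_scale(eye, verbal, motor):
    """Rule-based score: sum of eye (1-4), verbal (1-5), and motor (1-6) responses."""
    if not (1 <= eye <= 4 and 1 <= verbal <= 5 and 1 <= motor <= 6):
        raise ValueError("component score out of range")
    return eye + verbal + motor

print(round(bmi(70, 1.75), 2))                # 22.86
print(cockcroft_gault(60, 72, 1.0, False))    # 80.0
print(glasgow_coma_scale(4, 5, 6))            # 15
```

The unit assumptions baked into each formula (kg, m, mg/dL) are exactly where the abstract's "unit conversion" challenge arises: a model that receives weight in pounds or creatinine in µmol/L must convert before applying the equation.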

[10] Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?

Deokhyung Kang,Seonjeong Hwang,Daehui Kim,Hyounghun Kim,Gary Geunbae Lee

Main category: cs.CL

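TL;DR: The paper investigates why reasoning language models perform unevenly across languages and proposes Selective Translation to mitigate the gap.

Motivation: Reasoning language models perform strongly on complex reasoning tasks, yet their reasoning in low-resource languages still lags behind high-resource languages. The study seeks the underlying cause and a remedy.

Contribution: The paper shows that the multilingual reasoning gap largely stems from failures in language understanding, and proposes Selective Translation as an effective strategy that detects such failures and mitigates them.

Method: A range of detection methods is evaluated, with supervised approaches performing best. On that basis, Selective Translation translates the input into English only when an understanding failure is detected.

Result: Selective Translation narrows the multilingual reasoning gap to near full-translation performance while translating only about 20% of inputs.

Insight: Understanding failures are the primary cause of the multilingual reasoning gap, and detecting them and translating selectively substantially mitigates it.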

Abstract: Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still suffer from a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have reduced this gap, its underlying causes remain largely unexplored. In this paper, we address this by showing that the multilingual reasoning gap largely stems from failures in language understanding, that is, the model’s inability to represent the meaning of the multilingual input in the dominant language (i.e., English) within its reasoning trace. This motivates us to examine whether understanding failures can be detected, as this ability could help mitigate the multilingual reasoning gap. To this end, we evaluate a range of detection methods and find that understanding failures can indeed be identified, with supervised approaches performing best. Building on this, we propose Selective Translation, a simple yet effective strategy that translates the multilingual input into English only when an understanding failure is detected. Experimental results show that Selective Translation bridges the multilingual reasoning gap, achieving near full-translation performance while using translation for only about 20% of inputs. Together, our work demonstrates that understanding failures are the primary cause of the multilingual reasoning gap and can be detected and selectively mitigated, providing key insight into its origin and a promising path toward more equitable multilingual reasoning. Our code and data are publicly available at https://github.com/deokhk/RLM_analysis.
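
The control flow of Selective Translation, as the abstract describes it, is a simple gate: translate only when a detector flags an understanding failure. The sketch below shows that gate with toy stand-ins. The detector, translator, and reasoner are all hypothetical; in particular, the paper's detectors are learned, whereas the toy one here merely checks for non-ASCII text.

```python
def selective_translation(question, detect_failure, translate, reason):
    """Translate into English only when an understanding failure is flagged."""
    translated = detect_failure(question)
    if translated:
        question = translate(question)
    return reason(question), translated

# Toy stand-ins (assumptions for illustration only).
def toy_detector(question):
    return not question.isascii()

toy_dictionary = {"2 더하기 3은?": "What is 2 plus 3?"}

def toy_translate(question):
    return toy_dictionary[question]

def toy_reasoner(question):
    return "5" if question == "What is 2 plus 3?" else "unknown"

inputs = ["What is 2 plus 3?", "2 더하기 3은?"]
results = [
    selective_translation(q, toy_detector, toy_translate, toy_reasoner)
    for q in inputs
]
print(results)  # [('5', False), ('5', True)]
```

Because translation is only invoked on flagged inputs, its cost scales with the detector's positive rate rather than with the total number of queries.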

[11] A Unified Representation Underlying the Judgment of Large Language Models

Yi-Long Lu,Jiajun Song,Wei Wang

Main category: cs.CL

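TL;DR: The paper asks whether judgment in large language models (LLMs) relies on a unified representation rather than independent modules. It identifies a dominant dimension, the Valence-Assent Axis (VAA), and shows how it controls the generative process.

Motivation: To determine whether LLM judgments rest on specialized modules or a unified representation, and to test experimentally whether a single dominant dimension encodes both subjective evaluation and assent to factual claims.

Contribution: The paper proposes and validates the VAA, shows how it acts as a control signal on generation, and offers an account of the origins of systemic bias and hallucination.

Method: Through direct interventions across a range of LLMs, the authors establish the presence of the VAA and analyze how it steers generated content and the reasoning process.

Result: The VAA is the dominant dimension of judgment in LLMs. It jointly encodes subjective valence and assent to factual claims, and by steering generation it gives rise to systemic bias.

Insight: A unified representation promotes coherent judgment but can also distort reasoning, providing a mechanistic explanation for systemic bias and hallucination in LLMs.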

Abstract: A central architectural question for both biological and artificial intelligence is whether judgment relies on specialized modules or a unified, domain-general resource. While the discovery of decodable neural representations for distinct concepts in Large Language Models (LLMs) has suggested a modular architecture, whether these representations are truly independent systems remains an open question. Here we provide evidence for a convergent architecture. Across a range of LLMs, we find that diverse evaluative judgments are computed along a dominant dimension, which we term the Valence-Assent Axis (VAA). This axis jointly encodes subjective valence (“what is good”) and the model’s assent to factual claims (“what is true”). Through direct interventions, we show this unified representation creates a critical dependency: the VAA functions as a control signal that steers the generative process to construct a rationale consistent with its evaluative state, even at the cost of factual accuracy. This mechanism, which we term the subordination of reasoning, shifts the process of reasoning from impartial inference toward goal-directed justification. Our discovery offers a mechanistic account for systemic bias and hallucination, revealing how an architecture that promotes coherent judgment can systematically undermine faithful reasoning.

[12] ThoughtProbe: Classifier-Guided LLM Thought Space Exploration via Probing Representations

Zijian Wang,Chang Xu

Main category: cs.CL

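TL;DR: ThoughtProbe is a novel inference-time framework that uses the hidden reasoning features of LLMs, via a classifier, to guide exploration of a tree-structured response space, significantly improving reasoning performance.

Motivation: Prior work manipulates an LLM's hidden representations to steer generation; ThoughtProbe instead uses them as discriminative signals to explore the multi-path reasoning space more efficiently.

Contribution: 1) A classifier-guided method for exploring a tree-structured response space; 2) a branch aggregation mechanism that selects the best answer by marginalizing over chain-of-thought (CoT) scores.

Method: 1) At each node expansion, a classifier scores and ranks candidates; 2) after expansion, answers from all branches are collected and their CoT scores aggregated to select the best answer.

Result: Significant improvements on multiple arithmetic reasoning benchmarks; the exploration both covers valid reasoning chains and identifies them effectively.

Insight: Using the discriminative features of hidden representations to guide search allocates compute more efficiently and improves reasoning accuracy.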

Abstract: This paper introduces ThoughtProbe, a novel inference time framework that leverages the hidden reasoning features of Large Language Models (LLMs) to improve their reasoning performance. Unlike previous works that manipulate the hidden representations to steer LLM generation, we harness them as discriminative signals to guide the tree structured response space exploration. In each node expansion, a classifier serves as a scoring and ranking mechanism that efficiently allocates computational resources by prioritizing higher score candidates for continuation. After completing the tree expansion, we collect answers from all branches to form a candidate answer pool. We then propose a branch aggregation method that marginalizes over all supporting branches by aggregating their CoT scores, thereby identifying the optimal answer from the pool. Experimental results show that our framework’s comprehensive exploration not only covers valid reasoning chains but also effectively identifies them, achieving significant improvements across multiple arithmetic reasoning benchmarks.
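
The branch aggregation step can be illustrated with a short sketch. It assumes that "aggregating their CoT scores" means summing the scores of all branches that reach the same answer. That is one plausible reading of the abstract, not a confirmed detail of the method, and the scores below are made up.

```python
from collections import defaultdict

def aggregate_branches(branches):
    """Sum CoT scores per distinct answer and return the best-supported answer.

    branches: iterable of (answer, cot_score) pairs, one per tree branch.
    """
    totals = defaultdict(float)
    for answer, cot_score in branches:
        totals[answer] += cot_score
    return max(totals, key=totals.get)

# Made-up candidate pool: "41" is supported by two moderately scored branches.
branches = [("42", 0.9), ("41", 0.6), ("41", 0.5), ("40", 0.2)]
print(aggregate_branches(branches))  # 41
```

Note that the single highest-scoring branch ("42" at 0.9) loses to an answer backed by several branches (1.1 in total), which is the practical difference between marginalizing over branches and taking the best individual chain.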

[13] VCORE: Variance-Controlled Optimization-based Reweighting for Chain-of-Thought Supervision

Xuan Gong,Senmiao Wang,Hanbo Huang,Ruoyu Sun,Shiyu Liang

Main category: cs.CL

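TL;DR: The paper proposes VCORE, which improves chain-of-thought (CoT) supervision through variance-controlled, optimization-based reweighting, significantly improving LLM performance on complex reasoning tasks.

Motivation: The standard cross-entropy loss treats every token in a CoT trajectory equally, ignoring their different contributions to reasoning, which leads to misallocated supervision and weak generalization.

Contribution: The VCORE framework, which reformulates CoT supervision as a constrained optimization problem, enabling adaptive allocation of supervision and better reasoning generalization.

Method: From an optimization-theoretic perspective, variance-controlled optimization-based reweighting adaptively sets per-token supervision weights so that training aligns more closely with the reasoning objective.

Result: On mathematical and coding benchmarks, VCORE consistently outperforms existing token reweighting methods and provides a more effective initialization for subsequent reinforcement learning.

Insight: Adaptive allocation of supervision is the key: VCORE's optimization-theoretic view addresses the shortcomings of uniform token weighting and offers a new approach for complex reasoning tasks.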

Abstract: Supervised fine-tuning (SFT) on long chain-of-thought (CoT) trajectories has emerged as a crucial technique for enhancing the reasoning abilities of large language models (LLMs). However, the standard cross-entropy loss treats all tokens equally, ignoring their heterogeneous contributions across a reasoning trajectory. This uniform treatment leads to misallocated supervision and weak generalization, especially in complex, long-form reasoning tasks. To address this, we introduce Variance-Controlled Optimization-based REweighting (VCORE), a principled framework that reformulates CoT supervision as a constrained optimization problem. By adopting an optimization-theoretic perspective, VCORE enables a principled and adaptive allocation of supervision across tokens, thereby aligning the training objective more closely with the goal of robust reasoning generalization. Empirical evaluations demonstrate that VCORE consistently outperforms existing token reweighting methods. Across both in-domain and out-of-domain settings, VCORE achieves substantial performance gains on mathematical and coding benchmarks, using models from the Qwen3 series (4B, 8B, 32B) and LLaMA-3.1-8B-Instruct. Moreover, we show that VCORE serves as a more effective initialization for subsequent reinforcement learning, establishing a stronger foundation for advancing the reasoning capabilities of LLMs. The code will be released at https://github.com/coder-gx/VCORE.
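
The contrast between uniform and reweighted token supervision can be shown with a generic token-weighted negative log-likelihood. This sketch only illustrates the general family of token reweighting losses. How VCORE actually derives its weights from the constrained optimization problem is not given in the abstract, so the probabilities and weights below are made up.

```python
import math

def weighted_nll(token_probs, weights):
    """Token-weighted negative log-likelihood, normalized by total weight.

    token_probs: model probability assigned to each target token.
    weights: non-negative supervision weight per token.
    """
    if len(token_probs) != len(weights):
        raise ValueError("token_probs and weights must have the same length")
    total_weight = sum(weights)
    return -sum(w * math.log(p) for p, w in zip(token_probs, weights)) / total_weight

probs = [0.5, 0.25]                          # made-up target-token probabilities
uniform = weighted_nll(probs, [1.0, 1.0])    # standard mean cross-entropy
reweighted = weighted_nll(probs, [3.0, 1.0]) # more supervision on the first token
print(uniform, reweighted)
```

With uniform weights the function reduces to the usual mean cross-entropy, so any reweighting scheme can be read as a departure from that baseline.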

[14] Diffuse Thinking: Exploring Diffusion Language Models as Efficient Thought Proposers for Reasoning

Chenyang Shao,Sijian Ren,Fengli Xu,Yong Li

Main category: cs.CL

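TL;DR: This paper proposes an efficient collaborative reasoning framework in which a diffusion language model (DLM) generates diverse candidate thoughts in parallel and a large language model (LLM) evaluates their quality, easing the computational burden of autoregressive generation.

Motivation: LLMs reason well, but under their autoregressive generation paradigm the cost of proposing thoughts grows steeply with test-time computation while yielding only marginal gains. DLMs can efficiently produce diverse samples through parallel denoising, which offers a possible way out.

Contribution: A collaborative reasoning framework that combines efficient thought proposal by a DLM with quality evaluation by an LLM, reducing computational cost while maintaining reasoning quality.

Method: A DLM generates diverse candidate thoughts in parallel and an LLM evaluates them, forming an efficient collaborative reasoning pipeline.

Result: Experiments across diverse benchmarks show strong performance on complex reasoning tasks, suggesting a promising direction for future research.

Insight: Combining the strengths of different models (efficient generation by DLMs and accurate evaluation by LLMs) is a viable way to balance compute cost and performance in reasoning.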

Abstract: In recent years, large language models (LLMs) have witnessed remarkable advancements, with the test-time scaling law consistently enhancing the reasoning capabilities. Through systematic evaluation and exploration of a diverse spectrum of intermediate thoughts, LLMs demonstrate the potential to generate deliberate reasoning steps, thereby substantially enhancing reasoning accuracy. However, LLMs’ autoregressive generation paradigm results in reasoning performance scaling sub-optimally with test-time computation, often requiring excessive computational overhead to propose thoughts while yielding only marginal performance gains. In contrast, diffusion language models (DLMs) can efficiently produce diverse samples through parallel denoising in a single forward pass, inspiring us to leverage them for proposing intermediate thoughts, thereby alleviating the computational burden associated with autoregressive generation while maintaining quality. In this work, we propose an efficient collaborative reasoning framework, leveraging DLMs to generate candidate thoughts and LLMs to evaluate their quality. Experiments across diverse benchmarks demonstrate that our framework achieves strong performance in complex reasoning tasks, offering a promising direction for future research. Our code is open-source at https://anonymous.4open.science/r/Diffuse-Thinking-EC60.

[15] The aftermath of compounds: Investigating Compounds and their Semantic Representations

Swarang Joshi

Main category: cs.CL

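TL;DR: The study compares static word vectors (GloVe) and contextualized embeddings (BERT) on how well they align with human semantic judgments in processing English compound words, finding that BERT embeddings better capture compositional semantics.

Motivation: To test whether computational embeddings can model human semantic judgments about compounds, specifically lexeme meaning dominance (LMD) and semantic transparency (ST).

Contribution: The study finds that BERT embeddings capture the compositional semantics of compounds better than GloVe, and that predictability ratings are strong predictors of semantic transparency in both human and model data.

Method: Embedding-derived LMD and ST metrics are computed using measures of association strength (Edinburgh Associative Thesaurus), frequency (BNC), and predictability (LaDEC), and their relationship with human judgments is assessed via Spearman's correlation and regression analyses.

Result: BERT embeddings capture compositional semantics better than GloVe, and predictability ratings strongly predict semantic transparency.

Insight: The study advances computational psycholinguistics by clarifying the factors that drive compound word processing and offers practical insights for embedding-based semantic modeling.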

Abstract: This study investigates how well computational embeddings align with human semantic judgments in the processing of English compound words. We compare static word vectors (GloVe) and contextualized embeddings (BERT) against human ratings of lexeme meaning dominance (LMD) and semantic transparency (ST) drawn from a psycholinguistic dataset. Using measures of association strength (Edinburgh Associative Thesaurus), frequency (BNC), and predictability (LaDEC), we compute embedding-derived LMD and ST metrics and assess their relationships with human judgments via Spearman's correlation and regression analyses. Our results show that BERT embeddings better capture compositional semantics than GloVe, and that predictability ratings are strong predictors of semantic transparency in both human and model data. These findings advance computational psycholinguistics by clarifying the factors that drive compound word processing and offering insights into embedding-based semantic modeling.
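
As a reminder of what the Spearman's correlation step measures, the sketch below computes the rank correlation between human ratings and a model-derived score using the classic formula for data without ties. The four rating pairs are invented for illustration and do not come from the study.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid without ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Invented example: human transparency ratings vs. an embedding-derived score.
human_ratings = [6.1, 2.3, 4.8, 3.0]
model_scores = [0.82, 0.35, 0.40, 0.51]
print(spearman(human_ratings, model_scores))  # 0.8
```

Because only ranks matter, the two measures can live on entirely different scales (a 1-7 rating versus a cosine similarity) and still be compared directly.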

[16] Effect of Domain Generalization Techniques in Low Resource Systems

Mahi Aminu,Chisom Chibuike,Fatimo Adebanjo,Omokolade Awosanya,Samuel Oyeneye

Main category: cs.CL

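TL;DR: The paper studies two distinct causal domain generalization techniques in low-resource systems, causal data augmentation and invariant causal representation learning, and shows that both improve cross-domain robustness in sentiment classification and multilingual sentiment analysis.

Motivation: In real-world settings the training and test distributions often differ. In low-resource settings especially, data scarcity and limited domain diversity hinder generalization. Domain generalization (DG) techniques improve robustness by learning features that stay invariant across domains.

Contribution: 1. An analysis of causal data augmentation (CDA) for sentiment classification, expanding the training set with generated counterfactual data; 2. an exploration of invariant causal representation learning (ICRL) for multilingual sentiment analysis, adapting the DINER framework to improve cross-domain performance.

Method: 1. Use causal data augmentation to generate semantically equivalent counterfactual examples; 2. apply the DINER framework for invariant causal representation learning, adapted to a multilingual setting.

Result: Both methods improve robustness on unseen domains: CDA yields consistent cross-domain accuracy gains in sentiment classification, and ICRL improves out-of-distribution performance in multilingual sentiment analysis, though gains vary across languages.

Insight: Causal methods show promise for domain generalization; combining data augmentation with representation learning may offer more general solutions for low-resource systems.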

Abstract: Machine learning models typically assume that training and test data follow the same distribution, an assumption that often fails in real-world scenarios due to distribution shifts. This issue is especially pronounced in low-resource settings, where data scarcity and limited domain diversity hinder robust generalization. Domain generalization (DG) approaches address this challenge by learning features that remain invariant across domains, often using causal mechanisms to improve model robustness. In this study, we examine two distinct causal DG techniques in low-resource natural language tasks. First, we investigate a causal data augmentation (CDA) approach that automatically generates counterfactual examples to improve robustness to spurious correlations. We apply this method to sentiment classification on the NaijaSenti Twitter corpus, expanding the training data with semantically equivalent paraphrases to simulate controlled distribution shifts. Second, we explore an invariant causal representation learning (ICRL) approach using the DINER framework, originally proposed for debiasing aspect-based sentiment analysis. We adapt DINER to a multilingual setting. Our findings demonstrate that both approaches enhance robustness to unseen domains: counterfactual data augmentation yields consistent cross-domain accuracy gains in sentiment classification, while causal representation learning with DINER improves out-of-distribution performance in multilingual sentiment analysis, albeit with varying gains across languages.

[17] DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models

Malik H. Altakrori,Nizar Habash,Abdelhakim Freihat,Younes Samih,Kirill Chirkunov,Muhammed AbuOdeh,Radu Florian,Teresa Lynn,Preslav Nakov,Alham Fikri Aji

Main category: cs.CL

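TL;DR: The paper presents DialectalArabicMMLU, a new benchmark for Arabic dialect capabilities that extends the MMLU-Redux framework to five major dialects. An evaluation of 19 models reveals gaps in dialectal generalization.

Motivation: Existing Arabic and multilingual benchmarks mainly target Modern Standard Arabic (MSA), leaving widely used dialects underrepresented.

Contribution: DialectalArabicMMLU, the first unified, human-curated resource for evaluating dialectal understanding in Arabic, covering 5 dialects and 32 domains.

Method: 3K multiple-choice questions are manually translated and adapted into five dialects on top of the MMLU-Redux framework, yielding 15K QA pairs (22K when English and MSA are included).

Result: An evaluation of 19 models (1B-13B parameters) shows substantial performance variation across dialects and persistent gaps in dialectal generalization.

Insight: Dialect handling is a weak point of current language models; more dialect-aware model design and evaluation are needed.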

Abstract: We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. While recently developed Arabic and multilingual benchmarks have advanced LLM evaluation for Modern Standard Arabic (MSA), dialectal varieties remain underrepresented despite their prevalence in everyday communication. DialectalArabicMMLU extends the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding a total of 15K QA pairs across 32 academic and professional domains (22K QA pairs when also including English and MSA). The benchmark enables systematic assessment of LLM reasoning and comprehension beyond MSA, supporting both task-based and linguistic analysis. We evaluate 19 open-weight Arabic and multilingual LLMs (1B-13B parameters) and report substantial performance variation across dialects, revealing persistent gaps in dialectal generalization. DialectalArabicMMLU provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, thus promoting more inclusive evaluation and future model development.

[18] MARAG-R1: Beyond Single Retriever via Reinforcement-Learned Multi-Tool Agentic Retrieval

Qi Luo,Xiaonan Li,Yuxin Wang,Tingshuo Fan,Yuan Li,Xinchi Chen,Xipeng Qiu

Main category: cs.CL

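TL;DR: MARAG-R1 is a reinforcement-learned multi-tool RAG framework that addresses the limitations of a single retriever in RAG systems by dynamically coordinating multiple retrieval mechanisms for broader and more precise information access.

Details Motivation: Large language models (LLMs) excel at reasoning and generation but are limited by static pretraining data, which leads to factual errors and weak adaptability to new information. Existing RAG systems rely on a single retriever with fixed top-k selection, which restricts comprehensive information access.

Contribution: Proposes a reinforcement-learned multi-tool RAG framework that introduces four retrieval tools (semantic search, keyword search, filtering, and aggregation) and coordinates them dynamically through two-stage training (supervised fine-tuning followed by reinforcement learning).

Method: MARAG-R1 is trained in two stages, supervised fine-tuning followed by reinforcement learning, through which the model learns both how and when to use the tools, allowing it to interleave reasoning and retrieval.

Result: Experiments on GlobalQA, HotpotQA, and 2WikiMultiHopQA show that MARAG-R1 substantially outperforms baselines and achieves new state-of-the-art results on corpus-level reasoning tasks.

Insight: The novelty of MARAG-R1 lies in combining multi-tool retrieval with reinforcement learning, which removes the single-retriever bottleneck and suggests a new direction for broader information access in LLMs.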

Abstract: Large Language Models (LLMs) excel at reasoning and generation but are inherently limited by static pretraining data, resulting in factual inaccuracies and weak adaptability to new information. Retrieval-Augmented Generation (RAG) addresses this issue by grounding LLMs in external knowledge; however, the effectiveness of RAG critically depends on whether the model can adequately access relevant information. Existing RAG systems rely on a single retriever with fixed top-k selection, restricting access to a narrow and static subset of the corpus. As a result, this single-retriever paradigm has become the primary bottleneck for comprehensive external information acquisition, especially in tasks requiring corpus-level reasoning. To overcome this limitation, we propose MARAG-R1, a reinforcement-learned multi-tool RAG framework that enables LLMs to dynamically coordinate multiple retrieval mechanisms for broader and more precise information access. MARAG-R1 equips the model with four retrieval tools – semantic search, keyword search, filtering, and aggregation – and learns both how and when to use them through a two-stage training process: supervised fine-tuning followed by reinforcement learning. This design allows the model to interleave reasoning and retrieval, progressively gathering sufficient evidence for corpus-level synthesis. Experiments on GlobalQA, HotpotQA, and 2WikiMultiHopQA demonstrate that MARAG-R1 substantially outperforms strong baselines and achieves new state-of-the-art results in corpus-level reasoning tasks.

cs.CV [Back]

[19] Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

Fenfen Lin,Yesheng Liu,Haiyu Xu,Chen Yue,Zheqi He,Mingxuan Zhao,Miguel Hu Chen,Jiakang Liu,JG Yao,Xi Yang

Main category: cs.CV

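TL;DR: The paper introduces MeasureBench, a benchmark that evaluates vision-language models (VLMs) on visual measurement reading and exposes the limitations of current VLMs in fine-grained spatial grounding.

Details Motivation: Reading measurement instruments is effortless for humans, yet current VLMs perform poorly on this task, particularly when identifying the key positions of pointers or alignments. A dedicated benchmark is therefore needed to evaluate and improve them.

Contribution: 1. Introduces MeasureBench, a visual measurement reading benchmark covering real-world and synthesized images; 2. Designs an extensible data synthesis pipeline that generates measurement instruments with controllable visual appearance; 3. Identifies the main failure mode of current VLMs in fine-grained spatial grounding.

Method: 1. Develops an extensible data synthesis pipeline that generates images of measurement instruments with controllable visual appearance; 2. Evaluates popular proprietary and open-weight VLMs on MeasureBench; 3. Runs preliminary reinforcement learning experiments on synthetic data.

Result: Even frontier VLMs struggle to read measurements accurately, especially in pointer localization and alignment. Reinforcement learning on synthetic data gives encouraging in-domain results but limited gains on real-world images.

Insight: The paper exposes the limitations of current VLMs in fine-grained spatial grounding, highlights the complexity of visual measurement reading, and points to directions for improving spatial perception and numerical understanding.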

Abstract: Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs), as we find in a preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measurements, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation on popular proprietary and open-weight VLMs shows that even the strongest frontier VLMs struggle with measurement reading in general. A consistent failure mode is indicator localization: models can read digits or labels but misidentify the key positions of pointers or alignments, leading to big numeric errors despite plausible textual reasoning. We have also conducted preliminary experiments with reinforcement learning over synthetic data, and find encouraging results on the in-domain synthetic subset but less promising ones on real-world images. Our analysis highlights a fundamental limitation of current VLMs in fine-grained spatial grounding. We hope this resource can help future advances on visually grounded numeracy and precise spatial perception of VLMs, bridging the gap between recognizing numbers and measuring the world.

[20] PF-DAformer: Proximal Femur Segmentation via Domain Adaptive Transformer for Dual-Center QCT

Rochak Dhakal,Chen Zhao,Zixin Shi,Joyce H. Keyak,Tadashi S. Kaneko,Kuan-Jui Su,Hui Shen,Hong-Wen Deng,Weihua Zhou

Main category: cs.CV

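TL;DR: PF-DAformer proposes a proximal femur segmentation method for multi-institutional QCT that tackles domain shift through adversarial and statistical alignment, with training and validation on a dual-center dataset.

Details Motivation: QCT is crucial for assessing bone strength and fracture risk, but because of domain shift, models trained on one dataset perform poorly on others, which limits their use in multi-center osteoporosis research.

Contribution: Proposes a domain-adaptive transformer framework that combines adversarial alignment (GRL) with statistical alignment (MMD), enabling scanner-agnostic feature learning while preserving anatomical detail.

Method: Uses a 3D TransUNet backbone and integrates GRL-based adversarial alignment with MMD-based statistical alignment to balance invariance and fine-grained alignment.

Result: The model is trained and validated on a dataset of 1,408 QCT scans from two centers, targeting cross-institutional domain shift.

Insight: The dual alignment mechanism is designed to give stable performance on multi-center data, offering a tool for osteoporosis research and clinical decision-making.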

Abstract: Quantitative computed tomography (QCT) plays a crucial role in assessing bone strength and fracture risk by enabling volumetric analysis of bone density distribution in the proximal femur. However, deploying automated segmentation models in practice remains difficult because deep networks trained on one dataset often fail when applied to another. This failure stems from domain shift, where scanners, reconstruction settings, and patient demographics vary across institutions, leading to unstable predictions and unreliable quantitative metrics. Overcoming this barrier is essential for multi-center osteoporosis research and for ensuring that radiomics and structural finite element analysis results remain reproducible across sites. In this work, we developed a domain-adaptive transformer segmentation framework tailored for multi-institutional QCT. Our model is trained and validated on one of the largest hip fracture related research cohorts to date, comprising 1,024 QCT image scans from Tulane University and 384 scans from Rochester, Minnesota for proximal femur segmentation. To address domain shift, we integrate two complementary strategies within a 3D TransUNet backbone: adversarial alignment via Gradient Reversal Layer (GRL), which discourages the network from encoding site-specific cues, and statistical alignment via Maximum Mean Discrepancy (MMD), which explicitly reduces distributional mismatches between institutions. This dual mechanism balances invariance and fine-grained alignment, enabling scanner-agnostic feature learning while preserving anatomical detail.
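To make the statistical-alignment term concrete, the following is a minimal sketch of a biased squared-MMD estimate between two sets of feature vectors. The Gaussian kernel, the bandwidth `sigma`, and the toy "site" features are assumptions for illustration; the abstract does not state which kernel or feature layer PF-DAformer uses.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy between two samples."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Hypothetical 2-D features from two institutions.
site_a = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
site_b_shifted = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]  # strong domain shift
site_b_aligned = [[0.1, 0.0], [0.0, 0.2], [0.2, 0.1]]  # similar distribution

mmd_shifted = mmd_squared(site_a, site_b_shifted)
mmd_aligned = mmd_squared(site_a, site_b_aligned)
print(mmd_shifted, mmd_aligned)
```

Used as a loss, a term of this kind is small when features from the two sites are distributed alike, so minimizing it pushes the encoder toward site-invariant features.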

[21] DC4GS: Directional Consistency-Driven Adaptive Density Control for 3D Gaussian Splatting

Moonsoo Jeong,Dongbeen Kim,Minseong Kim,Sungkil Lee

Main category: cs.CV

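TL;DR: DC4GS proposes a Directional Consistency (DC)-driven Adaptive Density Control (ADC) method for 3D Gaussian Splatting. Unlike conventional ADC, which relies only on the magnitudes of positional gradients, DC4GS also uses the angular coherence of the gradients, reducing redundant splitting and improving reconstruction fidelity.

Details Motivation: Conventional ADC in 3D Gaussian Splatting splits primitives based only on the magnitudes of positional gradients, which can fail to capture local structural complexity and can cause redundant splitting. DC4GS aims to control density more effectively and improve reconstruction quality by taking the directional consistency of the gradients into account.

Contribution: 1. Proposes a Directional Consistency (DC)-driven ADC method; 2. Improves the splitting strategy through the angular coherence of gradients; 3. Reduces redundant splitting (up to 30% fewer primitives in the experiments) while improving reconstruction fidelity.

Method: 1. Incorporates the directional consistency (DC) of the gradients into ADC; 2. Uses the angular coherence of the gradients to refine splitting decisions; 3. When splitting is required, uses the DC to define split positions so that sub-primitives align better with local structures.

Result: DC4GS reduces the number of primitives by up to 30% while also improving reconstruction fidelity.

Insight: Directional consistency is an effective signal for adaptive density control in 3D Gaussian Splatting, reducing computational overhead and improving reconstruction quality at the same time.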

Abstract: We present a Directional Consistency (DC)-driven Adaptive Density Control (ADC) for 3D Gaussian Splatting (DC4GS). Whereas the conventional ADC bases its primitive splitting on the magnitudes of positional gradients, we further incorporate the DC of the gradients into ADC, and realize it through the angular coherence of the gradients. Our DC better captures local structural complexities in ADC, avoiding redundant splitting. When splitting is required, we again utilize the DC to define optimal split positions so that sub-primitives align with the local structures better than under the conventional random placement. As a consequence, our DC4GS greatly reduces the number of primitives (by up to 30% in our experiments) compared with the existing ADC, and also greatly enhances reconstruction fidelity.
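The abstract does not spell out how angular coherence is computed. One common way to quantify how well a set of vectors agree in direction is the mean resultant length of their unit vectors, sketched below as an assumption rather than as the paper's definition. The function name and the toy gradients are hypothetical.

```python
import math


def directional_consistency(gradients):
    """Mean resultant length of unit gradient directions, in [0, 1].

    1.0 means all gradients point the same way; values near 0.0 mean the
    directions cancel out. Zero-length gradients are ignored.
    """
    if not gradients:
        return 0.0
    dim = len(gradients[0])
    unit_sum = [0.0] * dim
    count = 0
    for g in gradients:
        norm = math.sqrt(sum(c * c for c in g))
        if norm == 0.0:
            continue
        for i in range(dim):
            unit_sum[i] += g[i] / norm
        count += 1
    if count == 0:
        return 0.0
    return math.sqrt(sum(c * c for c in unit_sum)) / count


# Gradients that agree in direction but differ in magnitude.
coherent = [(1.0, 0.0), (2.0, 0.0), (0.5, 0.0)]
# Gradients that pull in opposite directions.
conflicting = [(1.0, 0.0), (-1.0, 0.0)]

dc_coherent = directional_consistency(coherent)
dc_conflicting = directional_consistency(conflicting)
print(dc_coherent, dc_conflicting)
```

A magnitude-only criterion would treat both toy sets as similar, whereas a direction-aware score separates them, which is the intuition behind adding DC to the splitting decision.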

[22] SYNAPSE-Net: A Unified Framework with Lesion-Aware Hierarchical Gating for Robust Segmentation of Heterogeneous Brain Lesions

Md. Mehedi Hassan,Shafqat Alam,Shahriar Ahmed Seam,Maruf Ahmed

Main category: cs.CV

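TL;DR: SYNAPSE-Net is a unified framework that uses lesion-aware hierarchical gating to robustly segment heterogeneous brain lesions in multi-modal MRI, with strong generalization and clinical reliability.

Details Motivation: Current deep learning models are mostly task-specific "point solutions" that generalize poorly and show unstable performance, which limits their reliability in clinical use.

Contribution: Proposes SYNAPSE-Net, a unified framework that integrates multi-stream CNN encoders, a Swin Transformer bottleneck, a dynamic cross-modal attention fusion mechanism, and a hierarchical gated decoder to segment multiple types of brain lesions.

Method: Uses a hybrid architecture trained with pathology-specific data augmentation and difficulty-aware sampling to improve generalization and robustness.

Result: Achieves state-of-the-art performance on three public datasets, for example a DSC of 0.831 and an HD95 of 3.03 on the WMH dataset.

Insight: The unified design improves segmentation accuracy and offers a reliable, feasible solution for automated segmentation in clinical settings.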

Abstract: Automated segmentation of heterogeneous brain lesions from multi-modal MRI remains a critical challenge in clinical neuroimaging. Current deep learning models are typically specialized "point solutions" that lack generalization and exhibit high performance variance, limiting their clinical reliability. To address these gaps, we propose the Unified Multi-Stream SYNAPSE-Net, an adaptive framework designed for both generalization and robustness. The framework is built on a novel hybrid architecture integrating multi-stream CNN encoders, a Swin Transformer bottleneck for global context, a dynamic cross-modal attention fusion (CMAF) mechanism, and a hierarchical gated decoder for high-fidelity mask reconstruction. The architecture is trained with a variance reduction strategy that combines pathology-specific data augmentation and a difficulty-aware sampling method. The model was evaluated on three different challenging public datasets: the MICCAI 2017 WMH Challenge, the ISLES 2022 Challenge, and the BraTS 2020 Challenge. Our framework attained a state-of-the-art DSC value of 0.831 with an HD95 value of 3.03 on the WMH dataset. For ISLES 2022, it achieved the best boundary accuracy with a statistically significant difference (HD95 value of 9.69). For BraTS 2020, it reached the highest DSC value for the tumor core region (0.8651). These experimental findings suggest that our unified adaptive framework achieves state-of-the-art performance across multiple brain pathologies, providing a robust and clinically feasible solution for automated segmentation. The source code and the pre-trained models are available at https://github.com/mubid-01/SYNAPSE-Net-pre.
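For readers unfamiliar with the DSC figures quoted above, here is a minimal sketch of the Dice similarity coefficient on flattened binary masks. The toy masks are made up, and returning 1.0 when both masks are empty is an assumed convention; challenge evaluation scripts may handle that case differently.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient (DSC) between two flat binary masks.

    DSC = 2 * |P ∩ T| / (|P| + |T|). Both masks are sequences of 0/1 values
    of equal length.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Hypothetical 5-voxel prediction and ground-truth masks.
pred_mask = [1, 1, 0, 0, 1]
true_mask = [1, 0, 0, 1, 1]

dsc = dice_coefficient(pred_mask, true_mask)
print(round(dsc, 4))
```

In this toy case two of the three predicted lesion voxels overlap the three true lesion voxels, giving 2 * 2 / (3 + 3).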

[23] Semantic Frame Aggregation-based Transformer for Live Video Comment Generation

Anam Fatima,Yi Yu,Janak Kapuriya,Julien Lalanne,Jainendra Shukla

Main category: cs.CV

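TL;DR: The paper proposes a Semantic Frame Aggregation-based Transformer (SFAT) for live video comment generation. It uses CLIP's multimodal knowledge to assign weights to video frames and applies a weighted sum of frames to make comments more contextually relevant. The authors also build a large-scale English video comment dataset to validate the model.

Details Motivation: Live video comment generation is an emerging and challenging task. Existing methods often ignore the semantic relevance of video frames, so the generated comments do not match the context. To address this, the authors propose a method that combines semantic relevance with multimodal information.

Contribution: 1. Proposes the SFAT model, which generates more relevant comments through semantic frame aggregation and weighting; 2. Builds a large-scale multimodal English video comment dataset; 3. Demonstrates the model's effectiveness in generating video comments.

Method: 1. Uses CLIP's visual-text multimodal knowledge to assign semantic weights to video frames; 2. Emphasizes informative frames through a weighted sum of frames; 3. Uses a comment decoder with cross-attention to integrate contextual information from the video and chat modalities.

Result: The authors evaluate SFAT against existing methods for generating comments from live video and ongoing dialogue, using the new large-scale English dataset.

Insight: Combining semantic frame aggregation with multimodal information is important for live video comment generation; future work could extend to finer-grained semantic understanding and dynamic scenes.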

Abstract: Live commenting on video streams has surged in popularity on platforms like Twitch, enhancing viewer engagement through dynamic interactions. However, automatically generating contextually appropriate comments remains a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook an important aspect of prioritizing video frames that are most relevant to ongoing viewer interactions. This prioritization is crucial for producing contextually appropriate comments. To address this gap, we introduce a novel Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. This method not only leverages CLIP’s visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to ongoing viewer conversation. It employs an efficient weighted sum of frames technique to emphasize informative frames while focusing less on irrelevant ones. Finally, our comment decoder with a cross-attention mechanism that attends to each modality ensures that the generated comment reflects contextual cues from both chats and video. Furthermore, to address the limitations of existing datasets, which predominantly focus on Chinese-language content with limited video categories, we have constructed a large scale, diverse, multimodal English video comments dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of our SFAT model by comparing it to existing methods for generating comments from live video and ongoing dialogue contexts.
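The frame-weighting idea can be sketched as follows: score each frame embedding against an embedding of the ongoing chat, normalize the scores, and take the weighted sum. The cosine similarity, the softmax with a `temperature` parameter, and the 2-D toy vectors standing in for CLIP features are assumptions for illustration, not details confirmed by the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def aggregate_frames(frame_embs, chat_emb, temperature=0.1):
    """Weight frames by relevance to the chat and return (weights, weighted sum)."""
    sims = [cosine(f, chat_emb) for f in frame_embs]
    max_sim = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - max_sim) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frame_embs[0])
    aggregated = [sum(w * f[i] for w, f in zip(weights, frame_embs))
                  for i in range(dim)]
    return weights, aggregated


# Hypothetical embeddings: frame 0 matches the chat, frame 1 is unrelated.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
chat = [1.0, 0.0]

weights, video_repr = aggregate_frames(frames, chat)
print(weights, video_repr)
```

The single aggregated vector is dominated by the frames most related to the conversation, which is cheaper than attending over every frame separately.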

[24] MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation

Arghavan Rezvani,Xiangyi Yan,Anthony T. Wu,Kun Han,Pooya Khosravi,Xiaohui Xie

Main category: cs.CV

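TL;DR: MoME is a mixture of visual language experts for medical image segmentation. It combines multi-scale visual features with text embeddings and selects experts dynamically to improve medical image analysis.

Details Motivation: Medical image segmentation must handle complex visual information together with rich textual descriptions, and the potential of combining vision-language models with a mixture-of-experts mechanism has not been fully explored.

Contribution: 1. Proposes MoME, which adapts the Mixture of Experts (MoE) paradigm to medical vision-language tasks; 2. Combines multi-scale visual features with text embeddings for dynamic expert selection; 3. Validates performance on multiple datasets.

Method: Uses a Mixture of Experts (MoE) architecture with dynamic expert selection, drawing on multi-scale visual features and text embeddings for the segmentation task.

Result: Shows strong performance across 10 datasets comprising 3,410 CT scans, demonstrating competitive precision in medical image segmentation.

Insight: Combining vision-language models with the MoE mechanism can improve medical image segmentation and offers a new direction for medical image analysis.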

Abstract: In this study, we propose MoME, a Mixture of Visual Language Medical Experts, for Medical Image Segmentation. MoME adapts the successful Mixture of Experts (MoE) paradigm, widely used in Large Language Models (LLMs), for medical vision-language tasks. The architecture enables dynamic expert selection by effectively utilizing multi-scale visual features tailored to the intricacies of medical imagery, enriched with textual embeddings. This work explores a novel integration of vision-language models for this domain. Utilizing an assembly of 10 datasets, encompassing 3,410 CT scans, MoME demonstrates strong performance on a comprehensive medical imaging segmentation benchmark. Our approach explores the integration of foundation models for medical imaging, benefiting from the established efficacy of MoE in boosting model performance by incorporating textual information. Demonstrating competitive precision across multiple datasets, MoME explores a novel architecture for achieving robust results in medical image analysis.

[25] Incremental Human-Object Interaction Detection with Invariant Relation Representation Learning

Yana Wei,Zeen Chi,Chongyu Wang,Yu Wu,Shipeng Yan,Yongfei Liu,Xuming He

Main category: cs.CV

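TL;DR: The paper studies incremental human-object interaction detection (IHOID) and uses invariant relation representation learning to handle interaction drift and zero-shot HOI combinations in dynamic open-world environments.

Details Motivation: In the open world, human-object interactions (HOIs) change continuously. Conventional closed-world models cannot adapt, so incremental learning is needed to cope with dynamic environments.

Contribution: Proposes an exemplar-free incremental relation distillation (IRD) framework that decouples the learning of objects and relations and introduces two distillation losses for learning invariant relation features.

Method: The IRD framework decouples object learning from relation learning and learns invariant relation features across different HOI combinations that share the same relation.

Result: Outperforms existing baselines on HICO-DET and V-COCO, mitigating forgetting, improving robustness to interaction drift, and generalizing to zero-shot HOI combinations.

Insight: Decoupling object and relation learning and focusing on invariant relation features is key to incremental HOI detection.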

Abstract: In open-world environments, human-object interactions (HOIs) evolve continuously, challenging conventional closed-world HOI detection models. Inspired by humans’ ability to progressively acquire knowledge, we explore incremental HOI detection (IHOID) to develop agents capable of discerning human-object relations in such dynamic environments. This setup confronts not only the common issue of catastrophic forgetting in incremental learning but also distinct challenges posed by interaction drift and detecting zero-shot HOI combinations with sequentially arriving data. Therefore, we propose a novel exemplar-free incremental relation distillation (IRD) framework. IRD decouples the learning of objects and relations, and introduces two unique distillation losses for learning invariant relation features across different HOI combinations that share the same relation. Extensive experiments on HICO-DET and V-COCO datasets demonstrate the superiority of our method over state-of-the-art baselines in mitigating forgetting, strengthening robustness against interaction drift, and generalization on zero-shot HOIs. Code is available at https://github.com/weiyana/ContinualHOI.

[26] VitalLens 2.0: High-Fidelity rPPG for Heart Rate Variability Estimation from Face Video

Philipp V. Rouast

Main category: cs.CV

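TL;DR: VitalLens 2.0 is a deep learning model for high-accuracy remote photoplethysmography (rPPG) that estimates heart rate, respiratory rate, and heart rate variability (HRV) metrics from face video, with significantly better accuracy than previous methods.

Details Motivation: Conventional rPPG techniques are not accurate enough for complex physiological signals such as heart rate variability (HRV). VitalLens 2.0 aims to address this through an improved model architecture and training dataset.

Contribution: 1. Proposes a new deep learning model architecture that significantly improves estimation accuracy for physiological signals such as HRV.
2. Increases the size and diversity of the training data (1,413 unique individuals).
3. Evaluates the model on public and private datasets, setting a new state of the art.

Method: Combines a new model architecture with larger and more diverse training data (1,413 individuals) to estimate HR, RR, and HRV with high accuracy.

Result: On the test set (422 individuals), VitalLens 2.0 achieves a mean absolute error (MAE) of 1.57 bpm for HR, 1.08 bpm for RR, 10.18 ms for HRV-SDNN, and 16.45 ms for HRV-RMSSD, significantly outperforming previous methods.

Insight: The accuracy gains come from the substantial increase in data size and diversity together with the new architecture, suggesting that accurate estimation of complex physiological signals requires improvements on several fronts at once.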

Abstract: This report introduces VitalLens 2.0, a new deep learning model for estimating physiological signals from face video. This new model demonstrates a significant leap in accuracy for remote photoplethysmography (rPPG), enabling the robust estimation of not only heart rate (HR) and respiratory rate (RR) but also Heart Rate Variability (HRV) metrics. This advance is achieved through a combination of a new model architecture and a substantial increase in the size and diversity of our training data, now totaling 1,413 unique individuals. We evaluate VitalLens 2.0 on a new, combined test set of 422 unique individuals from four public and private datasets. When averaging results by individual, VitalLens 2.0 achieves a Mean Absolute Error (MAE) of 1.57 bpm for HR, 1.08 bpm for RR, 10.18 ms for HRV-SDNN, and 16.45 ms for HRV-RMSSD. These results represent a new state-of-the-art, significantly outperforming previous methods. This model is now available to developers via the VitalLens API at https://rouast.com/api.
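The two HRV metrics reported above have standard time-domain definitions, sketched here from a list of inter-beat intervals in milliseconds. The interval values are invented, and using the sample standard deviation (n - 1 denominator) for SDNN is an assumption; the report's exact preprocessing and windowing are not described in the abstract.

```python
import math
import statistics


def sdnn(ibi_ms):
    """SDNN: standard deviation of inter-beat (NN) intervals, in ms."""
    return statistics.stdev(ibi_ms)  # sample standard deviation (assumed)


def rmssd(ibi_ms):
    """RMSSD: root mean square of successive inter-beat interval differences, in ms."""
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def mean_absolute_error(estimates, references):
    """MAE between per-individual estimates and reference values."""
    return sum(abs(e - r) for e, r in zip(estimates, references)) / len(estimates)


# Hypothetical inter-beat intervals (ms) recovered from a pulse waveform.
intervals = [800.0, 810.0, 790.0, 805.0, 795.0]

sdnn_ms = sdnn(intervals)
rmssd_ms = rmssd(intervals)
print(round(sdnn_ms, 2), round(rmssd_ms, 2))
```

Both metrics depend on the precise timing of individual beats, which is why HRV estimation demands a much cleaner pulse waveform than average heart rate does.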

[27] AD-SAM: Fine-Tuning the Segment Anything Vision Foundation Model for Autonomous Driving Perception

Mario Camarena,Het Patel,Fatemeh Nazari,Evangelos Papalexakis,Mohamadhossein Noruzoliaee,Jia Chen

Main category: cs.CV

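TL;DR: AD-SAM is a version of the Segment Anything Model (SAM) adapted to autonomous driving. It improves semantic segmentation with a dual encoder and a deformable decoder, and performs well on Cityscapes and BDD100K.

Details Motivation: The spatial and geometric complexity of driving scenes calls for more effective semantic segmentation models, and the SAM foundation model performs poorly without fine-tuning.

Contribution: Proposes AD-SAM, which combines a dual encoder that fuses multi-scale features, a deformable decoder, and a hybrid loss, significantly improving segmentation accuracy and data efficiency.

Method: 1. Dual encoder: combines global semantics from SAM's ViT-H with local spatial detail from ResNet-50; 2. A deformable fusion module aligns the features; 3. A hybrid loss guides training.

Result: Achieves 68.1 mIoU on Cityscapes and 59.5 mIoU on BDD100K, outperforming baselines by up to 22.9 and 19.2 mIoU, respectively.

Insight: Targeted improvements to foundation models can substantially improve segmentation accuracy and data efficiency in autonomous driving scenes.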

Abstract: This paper presents the Autonomous Driving Segment Anything Model (AD-SAM), a fine-tuned vision foundation model for semantic segmentation in autonomous driving (AD). AD-SAM extends the Segment Anything Model (SAM) with a dual-encoder and deformable decoder tailored to spatial and geometric complexity of road scenes. The dual-encoder produces multi-scale fused representations by combining global semantic context from SAM’s pretrained Vision Transformer (ViT-H) with local spatial detail from a trainable convolutional deep learning backbone (i.e., ResNet-50). A deformable fusion module aligns heterogeneous features across scales and object geometries. The decoder performs progressive multi-stage refinement using deformable attention. Training is guided by a hybrid loss that integrates Focal, Dice, Lovasz-Softmax, and Surface losses, improving semantic class balance, boundary precision, and optimization stability. Experiments on the Cityscapes and Berkeley DeepDrive 100K (BDD100K) benchmarks show that AD-SAM surpasses SAM, Generalized SAM (G-SAM), and a deep learning baseline (DeepLabV3) in segmentation accuracy. It achieves 68.1 mean Intersection over Union (mIoU) on Cityscapes and 59.5 mIoU on BDD100K, outperforming SAM, G-SAM, and DeepLabV3 by margins of up to +22.9 and +19.2 mIoU in structured and diverse road scenes, respectively. AD-SAM demonstrates strong cross-domain generalization with a 0.87 retention score (vs. 0.76 for SAM), and faster, more stable learning dynamics, converging within 30-40 epochs, enjoying double the learning speed of benchmark models. It maintains 0.607 mIoU with only 1000 samples, suggesting data efficiency critical for reducing annotation costs. These results confirm that targeted architectural and optimization enhancements to foundation models enable reliable and scalable AD perception.
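The mIoU metric used throughout these results can be sketched as below: per-class intersection over union, averaged over the classes that appear in either the prediction or the ground truth. The toy labels are made up, the function returns a fraction in [0, 1] (the abstract quotes mIoU both as a percentage and as a fraction), and skipping classes absent from both inputs is an assumed convention that benchmark toolkits may handle differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes, for flat label sequences."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        union = tp + fp + fn
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical per-pixel class labels for a 6-pixel image with 3 classes.
pred_labels = [0, 0, 1, 1, 2, 2]
true_labels = [0, 1, 1, 1, 2, 0]

miou = mean_iou(pred_labels, true_labels, num_classes=3)
print(miou)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5; averaging over classes rather than pixels keeps rare classes from being swamped by road and sky.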

[28] Hierarchical Transformers for Unsupervised 3D Shape Abstraction

Aditya Vora,Lily Goli,Andrea Tagliasacchi,Hao Zhang

Main category: cs.CV

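TL;DR: HiT is a novel hierarchical neural field representation for unsupervised 3D shape abstraction that uses a hierarchical transformer to learn coarse-to-fine hierarchies.

Details Motivation: Existing methods are usually restricted to a fixed hierarchical structure (such as a binary tree) and cannot capture general hierarchies across diverse shape categories. HiT aims to learn general hierarchical relationships across multiple shape categories without supervision.

Contribution: Proposes a hierarchical transformer (HiT) that uses a compressed codebook to automatically learn common substructures across shape categories, with no fixed hierarchical structure and only a limit on the total number of nodes at each level.

Method: Uses a hierarchical transformer to learn the parent-child relationships of the tree, with a compressed codebook that identifies substructures shared across categories; only the number of nodes per level is constrained, not the structure type.

Result: On an unsupervised shape segmentation task over all 55 ShapeNet categories, the method segments shapes into multiple levels of granularity.

Insight: HiT's flexibility lets it learn general and complex hierarchies directly from data, going beyond existing methods with fixed hierarchical structures.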

Abstract: We introduce HiT, a novel hierarchical neural field representation for 3D shapes that learns general hierarchies in a coarse-to-fine manner across different shape categories in an unsupervised setting. Our key contribution is a hierarchical transformer (HiT), where each level learns parent-child relationships of the tree hierarchy using a compressed codebook. This codebook enables the network to automatically identify common substructures across potentially diverse shape categories. Unlike previous works that constrain the task to a fixed hierarchical structure (e.g., binary), we impose no such restriction, except for limiting the total number of nodes at each tree level. This flexibility allows our method to infer the hierarchical structure directly from data, over multiple shape categories, and to represent more general and complex hierarchies than prior approaches. When trained at scale with a reconstruction loss, our model captures meaningful containment relationships between parent and child nodes. We demonstrate its effectiveness through an unsupervised shape segmentation task over all 55 ShapeNet categories, where our method successfully segments shapes into multiple levels of granularity.

[29] ZEBRA: Towards Zero-Shot Cross-Subject Generalization for Universal Brain Visual Decoding

Haonan Wang,Jingyu Lu,Hongrui Li,Xiaomeng Li

Main category: cs.CV

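TL;DR: ZEBRA is a zero-shot, cross-subject brain visual decoding framework that needs no subject-specific fine-tuning. It generalizes to unseen subjects by disentangling the subject-related and semantic-related components of fMRI representations.

Details Motivation: Current methods depend on subject-specific models or fine-tuning, which limits scalability and real-world use. ZEBRA aims to remove this dependence and advance universal neural decoding.

Contribution: 1. Proposes ZEBRA, the first zero-shot brain visual decoding framework; 2. Disentangles the subject-related and semantic-related components of fMRI representations through adversarial training; 3. Achieves performance comparable to fully fine-tuned models on unseen subjects on several metrics.

Method: 1. Decomposes fMRI representations into subject-related and semantic-related components; 2. Uses adversarial training to disentangle them; 3. Isolates subject-invariant, semantic-specific representations for decoding.

Result: In the zero-shot setting, ZEBRA significantly outperforms baselines and approaches fully fine-tuned models on several metrics.

Insight: The semantic-related component of fMRI representations can be extracted independently of the subject, which makes zero-shot cross-subject generalization possible.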

Abstract: Recent advances in neural decoding have enabled the reconstruction of visual experiences from brain activity, positioning fMRI-to-image reconstruction as a promising bridge between neuroscience and computer vision. However, current methods predominantly rely on subject-specific models or require subject-specific fine-tuning, limiting their scalability and real-world applicability. In this work, we introduce ZEBRA, the first zero-shot brain visual decoding framework that eliminates the need for subject-specific adaptation. ZEBRA is built on the key insight that fMRI representations can be decomposed into subject-related and semantic-related components. By leveraging adversarial training, our method explicitly disentangles these components to isolate subject-invariant, semantic-specific representations. This disentanglement allows ZEBRA to generalize to unseen subjects without any additional fMRI data or retraining. Extensive experiments show that ZEBRA significantly outperforms zero-shot baselines and achieves performance comparable to fully finetuned models on several metrics. Our work represents a scalable and practical step toward universal neural decoding. Code and model weights are available at: https://github.com/xmed-lab/ZEBRA.

[30] WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond

Zhicong Sun,Jacqueline Lo,Jinxing Hu

Main category: cs.CV

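TL;DR: The paper presents WildfireX-SLAM, a large-scale, low-altitude synthetic RGB-D dataset for SLAM in wildfire and forest environments. It fills a gap in existing datasets and supports extending 3D Gaussian splatting (3DGS) methods to such scenes.

Details Motivation: Existing 3DGS-based SLAM methods mainly target small-scale indoor scenes, and their use in large-scale forest environments is limited, especially for wildfire emergency response and forest management. The lack of a high-quality dataset is the main obstacle to this line of research.

Contribution: 1. Presents WildfireX-SLAM, a large-scale, high-quality synthetic RGB-D dataset for wildfire and forest environments. 2. Provides flexible control over environmental factors (such as light, weather, and wildfire type) to support a variety of tasks. 3. Establishes a thorough benchmark that reveals the unique challenges of 3DGS-based SLAM in forests and directions for improvement.

Method: 1. Builds the data generation pipeline on the Unreal Engine 5 Electric Dreams Environment Sample Project. 2. Collects aerial and ground-view data, including ground-truth camera poses and additional data modalities. 3. Provides flexible control over environmental parameters to generate diverse data.

Result: WildfireX-SLAM contains 5.5k low-altitude RGB-D aerial images from a forest map covering 16 km². The benchmark shows the challenges facing 3DGS-based SLAM in forest environments and its room for improvement.

Insight: 1. Synthetic data fills the gap left by real-world data that is hard to collect, supporting research on SLAM in large-scale forest environments. 2. Controllable environmental factors make multi-task research (such as wildfire monitoring) easier. 3. The results indicate that 3DGS methods still need further work on complex, large-scale scenes.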

Abstract: 3D Gaussian splatting (3DGS) and its subsequent variants have led to remarkable progress in simultaneous localization and mapping (SLAM). While most recent 3DGS-based SLAM works focus on small-scale indoor scenes, developing 3DGS-based SLAM methods for large-scale forest scenes holds great potential for many real-world applications, especially for wildfire emergency response and forest management. However, this line of research is impeded by the absence of a comprehensive and high-quality dataset, and collecting such a dataset over real-world scenes is costly and technically infeasible. To this end, we have built a large-scale, comprehensive, and high-quality synthetic dataset for SLAM in wildfire and forest environments. Leveraging the Unreal Engine 5 Electric Dreams Environment Sample Project, we developed a pipeline to easily collect aerial and ground views, including ground-truth camera poses and a range of additional data modalities from an unmanned aerial vehicle. Our pipeline also provides flexible controls on environmental factors such as light, weather, and types and conditions of wildfire, supporting the need for various tasks covering forest mapping, wildfire emergency response, and beyond. The resulting pilot dataset, WildfireX-SLAM, contains 5.5k low-altitude RGB-D aerial images from a large-scale forest map with a total size of 16 km². On top of WildfireX-SLAM, a thorough benchmark is also conducted, which not only reveals the unique challenges of 3DGS-based SLAM in the forest but also highlights potential improvements for future works. The dataset and code will be publicly available. Project page: https://zhicongsun.github.io/wildfirexslam.

[31] E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources

Tong Shen,Jingai Yu,Dong Zhou,Dong Li,Emad Barsoum

Main category: cs.CV

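TL;DR: The paper proposes E-MMDiT, an efficient and lightweight multimodal diffusion model that generates high-quality images quickly under limited resources, cutting computational cost through token reduction and new module designs.

Details Motivation: Existing diffusion models usually need large-scale training data and substantial compute, or have heavy structures that cause high latency. E-MMDiT aims to address these problems with an efficient, lightweight design.

Contribution: 1) Proposes E-MMDiT, which has only 304M parameters and needs limited resources; 2) Uses a highly compressive visual tokenizer and a multi-path compression module to reduce computational cost; 3) Introduces Position Reinforcement and ASA modules to improve spatial coherence and computational efficiency; 4) Proposes AdaLN-affine, a lightweight module for computing modulation parameters in transformer blocks.

Method: 1) Reduces the number of tokens with a highly compressive visual tokenizer and a multi-path compression module; 2) Strengthens positional information through Position Reinforcement; 3) Uses ASA to compute attention within subregions and lower cost; 4) Computes modulation parameters efficiently with the AdaLN-affine module.

Result: For 512px generation, the model is trained on only 25M public data in 1.5 days on 8 AMD MI300X GPUs and reaches 0.66 on GenEval, rising to 0.72 with post-training techniques such as GRPO.

Insight: The core of E-MMDiT is balancing computational efficiency and generation quality through token reduction and a new attention design, giving a practical basis for democratizing generative AI models.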

Abstract: Diffusion models have shown strong capabilities in generating high-quality images from text prompts. However, these models often require large-scale training data and significant computational resources to train, or suffer from a heavy structure with high latency. To this end, we propose Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis requiring low training resources. We provide an easily reproducible baseline with competitive results. Our model for 512px generation, trained with only 25M public data in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and reaches 0.72 with some post-training techniques such as GRPO. Our design philosophy centers on token reduction, as the computational cost scales significantly with the token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module for further compression of tokens. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost. In addition, we propose AdaLN-affine, an efficient lightweight module for computing modulation parameters in transformer blocks. Our code is available at https://github.com/AMD-AGI/Nitro-E and we hope E-MMDiT serves as a strong and practical baseline for future research and contributes to the democratization of generative AI models.
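The cost argument behind token reduction and subregion attention can be illustrated with simple counting: full self-attention over N tokens scores N × N pairs, whereas attention restricted to k equal subregions scores k × (N/k)² = N²/k pairs. The sketch below uses a hypothetical token count and an equal-size partition; it is not the ASA implementation, whose partitioning scheme is not described in the abstract.

```python
def attention_pairs(num_tokens, num_subregions=1):
    """Number of query-key pairs scored when attention is limited to equal subregions."""
    if num_tokens % num_subregions != 0:
        raise ValueError("tokens must divide evenly into subregions")
    tokens_per_region = num_tokens // num_subregions
    return num_subregions * tokens_per_region * tokens_per_region


# Hypothetical sequence length; not a figure from the paper.
tokens = 1024

full_cost = attention_pairs(tokens)                         # global attention
subregion_cost = attention_pairs(tokens, num_subregions=4)  # 4 subregions
halved_tokens_cost = attention_pairs(tokens // 2)           # 2x token reduction

print(full_cost, subregion_cost, halved_tokens_cost)
```

Halving the token count and splitting attention into four subregions each cut the pairwise cost by a factor of four in this toy setting, which shows why the two ideas compound.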

[32] Generating Accurate and Detailed Captions for High-Resolution Images

Hankyeol Lee,Gawon Seo,Kyounggyu Lee,Dogun Kim,Kyungwoo Song,Jiyoung Jung

Main category: cs.CV

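TL;DR: The paper proposes a method that combines a vision-language model (VLM), a large language model (LLM), and object detection systems to improve captions for high-resolution images. Its multi-stage process produces more detailed and reliable captions and reduces hallucinations.

Details Motivation: Existing vision-language models are usually pre-trained on low-resolution images, so they lose detail or miss important objects when captioning high-resolution images. The paper aims to address this and improve the accuracy and detail of high-resolution image captions.

Contribution: Proposes a novel multi-stage pipeline that combines a VLM, an LLM, and object detection to produce more detailed and reliable captions, and reduces hallucinations by verifying new objects and removing references to undetected ones.

Method: 1. Generates an initial caption with a VLM; 2. Uses an LLM to identify key objects and predict co-occurring objects; 3. Verifies the predicted objects with object detection; 4. Applies region-specific captioning to newly detected objects.

Result: Experiments on a curated dataset of high-resolution images show that the method produces more detailed and reliable captions while reducing hallucinations.

Insight: Combining multimodal models with object detection can improve caption quality for high-resolution images while reducing hallucinations.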

Abstract: Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution images to these dimensions may result in the loss of visual details and the omission of important objects. To address this limitation, we propose a novel pipeline that integrates vision-language models, large language models (LLMs), and object detection systems to enhance caption quality. Our proposed pipeline refines captions through a novel, multi-stage process. Given a high-resolution image, an initial caption is first generated using a VLM, and key objects in the image are then identified by an LLM. The LLM predicts additional objects likely to co-occur with the identified key objects, and these predictions are verified by object detection systems. Newly detected objects not mentioned in the initial caption undergo focused, region-specific captioning to ensure they are incorporated. This process enriches caption detail while reducing hallucinations by removing references to undetected objects. We evaluate the enhanced captions using pairwise comparison and quantitative scoring from large multimodal models, along with a benchmark for hallucination detection. Experiments on a curated dataset of high-resolution images demonstrate that our pipeline produces more detailed and reliable image captions while effectively minimizing hallucinations.

[33] M^3Detection: Multi-Frame Multi-Level Feature Fusion for Multi-Modal 3D Object Detection with Camera and 4D Imaging Radar

Xiaozhi Li,Huijun Di,Jian Li,Feng Liu,Wei Liang

Main category: cs.CV

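TL;DR: M^3Detection is a multi-frame, multi-modal 3D object detection framework that fuses camera and 4D imaging radar data, aggregating global and local features efficiently to improve detection performance.

Details Motivation: Most existing camera-radar fusion methods take single-frame input, so they cannot exploit spatiotemporal information, and image degradation and radar sparsity further limit detection performance. M^3Detection aims to address these problems through multi-frame fusion and multi-level feature aggregation.

Contribution: Proposes a unified multi-frame 3D detection framework with global-level and local-level feature aggregation modules and a trajectory-level spatiotemporal reasoning module, improving multi-modal data fusion.

Method: 1. Generates reference trajectories from a baseline detector and tracker; 2. Applies global-level and local-level feature aggregation modules; 3. Encodes cross-frame interactions with a trajectory-level spatiotemporal reasoning module.

Result: Achieves state-of-the-art 3D detection performance on the VoD and TJ4DRadSet datasets.

Insight: Multi-frame fusion and multi-level feature aggregation compensate for the limits of single-frame information, and reusing intermediate detector features keeps the computational cost down.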

Abstract: Recent advances in 4D imaging radar have enabled robust perception in adverse weather, while camera sensors provide dense semantic information. Fusing these complementary modalities has great potential for cost-effective 3D perception. However, most existing camera-radar fusion methods are limited to single-frame inputs, capturing only a partial view of the scene. The incomplete scene information, compounded by image degradation and 4D radar sparsity, hinders overall detection performance. In contrast, multi-frame fusion offers richer spatiotemporal information but faces two challenges: achieving robust and effective object feature fusion across frames and modalities, and mitigating the computational cost of redundant feature extraction. Consequently, we propose M^3Detection, a unified multi-frame 3D object detection framework that performs multi-level feature fusion on multi-modal data from camera and 4D imaging radar. Our framework leverages intermediate features from the baseline detector and employs the tracker to produce reference trajectories, improving computational efficiency and providing richer information for the second stage. In the second stage, we design a global-level inter-object feature aggregation module guided by radar information to align global features across candidate proposals and a local-level inter-grid feature aggregation module that expands local features along the reference trajectories to enhance fine-grained object representation. The aggregated features are then processed by a trajectory-level multi-frame spatiotemporal reasoning module to encode cross-frame interactions and enhance temporal representation. Extensive experiments on the VoD and TJ4DRadSet datasets demonstrate that M^3Detection achieves state-of-the-art 3D detection performance, validating its effectiveness in multi-frame detection with camera-4D imaging radar fusion.

[34] DANCER: Dance ANimation via Condition Enhancement and Rendering with diffusion model

Yucheng Xing,Jinxing Yin,Xiaodong Liu

Main category: cs.CV

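TL;DR: DANCER is a diffusion-based framework for single-person dance animation generation. It improves generation quality through condition enhancement and rendering modules and outperforms existing methods on real-world datasets.

Details Motivation: Generating high-quality, temporally continuous video is a major challenge in video generation, especially for dance animation, which involves human motion with many degrees of freedom.

Contribution: 1. Proposes the DANCER framework, which combines an Appearance Enhancement Module (AEM) and a Pose Rendering Module (PRM) to improve generation quality; 2. builds the TikTok-3K dataset to strengthen model training.

Method: 1. Builds on the stable video diffusion model; 2. introduces the AEM to enhance reference-image detail; 3. extends motion guidance through the PRM to capture pose conditions from multiple domains.

Result: Outperforms existing methods on real-world datasets.

Insight: Combining condition enhancement modules with multi-domain pose guidance can significantly improve generation quality for motion with many degrees of freedom.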

Abstract: Recently, diffusion models have shown impressive ability in visual generation tasks. Beyond static images, more and more research attention has been drawn to the generation of realistic videos. Video generation not only places higher demands on quality but also brings the challenge of ensuring video continuity. Among all video generation tasks, human-involved content, such as human dancing, is even more difficult to generate due to the high degrees of freedom associated with human motion. In this paper, we propose a novel framework, named DANCER (Dance ANimation via Condition Enhancement and Rendering with Diffusion Model), for realistic single-person dance synthesis based on the most recent stable video diffusion model. As video generation is generally guided by a reference image and a video sequence, we introduce two important modules into our framework to fully benefit from the two inputs. More specifically, we design an Appearance Enhancement Module (AEM) to focus more on the details of the reference image during generation, and extend the motion guidance through a Pose Rendering Module (PRM) to capture pose conditions from extra domains. To further improve the generation capability of our model, we also collect a large amount of video data from the Internet and build a novel dataset, TikTok-3K, to enhance model training. The effectiveness of the proposed model has been evaluated through extensive experiments on real-world datasets, where the performance of our model is superior to that of the state-of-the-art methods. All the data and code will be released upon acceptance.

[35] H2-Cache: A Novel Hierarchical Dual-Stage Cache for High-Performance Acceleration of Generative Diffusion Models

Mingyu Sung,Il-Min Kim,Sangseok Yun,Jae-Mo Kang

Main category: cs.CV

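TL;DR: H2-Cache is a novel hierarchical dual-stage caching mechanism that significantly accelerates inference in generative diffusion models while preserving high generation quality.

Details Motivation: Diffusion models excel at image generation, but their iterative denoising process is computationally expensive, and existing caching techniques force a trade-off between speed and fidelity when accelerating inference.

Contribution: Proposes H2-Cache, which achieves high-quality accelerated inference through a dual-threshold system and a lightweight technique called pooled feature summarization (PFS).

Method: Splits the denoising process into a structure-defining stage and a detail-refining stage, caches each stage selectively with a dual-threshold mechanism, and uses PFS for fast similarity estimation.

Result: Experiments show that H2-Cache achieves up to 5.08x acceleration on the Flux architecture while keeping image quality close to the baseline.

Insight: Functionally separating the denoising process and summarizing features cheaply are the keys to resolving the speed-quality trade-off.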

Abstract: Diffusion models have emerged as state-of-the-art in image generation, but their practical deployment is hindered by the significant computational cost of their iterative denoising process. While existing caching techniques can accelerate inference, they often create a challenging trade-off between speed and fidelity, suffering from quality degradation and high computational overhead. To address these limitations, we introduce H2-Cache, a novel hierarchical caching mechanism designed for modern generative diffusion model architectures. Our method is founded on the key insight that the denoising process can be functionally separated into a structure-defining stage and a detail-refining stage. H2-Cache leverages this by employing a dual-threshold system, using independent thresholds to selectively cache each stage. To ensure the efficiency of our dual-check approach, we introduce pooled feature summarization (PFS), a lightweight technique for robust and fast similarity estimation. Extensive experiments on the Flux architecture demonstrate that H2-Cache achieves significant acceleration (up to 5.08x) while maintaining image quality nearly identical to the baseline, quantitatively and qualitatively outperforming existing caching methods. Our work presents a robust and practical solution that effectively resolves the speed-quality dilemma, significantly lowering the barrier for the real-world application of high-fidelity diffusion models. Source code is available at https://github.com/Bluear7878/H2-cache-A-Hierarchical-Dual-Stage-Cache.
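
To make the dual-threshold idea concrete, here is a minimal sketch of one cached stage. It is not the authors' implementation: the class `StageCache`, the average pooling in `pooled_summary`, the relative L1 change metric, and the threshold values are all assumptions chosen for illustration. Two such caches with independent thresholds would correspond to the structure-defining and detail-refining stages.

```python
def pooled_summary(features, pool=4):
    """Average-pool a flat feature list into coarse bins (a cheap summary)."""
    return [
        sum(features[i:i + pool]) / len(features[i:i + pool])
        for i in range(0, len(features), pool)
    ]


def relative_change(current, reference, eps=1e-8):
    """Relative L1 difference between two summaries of equal length."""
    num = sum(abs(c - r) for c, r in zip(current, reference))
    den = sum(abs(r) for r in reference) + eps
    return num / den


class StageCache:
    """Reuse a stage's last output while its pooled input summary barely moves."""

    def __init__(self, threshold, pool=4):
        self.threshold = threshold
        self.pool = pool
        self.summary = None  # summary of the input that produced self.output
        self.output = None

    def run(self, stage_input, compute):
        summary = pooled_summary(stage_input, self.pool)
        if (self.summary is not None
                and relative_change(summary, self.summary) < self.threshold):
            return self.output, True  # cache hit: skip the expensive call
        self.summary = summary
        self.output = compute(stage_input)
        return self.output, False


# Hypothetical thresholds: the two stages are tuned independently.
structure_cache = StageCache(threshold=0.05)
detail_cache = StageCache(threshold=0.15)
```

On a hit the stored summary is deliberately left unchanged, so drift is always measured against the input that produced the cached output rather than accumulating silently.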

[36] SilhouetteTell: Practical Video Identification Leveraging Blurred Recordings of Video Subtitles

Guanchong Huang,Song Fang

Main category: cs.CV

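TL;DR: SilhouetteTell is a novel video identification attack that infers video content by analyzing blurred recordings of subtitles (their spatial and temporal information), threatening user privacy.

Details Motivation: Video identification attacks can leak user privacy. Existing methods mostly rely on network traffic analysis, whereas this work proposes a more general approach based on the spatial and temporal features of subtitle silhouettes.

Contribution: Proposes SilhouetteTell, which combines the spatial and time-domain information of subtitle silhouettes into a spatiotemporal feature and applies to both online and offline video identification.

Method: Exploits spatiotemporal correlation to match recorded subtitle silhouettes against a video's subtitle file, and validates the attack using smartphones.

Result: Experiments show that SilhouetteTell infers video titles and clips effectively under various settings, including from a distance of up to 40 meters.

Insight: The spatiotemporal features of subtitle silhouettes are a strong cue for video identification, and the method exposes an easily overlooked weak point in privacy protection.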

Abstract: Video identification attacks pose a significant privacy threat that can reveal videos that victims watch, which may disclose their hobbies, religious beliefs, political leanings, sexual orientation, and health status. Also, video watching history can be used for user profiling or advertising and may result in cyberbullying, discrimination, or blackmail. Existing extensive video inference techniques usually depend on analyzing network traffic generated by streaming online videos. In this work, we observe that the content of a subtitle determines its silhouette displayed on the screen, and identifying each subtitle silhouette also derives the temporal difference between two consecutive subtitles. We then propose SilhouetteTell, a novel video identification attack that combines the spatial and time domain information into a spatiotemporal feature of subtitle silhouettes. SilhouetteTell explores the spatiotemporal correlation between recorded subtitle silhouettes of a video and its subtitle file. It can infer both online and offline videos. Comprehensive experiments on off-the-shelf smartphones confirm the high efficacy of SilhouetteTell for inferring video titles and clips under various settings, including from a distance of up to 40 meters.

[37] Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization

Guozheng Zheng,Jian Guan,Mingjie Xie,Xuanjia Zhao,Congyi Fan,Shiheng Zhang,Pengming Feng

Main category: cs.CV

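TL;DR: This paper proposes a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy to address the challenges that viewpoint gaps and hard negatives pose in cross-view geo-localization (CVGL). Using sample-level relative difficulty assessment and batch-level progressive adaptive loss weighting, DPHR significantly outperforms existing methods on two major benchmarks.

Details Motivation: In cross-view geo-localization, the viewpoint gap between drone and satellite imagery and the presence of hard negatives make existing static weighting strategies perform poorly. They tend to emphasize hard samples too early, which introduces noisy gradients and destabilizes training.

Contribution: Proposes DPHR, comprising a sample-level Ratio-based Difficulty-Aware (RDA) module and a batch-level Progressive Adaptive Loss Weighting (PALW) mechanism, which together improve training stability and effectiveness.

Method: The RDA module assigns fine-grained weights by evaluating the relative difficulty of negatives. The PALW mechanism uses a training-progress signal to adjust the strength of hard-sample mining dynamically, attenuating noisy gradients early in training and strengthening hard-sample learning later.

Result: On the University-1652 and SUES-200 benchmarks, DPHR outperforms existing methods, validating its effectiveness and robustness.

Insight: A dynamic, progressive hardness-aware strategy balances hard-sample learning over the course of training and avoids focusing on noisy samples too early, which improves model performance.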

Abstract: Cross-view geo-localization (CVGL) between drone and satellite imagery remains challenging due to severe viewpoint gaps and the presence of hard negatives, which are visually similar but geographically mismatched samples. Existing mining or reweighting strategies often use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence. In this paper, we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy. At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. At the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a training-progress signal to attenuate noisy gradients during early optimization and progressively enhance hard-negative mining as training matures. Experiments on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness and robustness of the proposed DPHR, achieving consistent improvements over state-of-the-art methods.
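
The abstract does not give the formulas behind RDA and PALW, so the following is only one plausible toy instantiation of the two levels, not the paper's method. It assumes positive similarity scores; the function names, the ratio definition (negative similarity over positive similarity), and the linear blend controlled by `progress` are all hypothetical.

```python
def difficulty_ratios(pos_sim, neg_sims, eps=1e-8):
    """Sample level: a negative is 'harder' the closer its similarity
    is to the positive pair's similarity (assumed ratio definition)."""
    return [n / (pos_sim + eps) for n in neg_sims]


def progressive_weights(pos_sim, neg_sims, progress):
    """Batch level: blend uniform weights with hardness-aware weights.

    progress = 0.0 (start of training) gives uniform weights of 1.0;
    progress = 1.0 (end of training) gives weights proportional to the
    difficulty ratio, normalized to mean 1.0.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    ratios = difficulty_ratios(pos_sim, neg_sims)
    mean_ratio = sum(ratios) / len(ratios)
    return [1.0 + progress * (r / mean_ratio - 1.0) for r in ratios]


print(progressive_weights(0.8, [0.2, 0.6], progress=0.0))
print(progressive_weights(0.8, [0.2, 0.6], progress=1.0))
```

Early in training every negative counts equally, which limits the influence of noisy hard samples; as `progress` grows, the harder negative (similarity 0.6) receives a larger share of the loss.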

[38] Sparse Model Inversion: Efficient Inversion of Vision Transformers for Data-Free Applications

Zixuan Hu,Yongxian Wei,Li Shen,Zhenyi Wang,Lei Li,Chun Yuan,Dacheng Tao

Main category: cs.CV

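TL;DR: This paper proposes a sparse model inversion method that selectively inverts semantic foregrounds rather than the entire image area, significantly improving the efficiency of inverting high-resolution images from large Vision Transformers.

Details Motivation: Conventional dense inversion methods are inefficient because they try to reconstruct the entire image area, especially with high-resolution images and large-scale ViTs. The authors identify redundant background inversion and the inversion of spurious correlations as the main causes.

Contribution: Proposes a sparse model inversion strategy as a plug-and-play extension that significantly accelerates existing dense inversion methods without modifying their loss functions.

Method: Selectively inverts semantic foregrounds while avoiding the inversion of noisy backgrounds and spurious correlations.

Result: Experiments show up to 3.79x acceleration while maintaining comparable or better performance in data-free model quantization and data-free knowledge transfer.

Insight: Sparse inversion not only improves efficiency but also avoids unnecessary noise and spurious correlations in model inversion, offering a new route to making data-free tasks practical.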

Abstract: Model inversion, which aims to reconstruct the original training data from pre-trained discriminative models, is especially useful when the original training data is unavailable due to privacy, usage rights, or size constraints. However, existing dense inversion methods attempt to reconstruct the entire image area, making them extremely inefficient when inverting high-resolution images from large-scale Vision Transformers (ViTs). We further identify two underlying causes of this inefficiency: the redundant inversion of noisy backgrounds and the unintended inversion of spurious correlations, a phenomenon we term “hallucination” in model inversion. To address these limitations, we propose a novel sparse model inversion strategy, as a plug-and-play extension to speed up existing dense inversion methods with no need for modifying their original loss functions. Specifically, we selectively invert semantic foregrounds while stopping the inversion of noisy backgrounds and potential spurious correlations. Through both theoretical and empirical studies, we validate the efficacy of our approach in achieving significant inversion acceleration (up to 3.79x faster) while maintaining comparable or even enhanced downstream performance in data-free model quantization and data-free knowledge transfer. Code is available at https://github.com/Egg-Hu/SMI.

[39] Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions

Caixin Kang,Yifei Huang,Liangyang Ouyang,Mingfang Zhang,Yoichi Sato

Main category: cs.CV

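TL;DR: This paper introduces a new task, MIVA (Multimodal Interactive Veracity Assessment), and builds a multimodal dataset from the social deduction game Werewolf to evaluate how well multimodal large language models (MLLMs) distinguish truth from falsehood in multi-party social interactions. Even strong models such as GPT-4o struggle to do so reliably.

Details Motivation: As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has become a critical frontier. Discerning truth from deception is an important ability in multi-party social interaction, yet automatically detecting deception in dynamic multi-party conversations remains a major challenge.

Contribution: 1. Proposes the MIVA task, filling a research gap on MLLMs' ability to assess veracity in multi-party social interactions. 2. Builds a multimodal dataset from the game Werewolf with synchronized video, text, and ground-truth labels. 3. Evaluates current state-of-the-art MLLMs comprehensively, revealing their performance gap and limitations in distinguishing truth from falsehood.

Method: 1. Designs and collects a multimodal dataset sourced from Werewolf games. 2. Proposes the MIVA task to evaluate MLLMs' veracity assessment in multi-party social interactions. 3. Benchmarks several MLLMs, including GPT-4o, systematically and analyzes their failure modes.

Result: Even the most advanced MLLMs (such as GPT-4o) perform poorly at distinguishing truth from falsehood in multi-party social interactions. These models struggle to ground language in visual social cues and may underperform because their alignment is overly conservative.

Insight: 1. Current MLLMs need to improve at distinguishing truth from falsehood in multi-party social interactions. 2. Failure to fuse language with visual social cues is one of the main causes of poor performance. 3. New approaches are needed to build more perceptive and trustworthy AI systems.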

Abstract: As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Yet their capabilities in this crucial domain are mostly unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video and text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.

[40] Multi-Modal Feature Fusion for Spatial Morphology Analysis of Traditional Villages via Hierarchical Graph Neural Networks

Jiaxin Zhang,Zehong Zhu,Junye Deng,Yunqin Li,Bowen Wang

Main category: cs.CV

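TL;DR: This paper proposes a multi-modal feature fusion method based on a Hierarchical Graph Neural Network (HGNN) for analyzing the spatial morphology of traditional villages, significantly improving performance on classification tasks.

Details Motivation: The spatial characteristics of traditional villages are gradually disappearing and landscapes are becoming homogenized. Existing studies mostly take a single-disciplinary perspective and rely on qualitative analysis, and they are constrained by insufficient data and a lack of digital infrastructure, so quantitative methods that fuse multi-modal data are needed.

Contribution: Proposes a Hierarchical Graph Neural Network (HGNN) model that combines Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), fuses multi-modal features efficiently through a two-stage feature update mechanism, and introduces a relational pooling mechanism and a joint training strategy, significantly improving classification performance.

Method: The HGNN model contains input nodes and communication nodes, connected by static input edges and dynamic communication edges. It combines GCN and GAT and uses a two-stage feature update mechanism with a multi-task joint training strategy.

Result: Across the classification tasks for 17 subtypes, the joint training strategy raises mean accuracy/F1 from 0.71/0.83 to 0.82/0.90, with a 6% gain on parcel tasks.

Insight: Multi-modal data fusion and hierarchical graph neural networks can handle the complexity and data scarcity of village spatial morphology analysis, providing scientific evidence for the generative logic of village spatial patterns.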

Abstract: Village areas hold significant importance in the study of human-land relationships. However, with the advancement of urbanization, the gradual disappearance of spatial characteristics and the homogenization of landscapes have emerged as prominent issues. Existing studies primarily adopt a single-disciplinary perspective to analyze village spatial morphology and its influencing factors, relying heavily on qualitative analysis methods. These efforts are often constrained by the lack of digital infrastructure and insufficient data. To address the current research limitations, this paper proposes a Hierarchical Graph Neural Network (HGNN) model that integrates multi-source data to conduct an in-depth analysis of village spatial morphology. The framework includes two types of nodes (input nodes and communication nodes) and two types of edges (static input edges and dynamic communication edges). By combining Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), the proposed model efficiently integrates multimodal features under a two-stage feature update mechanism. Additionally, based on existing principles for classifying village spatial morphology, the paper introduces a relational pooling mechanism and implements a joint training strategy across 17 subtypes. Experimental results demonstrate that this method achieves significant performance improvements over existing approaches in multimodal fusion and classification tasks. Additionally, the proposed joint optimization of all subtypes lifts mean accuracy/F1 from 0.71/0.83 (independent models) to 0.82/0.90, driven by a 6% gain for parcel tasks. Our method provides scientific evidence for exploring village spatial patterns and generative logic.

[41] MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts

Jingnan Gao,Zhe Wang,Xianze Fang,Xingyu Ren,Zhuo Chen,Shengqi Liu,Yuhao Cheng,Jiangjing Lyu,Xiaokang Yang,Yichao Yan

Main category: cs.CV

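TL;DR: MoRE is a dense 3D visual foundation model built on a Mixture-of-Experts architecture. It dynamically routes features to task-specific experts to improve the scalability and adaptability of 3D visual geometry reconstruction, and it adds a depth refinement module and semantic feature integration to achieve strong multi-task performance.

Details Motivation: In 3D visual geometry reconstruction, scaling up models is effective but is hampered by the complexity of geometric supervision and the diversity of data, so new methods are needed to address scalability and robustness.

Contribution: 1. Proposes the MoRE model, which uses an MoE architecture to assign task experts dynamically; 2. introduces a depth refinement module and semantic feature integration; 3. designs tailored loss functions that support robust multi-task learning.

Method: Builds on an MoE architecture that dynamically routes features to experts; refines geometric estimation with a depth refinement module; integrates semantic features with global 3D representations; optimizes with multi-task losses.

Result: Achieves state-of-the-art performance on multiple benchmarks and supports efficient downstream applications.

Insight: Dynamic task-expert assignment and confidence-driven geometric refinement are key to improving the robustness and adaptability of 3D reconstruction.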

Abstract: Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks. In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations. However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary data aspects and enhance both scalability and adaptability. Aiming to improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation. In addition, it integrates dense semantic features with globally aligned 3D backbone representations for high-fidelity surface normal prediction. MoRE is further optimized with tailored loss functions to ensure robust learning across diverse inputs and multiple geometric tasks. Extensive experiments demonstrate that MoRE achieves state-of-the-art performance across multiple benchmarks and supports effective downstream applications without extra computation.
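
The abstract does not specify how MoRE routes features, so the sketch below shows generic top-k softmax gating, a common Mixture-of-Experts routing scheme, purely as an illustration. The function names, the choice of `k`, and the toy experts are assumptions; experts are assumed to return vectors of the same length as their input.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    chosen = order[:k]
    weights = softmax([gate_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for index, weight in route_top_k(gate_logits, k):
        expert_out = experts[index](x)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


# Toy experts that simply scale their input.
experts = [lambda x: [1.0 * v for v in x],
           lambda x: [2.0 * v for v in x],
           lambda x: [3.0 * v for v in x]]
print(moe_forward([1.0], experts, gate_logits=[0.0, 5.0, 5.0], k=2))
```

Only the selected experts are evaluated, which is what lets model capacity grow without a proportional increase in per-input computation.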

[42] Object-IR: Leveraging Object Consistency and Mesh Deformation for Self-Supervised Image Retargeting

Tianli Liao,Ran Wang,Siqing Zhang,Lei Li,Guangen Liu,Chenyang Zhao,Heling Cao,Peng Li

Main category: cs.CV

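TL;DR: Object-IR is a self-supervised image retargeting method that reduces geometric distortion in semantically important regions through learning-based mesh deformation optimization and object consistency constraints.

Details Motivation: Eliminating geometric distortion in semantically important regions is a hard problem in image retargeting. Conventional methods rely on manually annotated data, a requirement Object-IR avoids through self-supervised learning.

Contribution: 1. Proposes a self-supervised image retargeting framework; 2. designs a comprehensive objective function combining an object-consistent loss, a geometric-preserving loss, and a boundary loss; 3. achieves state-of-the-art performance on the RetargetMe benchmark.

Method: 1. Uses a CNN to predict the mesh deformation; 2. optimizes an objective that combines object consistency, geometric preservation, and boundary losses; 3. derives supervision directly from the input image, with no manual annotation.

Result: On the RetargetMe benchmark, Object-IR outperforms existing methods in both quantitative metrics and subjective visual assessments, and runs in real time on consumer-grade GPUs (0.009 s on average).

Insight: Self-supervised learning with a comprehensive objective function can reduce distortion in semantically important regions without annotated data, and the approach applies to inputs of arbitrary resolution.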

Abstract: Eliminating geometric distortion in semantically important regions remains an intractable challenge in image retargeting. This paper presents Object-IR, a self-supervised architecture that reformulates image retargeting as a learning-based mesh warping optimization problem, where the mesh deformation is guided by object appearance consistency and geometric-preserving constraints. Given an input image and a target aspect ratio, we initialize a uniform rigid mesh at the output resolution and use a convolutional neural network to predict the motion of each mesh grid and obtain the deformed mesh. The retargeted result is generated by warping the input image according to the rigid mesh in the input image and the deformed mesh in the output resolution. To mitigate geometric distortion, we design a comprehensive objective function incorporating a) object-consistent loss to ensure that the important semantic objects retain their appearance, b) geometric-preserving loss to constrain simple scale transform of the important meshes, and c) boundary loss to enforce a clean rectangular output. Notably, our self-supervised paradigm eliminates the need for manually annotated retargeting datasets by deriving supervision directly from the input’s geometric and semantic properties. Extensive evaluations on the RetargetMe benchmark demonstrate that our Object-IR achieves state-of-the-art performance, outperforming existing methods in quantitative metrics and subjective visual quality assessments. The framework efficiently processes arbitrary input resolutions (average inference time: 0.009s for 1024x683 resolution) while maintaining real-time performance on consumer-grade GPUs. The source code will soon be available at https://github.com/tlliao/Object-IR.

[43] Fusion of Heterogeneous Pathology Foundation Models for Whole Slide Image Analysis

Zhidong Yang,Xiuhui Shi,Wei Ba,Zhigang Song,Haijing Luan,Taiyuan Hu,Senlin Lin,Jiguang Wang,Shaohua Kevin Zhou,Rui Yan

Main category: cs.CV

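TL;DR: The paper proposes FuseCPath, a framework that fuses heterogeneous pathology foundation models (FMs) to improve whole slide image (WSI) analysis. Using multi-view clustering and a collaborative distillation strategy, FuseCPath improves feature extraction and fusion and achieves state-of-the-art performance on several cancer datasets.

Details Motivation: Current pathology foundation models are markedly heterogeneous because their training data and network architectures differ, which makes downstream performance unstable. An effective fusion method is needed to exploit the strengths of multiple models.

Contribution: (1) Proposes a multi-view clustering method to select representative training samples; (2) designs a cluster-level re-embedding strategy to fuse heterogeneous local features; (3) proposes a collaborative distillation strategy to fuse global features.

Method: Selects training samples with multi-view clustering, fuses local features with cluster-level re-embedding, and fuses global features with collaborative distillation.

Result: FuseCPath achieves state-of-the-art performance on the TCGA lung cancer, bladder cancer, and colorectal cancer datasets.

Insight: Multi-view clustering and collaborative distillation can fuse the strengths of heterogeneous models effectively, improving the robustness and performance of WSI analysis.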

Abstract: Whole slide image (WSI) analysis has emerged as an increasingly essential technique in computational pathology. Recent advances in the pathological foundation models (FMs) have demonstrated significant advantages in deriving meaningful patch-level or slide-level feature representations from WSIs. However, current pathological FMs have exhibited substantial heterogeneity caused by diverse private training datasets and different network architectures. This heterogeneity introduces performance variability when we utilize the extracted features from different FMs in the downstream tasks. To fully explore the advantage of multiple FMs effectively, in this work, we propose a novel framework for the fusion of heterogeneous pathological FMs, called FuseCPath, yielding a model with a superior ensemble performance. The main contributions of our framework can be summarized as follows: (i) To guarantee the representativeness of the training patches, we propose a multi-view clustering-based method to filter out the discriminative patches via multiple FMs’ embeddings. (ii) To effectively fuse the heterogeneous patch-level FMs, we devise a cluster-level re-embedding strategy to online capture patch-level local features. (iii) To effectively fuse the heterogeneous slide-level FMs, we devise a collaborative distillation strategy to explore the connections between slide-level FMs. Extensive experiments conducted on lung cancer, bladder cancer, and colorectal cancer datasets from The Cancer Genome Atlas (TCGA) have demonstrated that the proposed FuseCPath achieves state-of-the-art performance across multiple tasks on these public datasets.

[44] Trans-defense: Transformer-based Denoiser for Adversarial Defense with Spatial-Frequency Domain Representation

Alik Pramanick,Mayank Bansal,Utkarsh Srivastava,Suklav Ghosh,Arijit Sur

Main category: cs.CV

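TL;DR: The paper proposes a Transformer-based two-phase training method that combines spatial and frequency-domain representations to defend against adversarial attacks, significantly improving classifier robustness.

Details Motivation: Deep neural networks (DNNs) are vulnerable to adversarial attacks, which limits their use in security-critical systems. The paper aims to strengthen model defenses with a denoising strategy that combines spatial and frequency-domain information.

Contribution: 1. Proposes a denoising network that combines the spatial and frequency domains; 2. uses the Discrete Wavelet Transform (DWT) to analyze high-frequency information; 3. integrates spatial and frequency-domain features with a Transformer layer; 4. strengthens classifier robustness through two-phase training.

Method: 1. Phase one trains the denoising network, using the DWT to handle high-frequency information; 2. phase two retrains the classifier on the denoised images; 3. a Transformer layer fuses spatial and frequency-domain features.

Result: On the MNIST, CIFAR-10, and Fashion-MNIST datasets, the method significantly improves classification accuracy, outperforming conventional denoising networks and adversarial training.

Insight: High-frequency components are corrupted more severely by adversarial attacks, so a denoising method that incorporates frequency-domain analysis defends against them more effectively.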

Abstract: In recent times, deep neural networks (DNNs) have been successfully adopted for various applications. Despite their notable achievements, it has become evident that DNNs are vulnerable to sophisticated adversarial attacks, restricting their applications in security-critical systems. In this paper, we present a two-phase training method to tackle such attacks: first, training the denoising network, and second, training the deep classifier model. We propose a novel denoising strategy that integrates both spatial and frequency domain approaches to defend against adversarial attacks on images. Our analysis reveals that high-frequency components of attacked images are more severely corrupted compared to their lower-frequency counterparts. To address this, we leverage Discrete Wavelet Transform (DWT) for frequency analysis and develop a denoising network that combines spatial image features with wavelets through a transformer layer. Next, we retrain the classifier using the denoised images, which enhances the classifier’s robustness against adversarial attacks. Experimental results across the MNIST, CIFAR-10, and Fashion-MNIST datasets reveal that the proposed method remarkably elevates classification accuracy, substantially exceeding the performance of denoising-network and adversarial-training approaches. The code is available at https://github.com/Mayank94/Trans-Defense.
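
For readers unfamiliar with the DWT, the sketch below implements a single-level 1D Haar transform, the simplest wavelet, and shows how it separates a signal into a low-frequency (approximation) band and a high-frequency (detail) band. It is a toy illustration only: the paper works on images and does not state which wavelet it uses, and the alternating perturbation here is a hand-made example, not a real adversarial attack.

```python
import math


def haar_dwt(signal):
    """Single-level 1D Haar DWT: returns (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    pairs = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) / scale for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / scale for i in pairs]
    return approx, detail


def band_energy(coefficients):
    """Sum of squared coefficients in one band."""
    return sum(c * c for c in coefficients)


clean = [1.0, 1.0, 1.0, 1.0]
perturbed = [1.1, 0.9, 1.1, 0.9]  # small sign-alternating perturbation

clean_low, clean_high = haar_dwt(clean)
pert_low, pert_high = haar_dwt(perturbed)
print("low-band energy: ", band_energy(clean_low), band_energy(pert_low))
print("high-band energy:", band_energy(clean_high), band_energy(pert_high))
```

In this constructed case the perturbation leaves the low-frequency band unchanged and appears entirely in the high-frequency band, which is the kind of band-wise comparison the abstract's analysis relies on.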

[45] C-LEAD: Contrastive Learning for Enhanced Adversarial Defense

Suklav Ghosh,Sonal Kumar,Arijit Sur

Main category: cs.CV

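TL;DR: This paper proposes C-LEAD, a novel adversarial defense that uses contrastive learning to make classification models more robust. The model is trained with a contrastive loss on both clean and adversarially perturbed images so that it extracts more informative and robust features.

Details Motivation: Deep neural networks perform well on computer vision tasks but are vulnerable to adversarial attacks, where small perturbations of the input image can cause incorrect predictions. Robust deep-learning systems are needed.

Contribution: Applies contrastive learning to adversarial defense, which the authors describe as previously unexplored, and proposes C-LEAD, which optimizes the model parameters alongside the perturbations through a contrastive loss to improve robustness.

Method: Uses a contrastive learning framework that trains the model on both clean and adversarially perturbed images, extracting robust features with a contrastive loss.

Result: Experiments show that the method significantly improves robustness against various adversarial perturbations, suggesting that the contrastive loss helps extract more informative and robust features.

Insight: Contrastive learning is useful not only for unsupervised learning but also for strengthening adversarial robustness, offering a new direction for adversarial defense.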

Abstract: Deep neural networks (DNNs) have achieved remarkable success in computer vision tasks such as image classification, segmentation, and object detection. However, they are vulnerable to adversarial attacks, which can cause incorrect predictions with small perturbations in input images. Addressing this issue is crucial for deploying robust deep-learning systems. This paper presents a novel approach that utilizes contrastive learning for adversarial defense, a previously unexplored area. Our method leverages the contrastive loss function to enhance the robustness of classification models by training them with both clean and adversarially perturbed images. By optimizing the model’s parameters alongside the perturbations, our approach enables the network to learn robust representations that are less susceptible to adversarial attacks. Experimental results show significant improvements in the model’s robustness against various types of adversarial perturbations. This suggests that contrastive loss helps extract more informative and resilient features, contributing to the field of adversarial robustness in deep learning.

[46] Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes

Yehna Kim,Young-Eun Kim,Seong-Whan Lee

Main category: cs.CV

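TL;DR: The paper proposes enhancing spatio-temporal zero-shot action recognition with language-driven description attributes. It extracts keywords from web-crawled descriptions with a large language model, requiring no manual annotation, and introduces a spatio-temporal interaction module, significantly improving zero-shot action recognition performance.

Details Motivation: Conventional vision-language models rely only on action class names for semantic context in zero-shot action recognition, so words with multiple meanings introduce ambiguity and limit performance. To address this, the paper extracts keywords from web descriptions using a large language model.

Contribution: 1. Proposes a new language-driven description attribute method that reduces reliance on manual annotation; 2. designs a spatio-temporal interaction module that focuses on aligning objects and action units; 3. achieves significant zero-shot action recognition gains on multiple datasets.

Method: 1. Extracts keywords from web-crawled descriptions with a large language model; 2. designs a spatio-temporal interaction module to align description attributes with video content; 3. validates the approach with zero-shot experiments on multiple datasets.

Result: Achieves accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, demonstrating the method's effectiveness and adaptability.

Insight: Language-driven description attributes reduce semantic ambiguity and improve zero-shot action recognition, and the spatio-temporal interaction module offers a new way to align video content with descriptions.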

Abstract: Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large-language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model’s adaptability and effectiveness across various downstream tasks.

[47] RegionRAG: Region-level Retrieval-Augumented Generation for Visually-Rich Documents

Yinglu Li,Zhiying Lu,Zhihang Liu,Chuanbin Liu,Hongtao Xie

Main category: cs.CV

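TL;DR: RegionRAG is a region-level multi-modal retrieval-augmented generation framework. By refining the retrieval unit from the document level to the region level, it significantly reduces interference from irrelevant visual content and improves the efficiency and accuracy of the generator.

Details Motivation: Current multi-modal retrieval-augmented generation methods use the entire document as the retrieval unit, which introduces large amounts of irrelevant visual content that distracts the model and degrades performance. To address this, the authors propose a region-level retrieval method.

Contribution: 1) Refines the retrieval unit from the document level to the region level, reducing interference from irrelevant content; 2) designs a hybrid supervision strategy and a dynamic pipeline that locate relevant semantic regions precisely during training and inference; 3) outperforms existing methods on multiple benchmarks.

Method: 1) During training, a hybrid supervision strategy over labeled and unlabeled data identifies relevant regions; 2) during inference, a dynamic pipeline groups salient patches into semantic regions; 3) region-level retrieval reduces the number of visual tokens used, improving efficiency.

Result: On six benchmarks, RegionRAG improves retrieval accuracy (R@1) by 10.02% on average and question answering accuracy by 3.56%, while using only 71.42% of the visual tokens of prior methods.

Insight: Refining the retrieval unit to the region level reduces interference from irrelevant visual content, which significantly improves both the performance and the efficiency of multi-modal retrieval-augmented generation.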

Abstract: Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods consider the entire document as the basic retrieval unit, introducing substantial irrelevant visual content in two ways: 1) Relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) Retrieving multiple documents to increase recall further introduces redundant and irrelevant documents. These redundant contexts distract the model’s attention and further degrade the performance. To address this challenge, we propose RegionRAG, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy from both labeled data and unlabeled data to pinpoint relevant patches. During inference, we propose a dynamic pipeline that intelligently groups salient patches into complete semantic regions. By delegating the task of identifying relevant regions to the retriever, RegionRAG enables the generator to focus solely on concise visual content relevant to queries, improving both efficiency and accuracy. Experiments on six benchmarks demonstrate that RegionRAG achieves state-of-the-art performance. It improves retrieval accuracy by 10.02% in R@1 on average and increases question answering accuracy by 3.56% while using only 71.42% of the visual tokens compared to prior methods. The code will be available at https://github.com/Aeryn666/RegionRAG.
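
One simple way to group salient patches into regions is connected-component labeling on the patch grid. The abstract does not describe RegionRAG's actual grouping procedure, so the sketch below is an assumption: it thresholds hypothetical per-patch relevance scores, merges 4-connected salient patches, and returns one bounding box per group.

```python
def group_salient_patches(scores, threshold):
    """Group 4-connected patches whose score >= threshold.

    scores is a 2D list (rows x cols) of per-patch relevance scores.
    Returns bounding boxes as (row_min, col_min, row_max, col_max),
    in the order their first patch is met in a row-major scan.
    """
    rows, cols = len(scores), len(scores[0])
    seen = set()
    regions = []
    for r in range(rows):
        for c in range(cols):
            if scores[r][c] < threshold or (r, c) in seen:
                continue
            # Flood fill from this seed patch with an explicit stack.
            stack = [(r, c)]
            seen.add((r, c))
            r0 = r1 = r
            c0 = c1 = c
            while stack:
                y, x = stack.pop()
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and scores[ny][nx] >= threshold):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            regions.append((r0, c0, r1, c1))
    return regions


scores = [[0.9, 0.8, 0.1],
          [0.1, 0.7, 0.1],
          [0.1, 0.1, 0.6]]
print(group_salient_patches(scores, threshold=0.5))
```

Passing only the patches inside these boxes to the generator is what would reduce the visual token count relative to passing whole documents.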

[48] T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis

Raza Imam,Hu Wang,Dwarikanath Mahapatra,Mohammad Yaqub

Main category: cs.CV

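TL;DR: The paper proposes T3 (Test-Time Task adaptive merging), a backpropagation-free framework that dynamically merges a generalist model and an expert model in vision-language models (VLMs) to address modality shift in medical imaging analysis. T3 computes per-sample or per-batch merging coefficients from the Jensen-Shannon divergence and performs strongly across diverse medical modalities.

Details Motivation: In medical imaging analysis, pretrained models are robust but lack modality-specific characteristics, while fine-tuned expert models perform well in distribution but are sensitive to modality shift. Existing model-merging methods are static and poorly suited to diverse medical tasks, so a dynamic and efficient solution is needed.

Contribution: 1. Proposes the T3 framework, which computes merging coefficients dynamically; 2. proposes a batch-wise extension, T3_B, to reduce computational cost; 3. establishes a standardized medical merging benchmark covering multiple modalities and tasks.

Method: Uses the Jensen-Shannon divergence to measure the distance between the models' output distributions and adjusts the merging weights of the generalist and expert models dynamically. T3_B further improves efficiency by computing the coefficient per batch.

Result: Sets a new state of the art in Top-1 accuracy and error reduction, performs strongly on cross-modality tasks, and remains computationally efficient.

Insight: A dynamic merging strategy is essential in medical imaging analysis because it balances generality and specificity, and batch-wise processing significantly improves computational efficiency, making the approach suitable for clinical use.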

Abstract: In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific characteristics, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T^3), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen-Shannon divergence between the two models’ output distributions. T^3 dynamically preserves local precision when models agree and defers to generalist robustness under drift. To overcome the inference costs of sample-wise merging, we further propose a batch-wise extension, T^3_B, that computes a merging coefficient across a batch of samples, dramatically reducing computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruptions across four modalities. Empirically, T^3 sets new state-of-the-art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive MVLM deployment in clinical settings. Our code is available at https://github.com/Razaimam45/TCube.
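To make the coefficient computation concrete, here is a minimal sketch of divergence-driven interpolation. It assumes the base-2 Jensen-Shannon divergence (which lies in [0, 1]) is used directly as the weight on the generalist model, and that the batch-wise variant simply averages the per-sample values; both choices, and the names `js_divergence`, `batch_coefficient`, and `interpolate`, are illustrative assumptions rather than the authors' exact formulation.

```python
import math


def kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def batch_coefficient(p_general_batch, p_expert_batch):
    """Assumed batch-wise variant: mean per-sample divergence."""
    values = [js_divergence(p, q) for p, q in zip(p_general_batch, p_expert_batch)]
    return sum(values) / len(values)


def interpolate(theta_general, theta_expert, lam):
    """Weight lam on the generalist, 1 - lam on the expert (assumption)."""
    return [lam * g + (1 - lam) * e for g, e in zip(theta_general, theta_expert)]
```

With this convention, agreement between the two models (divergence near 0) keeps the expert's weights, and strong disagreement (divergence near 1) falls back to the generalist, matching the behaviour the abstract describes.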

[49] HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration

Shaojie Zhang,Pei Fu,Ruoceng Zhang,Jiahui Yang,Anan Du,Xiuwen Xi,Shaokang Wang,Ying Huang,Bin Qin,Zhenbo Luo,Jian Luan

Main category: cs.CV

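TL;DR: HyperClick proposes a framework that improves the reliability of GUI grounding models through uncertainty calibration, combining a binary reward with Gaussian spatial confidence modeling to optimize accuracy while reducing overconfidence.

Details Motivation: Current GUI grounding models lack awareness of their own capability boundaries, leading to overconfident and unreliable predictions, which is especially critical in dynamic GUI automation tasks.

Contribution: 1) Proposes the HyperClick framework, which improves the reliability of GUI grounding models through uncertainty calibration; 2) Introduces a dual reward mechanism that combines a binary reward with Gaussian spatial confidence modeling; 3) Achieves SOTA performance on seven challenging benchmarks.

Method: 1) Calibrates confidence using the Brier score; 2) Combines a binary reward with Gaussian spatial confidence modeling; 3) Jointly optimizes grounding accuracy and confidence reliability.

Result: On seven benchmarks, HyperClick outperforms existing methods while providing well-calibrated confidence.

Insight: Explicit confidence calibration and introspective self-criticism effectively reduce model overconfidence and make GUI automation more reliable.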

Abstract: Autonomous Graphical User Interface (GUI) agents rely on accurate GUI grounding, which maps language instructions to on-screen coordinates, to execute user commands. However, current models, whether trained via supervised fine-tuning (SFT) or reinforcement fine-tuning (RFT), lack self-awareness of their capability boundaries, leading to overconfidence and unreliable predictions. We first systematically evaluate probabilistic and verbalized confidence in general and GUI-specific models, revealing a misalignment between confidence and actual accuracy, which is particularly critical in dynamic GUI automation tasks, where single errors can cause task failure. To address this, we propose HyperClick, a novel framework that enhances reliable GUI grounding through uncertainty calibration. HyperClick introduces a dual reward mechanism, combining a binary reward for correct actions with a truncated Gaussian-based spatial confidence modeling, calibrated using the Brier score. This approach jointly optimizes grounding accuracy and confidence reliability, fostering introspective self-criticism. Extensive experiments on seven challenge benchmarks show that HyperClick achieves state-of-the-art performance while providing well-calibrated confidence. By enabling explicit confidence calibration and introspective self-criticism, HyperClick reduces overconfidence and supports more reliable GUI automation.
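The two ingredients named in the abstract can be sketched as follows. The Brier score is the standard mean squared gap between stated confidence and the 0/1 outcome. The spatial confidence function is only an assumed form: a Gaussian centred on the target box, truncated to zero outside it, with a hypothetical `sigma_scale` parameter; the paper's actual reward shaping may differ.

```python
import math


def brier_score(confidences, outcomes):
    """Mean squared difference between confidence and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


def gaussian_confidence(click, box, sigma_scale=0.5):
    """Assumed truncated-Gaussian confidence target for a click.

    box is (x0, y0, x1, y1) with positive width and height. Clicks outside
    the box get 0.0 (the truncation); inside, confidence decays with distance
    from the box centre.
    """
    x0, y0, x1, y1 = box
    x, y = click
    if not (x0 <= x <= x1 and y0 <= y <= y1):
        return 0.0
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    sx, sy = sigma_scale * (x1 - x0), sigma_scale * (y1 - y0)
    return math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
```

A model that reports 0.9 confidence on every click but is right only half the time scores 0.41 here, which is the kind of overconfidence a calibration reward would penalize.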

[50] FOCUS: Efficient Keyframe Selection for Long Video Understanding

Zirui Zhu,Hailun Xu,Yang Luo,Yong Liu,Kanchan Sarkar,Zhenheng Yang,Yang You

Main category: cs.CV

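TL;DR: FOCUS is a training-free, model-agnostic keyframe selection module that formulates keyframe selection as a multi-armed bandit problem to enable efficient video understanding.

Details Motivation: Conventional keyframe selection methods rely on pre-filtering or retrieval-style scoring, which can lose information or be inefficient. FOCUS aims to select the most relevant and informative frames under a strict token budget.

Contribution: Proposes FOCUS, a keyframe selection method based on combinatorial pure exploration that identifies high-value temporal regions and selects informative frames.

Method: Formulates keyframe selection as a multi-armed bandit problem and selects keyframes in two stages using empirical means and Bernstein confidence radii.

Result: On two long-video question-answering benchmarks, FOCUS processes less than 2% of video frames yet substantially improves accuracy (an 11.9% gain on LongVideoBench for videos longer than 20 minutes).

Insight: FOCUS offers a simple, general way to scale the long-video understanding of MLLMs while avoiding the limitations of conventional methods.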

Abstract: Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radii to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure is derived from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.
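As an illustration of how an empirical mean and a Bernstein-style radius combine into an optimistic score per clip, the sketch below uses one common empirical-Bernstein form, sqrt(2 V ln(3/δ) / n) + 3 b ln(3/δ) / n, where V is the empirical variance and b the score range. The exact constants FOCUS uses are not stated in the abstract, so this form, the default `delta`, and the helper names are assumptions.

```python
import math


def bernstein_radius(samples, delta=0.05, value_range=1.0):
    """Empirical-Bernstein confidence radius for scores in [0, value_range]."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * variance * log_term / n) + 3.0 * value_range * log_term / n


def pick_clip(scores_per_clip, delta=0.05):
    """Return the index of the clip (arm) with the highest upper confidence bound."""
    def upper_bound(samples):
        return sum(samples) / len(samples) + bernstein_radius(samples, delta)

    return max(range(len(scores_per_clip)), key=lambda i: upper_bound(scores_per_clip[i]))
```

A clip that has been sampled only a few times has a wide radius and can therefore outrank a well-explored clip with a higher mean, which is how exploration of uncertain regions is preserved.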

[51] Rethinking Robust Adversarial Concept Erasure in Diffusion Models

Qinghong Yin,Yu Tian,Yue Zhang

Main category: cs.CV

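TL;DR: The paper proposes S-GRACE, a method that uses semantic guidance within the concept space to generate adversarial samples, substantially improving the erasure of undesirable concepts in diffusion models while preserving non-target concepts and reducing training time.

Details Motivation: Existing adversarial training methods for erasing undesirable concepts in diffusion models ignore the semantic specificity of the concept space, leading to poor erasure or interference with other concepts.

Contribution: Proposes S-GRACE, which introduces semantic guidance into concept erasure for the first time, improving how well adversarial samples cover the target concept and substantially improving both erasure and the preservation of non-target concepts.

Method: S-GRACE generates adversarial samples under semantic guidance and performs adversarial training in the concept space, achieving more comprehensive concept erasure.

Result: Compared with seven existing methods, S-GRACE improves erasure performance by 26%, reduces training time by 90%, and better preserves non-target concepts.

Insight: Semantic guidance is crucial in concept erasure: it helps adversarial samples cover the target concept space more precisely and avoids interfering with other concepts.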

Abstract: Concept erasure aims to selectively unlearn undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. As a novel paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, they will disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which leverages semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance by 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.

[52] Versatile and Efficient Medical Image Super-Resolution Via Frequency-Gated Mamba

Wenfeng Huang,Xiangyun Liao,Wei Cao,Wenjing Jia,Weixin Si

Main category: cs.CV

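TL;DR: The paper proposes FGMamba, a frequency-aware gated state-space model that combines global dependency modeling with fine-grained frequency detail enhancement for efficient, lightweight medical image super-resolution.

Details Motivation: Medical image super-resolution is essential for improving diagnostic accuracy, but existing methods still struggle to efficiently model both long-range anatomical structures and fine-grained frequency details.

Contribution: 1. Proposes the GASM module, which integrates state-space modeling with dual-branch spatial and channel attention; 2. Designs the PFFM module, which captures multi-resolution high-frequency details via FFT-guided fusion.

Method: FGMamba combines a Gated Attention-enhanced State-Space Module (GASM) and a Pyramid Frequency Fusion Module (PFFM) to achieve efficient global modeling and multi-resolution high-frequency detail enhancement.

Result: Across five medical imaging modalities (ultrasound, OCT, MRI, CT, and endoscopy), FGMamba outperforms CNN-based and Transformer-based SOTA methods in PSNR/SSIM while keeping a lightweight parameter count (<0.75M).

Insight: Frequency-aware state-space modeling captures both global dependencies and local details, offering a scalable and efficient solution for medical image enhancement.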

Abstract: Medical image super-resolution (SR) is essential for enhancing diagnostic accuracy while reducing acquisition cost and scanning time. However, modeling both long-range anatomical structures and fine-grained frequency details with low computational overhead remains challenging. We propose FGMamba, a novel frequency-aware gated state-space model that unifies global dependency modeling and fine-detail enhancement into a lightweight architecture. Our method introduces two key innovations: a Gated Attention-enhanced State-Space Module (GASM) that integrates efficient state-space modeling with dual-branch spatial and channel attention, and a Pyramid Frequency Fusion Module (PFFM) that captures high-frequency details across multiple resolutions via FFT-guided fusion. Extensive evaluations across five medical imaging modalities (Ultrasound, OCT, MRI, CT, and Endoscopic) demonstrate that FGMamba achieves superior PSNR/SSIM while maintaining a compact parameter footprint (<0.75M), outperforming CNN-based and Transformer-based SOTAs. Our results validate the effectiveness of frequency-aware state-space modeling for scalable and accurate medical image enhancement.

[53] CASR-Net: An Image Processing-focused Deep Learning-based Coronary Artery Segmentation and Refinement Network for X-ray Coronary Angiogram

Alvee Hassan,Rusab Sarmun,Muhammad E. H. Chowdhury,M. Murugappan,Md. Sakib Abrar Hossain,Sakib Mahmud,Abdulrahman Alqahtani,Sohaib Bassam Zoghoul,Amith Khandakar,Susu M. Zughaier,Somaya Al-Maadeed,Anwarul Hasan

Main category: cs.CV

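TL;DR: CASR-Net is a three-stage deep learning network for segmenting X-ray coronary angiograms; its preprocessing, segmentation, and refinement modules significantly improve coronary artery segmentation accuracy.

Details Motivation: Early detection of coronary artery disease (CAD) is critical for reducing mortality and improving treatment planning, but poor X-ray image quality can impede diagnosis. CASR-Net aims to automate coronary artery segmentation as a clinical support tool.

Contribution: Proposes a three-stage pipeline (preprocessing, segmentation, and refinement) that combines a preprocessing strategy using CLAHE and an improved Ben Graham method with a segmentation network based on UNet and Self-ONN, significantly improving segmentation performance.

Method: 1. Multichannel preprocessing (CLAHE + improved Ben Graham method); 2. Segmentation network (UNet + DenseNet121 encoder + Self-ONN decoder); 3. Contour refinement module to suppress false positives.

Result: On a combination of two public datasets, CASR-Net achieves an IoU of 61.43%, a DSC of 76.10%, and a clDice of 79.36%, outperforming other state-of-the-art models.

Insight: Through its multi-stage design and Self-ONN decoder, CASR-Net effectively addresses the continuity of narrow vessel branches, providing a high-accuracy solution for clinical image segmentation.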

Abstract: Early detection of coronary artery disease (CAD) is critical for reducing mortality and improving patient treatment planning. While angiographic image analysis from X-rays is a common and cost-effective method for identifying cardiac abnormalities, including stenotic coronary arteries, poor image quality can significantly impede clinical diagnosis. We present the Coronary Artery Segmentation and Refinement Network (CASR-Net), a three-stage pipeline comprising image preprocessing, segmentation, and refinement. A novel multichannel preprocessing strategy combining CLAHE and an improved Ben Graham method provides incremental gains, increasing Dice Score Coefficient (DSC) by 0.31-0.89% and Intersection over Union (IoU) by 0.40-1.16% compared with using the techniques individually. The core innovation is a segmentation network built on a UNet with a DenseNet121 encoder and a Self-organized Operational Neural Network (Self-ONN) based decoder, which preserves the continuity of narrow and stenotic vessel branches. A final contour refinement module further suppresses false positives. Evaluated with 5-fold cross-validation on a combination of two public datasets that contain both healthy and stenotic arteries, CASR-Net outperformed several state-of-the-art models, achieving an IoU of 61.43%, a DSC of 76.10%, and clDice of 79.36%. These results highlight a robust approach to automated coronary artery segmentation, offering a valuable tool to support clinicians in diagnosis and treatment planning.
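For readers comparing the reported DSC and IoU figures, the two overlap metrics can be computed from binary masks as in the sketch below. It uses flattened 0/1 lists for simplicity, and the convention of returning 1.0 when both masks are empty is an assumption; the paper's evaluation code may differ.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two flattened binary masks (lists of 0/1)."""
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - intersection
    # Convention (assumed): two empty masks count as a perfect match.
    dice = 2 * intersection / (pred_sum + target_sum) if pred_sum + target_sum else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou
```

On a single mask the two are tied by DSC = 2·IoU / (1 + IoU), so an IoU of 61.43% would correspond to a DSC of about 76.11%. The reported 76.10% is close to that, though values averaged over images and folds need not satisfy the identity exactly.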

[54] SAGS: Self-Adaptive Alias-Free Gaussian Splatting for Dynamic Surgical Endoscopic Reconstruction

Wenfeng Huang,Xiangyun Liao,Yinling Qian,Hao Liu,Yongming Yang,Wenjing Jia,Qiong Wang

Main category: cs.CV

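TL;DR: SAGS is a self-adaptive alias-free Gaussian splatting framework for dynamic surgical endoscopic reconstruction. It introduces an attention-driven, dynamically weighted 4D deformation decoder and combines 3D smoothing filters with 2D Mip filters to significantly reduce artifacts and aliasing caused by tissue motion.

Details Motivation: Reconstructing dynamic tissue from endoscopic video is a key surgical technology, but existing methods handle motion-induced artifacts and aliasing poorly; SAGS aims to address these issues.

Contribution: Proposes the SAGS framework, which pairs an attention-driven, dynamically weighted 4D deformation decoder with 3D smoothing and 2D Mip filters to improve dynamic tissue reconstruction quality.

Method: Builds on 3D Gaussian Splatting, adding a dynamically weighted 4D deformation decoder together with 3D smoothing filters and 2D Mip filters to reduce artifacts and capture tissue detail.

Result: On two public benchmarks, EndoNeRF and SCARED, SAGS outperforms existing methods on PSNR, SSIM, and LPIPS and delivers better visualization quality.

Insight: The key to dynamic tissue reconstruction is reducing motion artifacts and aliasing, which SAGS achieves by combining an attention mechanism with filters.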

Abstract: Surgical reconstruction of dynamic tissues from endoscopic videos is a crucial technology in robot-assisted surgery. The development of Neural Radiance Fields (NeRFs) has greatly advanced deformable tissue reconstruction, achieving high-quality results from video and image sequences. However, reconstructing deformable endoscopic scenes remains challenging due to aliasing and artifacts caused by tissue movement, which can significantly degrade visualization quality. The introduction of 3D Gaussian Splatting (3DGS) has improved reconstruction efficiency by enabling a faster rendering pipeline. Nevertheless, existing 3DGS methods often prioritize rendering speed while neglecting these critical issues. To address these challenges, we propose SAGS, a self-adaptive alias-free Gaussian splatting framework. We introduce an attention-driven, dynamically weighted 4D deformation decoder, leveraging 3D smoothing filters and 2D Mip filters to mitigate artifacts in deformable tissue reconstruction and better capture the fine details of tissue movement. Experimental results on two public benchmarks, EndoNeRF and SCARED, demonstrate that our method achieves superior performance in all metrics of PSNR, SSIM, and LPIPS compared to the state of the art while also delivering better visualization quality.

[55] Generative Semantic Coding for Ultra-Low Bitrate Visual Communication and Analysis

Weiming Chen,Yijia Wang,Zhihan Zhu,Zhihai He

Main category: cs.CV

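TL;DR: This paper proposes an ultra-low bitrate visual communication method that combines generative models with deep image compression, using joint text and coding latents to guide a rectified flow model for precise visual scene reconstruction.

Details Motivation: In challenging scenarios (e.g., deep space exploration, battlefield intelligence) bandwidth is extremely low, and existing text-to-image generation models achieve only a semantic-level approximation of the visual scene, which is insufficient for visual communication and remote analysis.

Contribution: Proposes a method that seamlessly integrates image generation with deep compression, using text and coding latents to guide a rectified flow model for high-quality reconstruction and analysis.

Method: Joint text and coding latents guide a rectified flow model to generate the visual scene; the semantic text and coding latents are encoded and transmitted at a very low bitrate.

Result: Experiments show that the method matches the reconstruction quality and analysis accuracy of existing methods while using much less bandwidth.

Insight: Combining generative models with deep compression offers a new approach to ultra-low bitrate visual communication, with potential applications in bandwidth-constrained scenarios.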

Abstract: We consider the problem of ultra-low bit rate visual communication for remote vision analysis, human interactions and control in challenging scenarios with very low communication bandwidth, such as deep space exploration, battlefield intelligence, and robot navigation in complex environments. In this paper, we ask the following important question: can we accurately reconstruct the visual scene using only a very small portion of the bit rate in existing coding methods while not sacrificing the accuracy of vision analysis and performance of human interactions? Existing text-to-image generation models offer a new approach for ultra-low bitrate image description. However, they can only achieve a semantic-level approximation of the visual scene, which is far from sufficient for visual communication, remote vision analysis, and human interactions. To address this important issue, we propose to seamlessly integrate image generation with deep image compression, using joint text and coding latent to guide the rectified flow models for precise generation of the visual scene. The semantic text description and coding latent are both encoded and transmitted to the decoder at a very small bit rate. Experimental results demonstrate that our method can achieve the same image reconstruction quality and vision analysis accuracy as existing methods while using much less bandwidth. The code will be released upon paper acceptance.

[56] MeisenMeister: A Simple Two Stage Pipeline for Breast Cancer Classification on MRI

Benjamin Hamm,Yannick Kirchhoff,Maximilian Rokuss,Klaus Maier-Hein

Main category: cs.CV

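TL;DR: The paper presents MeisenMeister, a simple two-stage pipeline for breast cancer classification on MRI, aimed at improving the efficiency and accuracy of early breast cancer detection.

Details Motivation: Breast cancer classification on MRI is challenging because high-quality segmentation labels are scarce, so robust classification-based methods are needed to support early detection and large-scale screening.

Contribution: Proposes a two-stage breast cancer MRI classification pipeline, releases the full implementation publicly, and emphasizes performance, robustness, and clinical relevance.

Method: Uses a two-stage pipeline refined through experimentation, evaluation, and iterative optimization.

Result: No specific performance figures are given, but the robustness and clinical applicability of the method are emphasized.

Insight: Developing classification methods reduces dependence on high-quality segmentation labels, offering a new direction for breast MRI screening.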

Abstract: The ODELIA Breast MRI Challenge 2025 addresses a critical issue in breast cancer screening: improving early detection through more efficient and accurate interpretation of breast MRI scans. Even though methods for general-purpose whole-body lesion segmentation as well as multi-time-point analysis exist, breast cancer detection remains highly challenging, largely due to the limited availability of high-quality segmentation labels. Therefore, developing robust classification-based approaches is crucial for the future of early breast cancer detection, particularly in applications such as large-scale screening. In this write-up, we provide a comprehensive overview of our approach to the challenge. We begin by detailing the underlying concept and foundational assumptions that guided our work. We then describe the iterative development process, highlighting the key stages of experimentation, evaluation, and refinement that shaped the evolution of our solution. Finally, we present the reasoning and evidence that informed the design choices behind our final submission, with a focus on performance, robustness, and clinical relevance. We release our full implementation publicly at https://github.com/MIC-DKFZ/MeisenMeister

[57] Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing

Yijia Wang,Yiqing Shen,Weiming Chen,Zhihai He

Main category: cs.CV

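TL;DR: This paper proposes CIELR, a method that uses LLM reasoning to decompose complex image editing instructions into simple, explicit actions, avoiding the high cost of jointly fine-tuning LLMs and diffusion models.

Details Motivation: Existing image editing methods must jointly fine-tune LLMs and diffusion models to handle complex instructions, incurring high computational and training costs, so a more efficient method is needed.

Contribution: Proposes CIELR, which performs complex image editing through a structured semantic representation and an iterative update mechanism without joint fine-tuning. Also builds the CIEBench benchmark dataset and a dedicated evaluation metric.

Method: Uses foundation models to construct a structured semantic representation of the input image and refines it through an iterative update mechanism, yielding a fine-grained visual representation that supports complex editing tasks.

Result: Improves PSNR by 9.955 dB over the previous state of the art on the SmartEdit Reasoning Scenario Set and also outperforms previous methods on the self-built CIEBench benchmark.

Insight: Decomposing complex instructions and refining the visual representation enables complex edits to be completed efficiently while avoiding costly joint fine-tuning.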

Abstract: Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs), which involves very high computational complexity and training cost. To address this issue, we propose a new method, called Complex Image Editing via LLM Reasoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions, eliminating the need for jointly fine-tuning the large language models and diffusion models. Specifically, we first construct a structured semantic representation of the input image using foundation models. Then, we introduce an iterative update mechanism that can progressively refine this representation, obtaining a fine-grained visual representation of the image scene. This allows us to perform complex and flexible image editing tasks. Extensive experiments on the SmartEdit Reasoning Scenario Set show that our method surpasses the previous state-of-the-art by 9.955 dB in PSNR, indicating its superior preservation of regions that should remain consistent. Due to the limited number of samples of public datasets of complex image editing with reasoning, we construct a benchmark named CIEBench, containing 86 image samples, together with a metric specifically for reasoning-based image editing. CIELR also outperforms previous methods on this benchmark. The code and dataset are available at https://github.com/Jia-shao/Reasoning-Editing.
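To put the 9.955 dB figure in perspective, the short sketch below uses the standard definition PSNR = 10·log10(MAX² / MSE) and converts a PSNR gain into the implied ratio of mean squared errors. The 8-bit peak value of 255 and the helper names are assumptions for illustration; the abstract does not state how PSNR was computed.

```python
import math


def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a PSNR gain of gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)
```

Under this definition, a gain of 9.955 dB means the mean squared error in the compared regions is roughly 9.9 times smaller, independent of the peak value chosen.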

[58] RzenEmbed: Towards Comprehensive Multimodal Retrieval

Weijian Jian,Yajun Zhang,Dawei Liang,Chunyu Xie,Yixiao He,Dawei Leng,Yuhui Yin

Main category: cs.CV

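TL;DR: RzenEmbed is a unified multimodal retrieval framework supporting text, images, videos, and visual documents; a novel two-stage training strategy and an improved InfoNCE loss significantly improve multimodal retrieval performance.

Details Motivation: Existing CLIP-based frameworks focus mainly on natural images and offer limited support for other important visual modalities such as videos and visual documents. RzenEmbed aims to fill this gap and provide comprehensive multimodal retrieval.

Contribution: 1. Introduces RzenEmbed, a unified embedding learning framework covering multiple modalities; 2. Proposes a two-stage training strategy and an improved InfoNCE loss (with a hardness-weighting mechanism and false-negative mitigation); 3. Sets a new state of the art on the MMEB benchmark.

Method: 1. Two-stage training: the first stage focuses on foundational text and multimodal retrieval, and the second introduces the improved InfoNCE loss; 2. The improved loss includes hardness weighting and false-negative mitigation; 3. A learnable temperature parameter and model souping further boost performance.

Result: RzenEmbed achieves the best overall score on the MMEB benchmark and, in particular, outperforms prior work on video and visual document retrieval.

Insight: 1. Multimodal retrieval needs a unified framework that supports multiple modalities; 2. Hardness weighting and false-negative mitigation are crucial to the model's discriminative power; 3. A two-stage training strategy combined with an optimized loss can significantly improve performance.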

Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has extended CLIP-based frameworks to produce powerful, universal embeddings for retrieval tasks. However, existing methods primarily focus on natural images, offering limited support for other crucial visual modalities such as videos and visual documents. To bridge this gap, we introduce RzenEmbed, a unified framework to learn embeddings across a diverse set of modalities, including text, images, videos, and visual documents. We employ a novel two-stage training strategy to learn discriminative representations. The first stage focuses on foundational text and multimodal retrieval. In the second stage, we introduce an improved InfoNCE loss, incorporating two key enhancements. Firstly, a hardness-weighted mechanism guides the model to prioritize challenging samples by assigning them higher weights within each batch. Secondly, we implement an approach to mitigate the impact of false negatives and alleviate data noise. This strategy not only enhances the model’s discriminative power but also improves its instruction-following capabilities. We further boost performance with a learnable temperature parameter and model souping. RzenEmbed sets a new state-of-the-art on the MMEB benchmark. It not only achieves the best overall score but also outperforms all prior work on the challenging video and visual document retrieval tasks. Our models are available at https://huggingface.co/qihoo360/RzenEmbed.
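The two loss enhancements can be illustrated for a single query as follows. This is a sketch under stated assumptions: negatives whose similarity exceeds the positive's by more than a margin are treated as likely false negatives and dropped, and the remaining negatives are weighted by exp(beta · similarity), normalized to mean 1. The function name and the parameters `beta` and `fn_margin` are hypothetical; the abstract does not give the actual weighting or filtering rule.

```python
import math


def weighted_info_nce(pos_sim, neg_sims, temperature=0.05, beta=1.0, fn_margin=0.1):
    """InfoNCE for one query with hardness weights and false-negative masking."""
    # Drop suspected false negatives: negatives scoring well above the positive.
    kept = [s for s in neg_sims if s <= pos_sim + fn_margin]
    if kept:
        raw = [math.exp(beta * s) for s in kept]
        mean_raw = sum(raw) / len(raw)
        weights = [w / mean_raw for w in raw]  # harder negatives get weight > 1
    else:
        weights = []
    pos_term = math.exp(pos_sim / temperature)
    neg_term = sum(w * math.exp(s / temperature) for w, s in zip(weights, kept))
    return -math.log(pos_term / (pos_term + neg_term))
```

Setting `beta=0` recovers the plain InfoNCE loss, which makes it easy to check how much the hardness weighting changes the gradient signal on a given batch.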

[59] Fine-Tuning Open Video Generators for Cinematic Scene Synthesis: A Small-Data Pipeline with LoRA and Wan2.1 I2V

Meftun Akarsu,Kerem Catay,Sedat Bin Vedat,Enes Kutay Yarkan,Ilke Senturk,Arda Sar,Dafne Eksioglu

Main category: cs.CV

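TL;DR: The paper proposes a two-stage fine-tuning pipeline based on LoRA and Wan2.1 I2V for generating cinematic video scenes from small datasets, decoupling visual style learning from motion generation.

Details Motivation: Film and television production needs high-quality visual content generation, but conventional methods depend on large datasets and adapt poorly to specific styles. The study aims to achieve efficient visual style transfer and motion generation from small datasets.

Contribution: The main contributions are: 1) a two-stage fine-tuning pipeline that decouples visual style learning from motion generation; 2) efficient adaptation to small datasets via LoRA modules; 3) fast style transfer on a single GPU; 4) strategies for accelerating inference.

Method: The method has two stages: 1) fine-tune the cross-attention layers of Wan2.1 I2V-14B with LoRA to adapt to a specific visual style; 2) generate stylistically consistent keyframes and expand them into coherent sequences through the video decoder. Lightweight parallelization and sequence partitioning accelerate inference.

Result: Evaluations using FVD, CLIP-SIM, and LPIPS, together with an expert study, show improved cinematic fidelity and temporal stability over the base model.

Insight: The study shows the potential of efficient fine-tuning techniques such as LoRA for cinematic video generation from small datasets, providing a practical tool for real production.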

Abstract: We present a practical pipeline for fine-tuning open-source video diffusion transformers to synthesize cinematic scenes for television and film production from small datasets. The proposed two-stage process decouples visual style learning from motion generation. In the first stage, Low-Rank Adaptation (LoRA) modules are integrated into the cross-attention layers of the Wan2.1 I2V-14B model to adapt its visual representations using a compact dataset of short clips from Ay Yapim’s historical television film El Turco. This enables efficient domain transfer within hours on a single GPU. In the second stage, the fine-tuned model produces stylistically consistent keyframes that preserve costume, lighting, and color grading, which are then temporally expanded into coherent 720p sequences through the model’s video decoder. We further apply lightweight parallelization and sequence partitioning strategies to accelerate inference without quality degradation. Quantitative and qualitative evaluations using FVD, CLIP-SIM, and LPIPS metrics, supported by a small expert user study, demonstrate measurable improvements in cinematic fidelity and temporal stability over the base model. The complete training and inference pipeline is released to support reproducibility and adaptation across cinematic domains.
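For readers unfamiliar with Low-Rank Adaptation, the sketch below shows the standard LoRA update on a single weight matrix: the frozen weight W is augmented by a scaled low-rank product, W' = W + (alpha / r) · B · A. It uses plain Python lists and toy dimensions; the rank, `alpha`, and which layers are adapted in the paper's pipeline are not specified here, so the values are purely illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_merge(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def trainable_params(d_out, d_in, r):
    """Parameter counts for full fine-tuning versus a rank-r LoRA adapter."""
    return d_out * d_in, r * (d_in + d_out)
```

Because only A and B are trained, the adapter for a 4096 x 4096 projection at rank 8 has 65,536 parameters instead of about 16.8 million, which is what makes single-GPU adaptation on a small dataset practical.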

[60] Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

Wu Wei,Xiaomeng Fan,Yuwei Wu,Zhi Gao,Pengxiang Li,Yunde Jia,Mehrtash Harandi

Main category: cs.CV

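TL;DR: The paper proposes Alignment across Trees, a modality alignment method that builds tree-like hierarchical features and embeds them in hyperbolic manifolds to align image and text modalities effectively, performing well on few-shot and cross-domain tasks.

Details Motivation: Existing vision-language models (VLMs) typically extract hierarchical features from text but only a single feature per image, leading to asymmetric and suboptimal alignment. The authors therefore align tree-like hierarchical features of both modalities.

Contribution: 1. Proposes a tree-like hierarchical feature alignment method; 2. Introduces a semantic-aware visual feature extraction framework; 3. Designs an alignment mechanism for heterogeneous hyperbolic manifolds.

Method: 1. Extracts multi-granularity visual features with a cross-attention mechanism; 2. Embeds the feature trees into hyperbolic manifolds with distinct curvatures; 3. Learns an intermediate manifold that aligns the heterogeneous manifolds by minimizing a KL distance.

Result: On tasks across multiple image datasets, the method consistently outperforms strong baselines in few-shot and cross-domain settings.

Insight: Aligning heterogeneous hyperbolic manifolds is key to resolving modality asymmetry, and the proven existence and uniqueness of the optimal intermediate manifold provide theoretical support.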

Abstract: Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
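To show why curvature matters when the two modalities live on different hyperbolic manifolds, the sketch below computes geodesic distance in a Poincaré ball of curvature -c. The abstract does not say which hyperbolic model the authors use, so the Poincaré ball and the function name `poincare_distance` are assumptions made for illustration.

```python
import math


def poincare_distance(x, y, c=1.0):
    """Geodesic distance in the Poincare ball of curvature -c (c > 0).

    Points must satisfy c * ||x||^2 < 1, i.e. lie inside the ball of
    radius 1 / sqrt(c).
    """
    sq_norm_x = sum(v * v for v in x)
    sq_norm_y = sum(v * v for v in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    arg = 1.0 + 2.0 * c * sq_diff / ((1.0 - c * sq_norm_x) * (1.0 - c * sq_norm_y))
    return math.acosh(arg) / math.sqrt(c)
```

The same pair of coordinates yields different distances under different values of `c`, so features embedded with distinct curvatures cannot be compared directly; that mismatch is what an intermediate manifold is meant to reconcile.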

[61] A Hybrid Deep Learning and Forensic Approach for Robust Deepfake Detection

Sales Aribe Jr

Main category: cs.CV

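TL;DR: The paper proposes a hybrid approach combining deep learning and forensic analysis for robust deepfake detection, significantly improving detection performance and transparency.

Details Motivation: With rapid progress in GANs and diffusion models, synthetic media has become increasingly realistic, raising societal concerns about misinformation and digital trust. Existing methods either generalize poorly or are of limited effectiveness against new manipulation techniques.

Contribution: Proposes a hybrid framework that fuses forensic features (e.g., noise residuals, JPEG compression traces) with deep learning representations (CNNs and ViTs) and outperforms existing methods on multiple benchmark datasets.

Method: Combines forensic and deep learning features, extracting noise residuals, frequency-domain descriptors, and related cues, and improves interpretability with Grad-CAM and forensic heatmaps.

Result: Achieves F1-scores of 0.96, 0.82, and 0.77 on FaceForensics++, Celeb-DF v2, and DFDC, respectively, with stable performance under compression and adversarial perturbations.

Insight: The hybrid approach combines the adaptability of deep learning with the interpretability of forensic analysis and is an effective route to robust, trustworthy deepfake detection systems.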

Abstract: The rapid evolution of generative adversarial networks (GANs) and diffusion models has made synthetic media increasingly realistic, raising societal concerns around misinformation, identity fraud, and digital trust. Existing deepfake detection methods either rely on deep learning, which suffers from poor generalization and vulnerability to distortions, or forensic analysis, which is interpretable but limited against new manipulation techniques. This study proposes a hybrid framework that fuses forensic features, including noise residuals, JPEG compression traces, and frequency-domain descriptors, with deep learning representations from convolutional neural networks (CNNs) and vision transformers (ViTs). Evaluated on benchmark datasets (FaceForensics++, Celeb-DF v2, DFDC), the proposed model consistently outperformed single-method baselines and demonstrated superior performance compared to existing state-of-the-art hybrid approaches, achieving F1-scores of 0.96, 0.82, and 0.77, respectively. Robustness tests demonstrated stable performance under compression (F1 = 0.87 at QF = 50), adversarial perturbations (AUC = 0.84), and unseen manipulations (F1 = 0.79). Importantly, explainability analysis showed that Grad-CAM and forensic heatmaps overlapped with ground-truth manipulated regions in 82 percent of cases, enhancing transparency and user trust. These findings confirm that hybrid approaches provide a balanced solution, combining the adaptability of deep models with the interpretability of forensic cues, to develop resilient and trustworthy deepfake detection systems.

[62] Mitigating Semantic Collapse in Partially Relevant Video Retrieval

WonJun Moon,MinSeok Jung,Gilhan Park,Tae-Young Kim,Cheol-Ho Cho,Woojin Jun,Jae-Pil Heo

Main category: cs.CV

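TL;DR: The paper addresses semantic collapse in Partially Relevant Video Retrieval (PRVR), proposing Text Correlation Preservation Learning and Cross-Branch Video Alignment, which substantially improve retrieval performance.

Details Motivation: Existing PRVR methods treat annotated text-video pairs as positives and all others as negatives, ignoring semantic variation within and across videos and causing semantic collapse.

Contribution: 1. Proposes Text Correlation Preservation Learning, which preserves the foundation model's semantic relationships among text queries. 2. Proposes Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations. 3. Introduces order-preserving token merging and adaptive CBVA to make video segments internally coherent yet mutually distinctive.

Method: 1. Text Correlation Preservation Learning preserves semantic relationships among text queries. 2. CBVA aligns multi-scale video representations via contrastive learning. 3. Order-preserving token merging and adaptive CBVA refine video segment generation.

Result: On PRVR benchmarks, the method effectively prevents semantic collapse and substantially improves retrieval accuracy.

Insight: Addressing semantic collapse in both the text and video embedding spaces makes it possible to better handle complex videos containing multiple events.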

Abstract: Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all others as negatives, ignoring the rich semantic variation both within a single video and across different videos. Consequently, embeddings of both queries and their corresponding video-clip segments for distinct events within the same video collapse together, while embeddings of semantically similar queries and segments from different videos are driven apart. This limits retrieval performance when videos contain multiple, diverse events. This paper addresses the aforementioned problems, termed as semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships encoded by the foundation model across text queries. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. Subsequently, we introduce order-preserving token merging and adaptive CBVA to enhance alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.

[63] CoMViT: An Efficient Vision Backbone for Supervised Classification in Medical Imaging

Aon Safdar,Mohamed Saadeldin

Main category: cs.CV

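TL;DR: CoMViT is an efficient, lightweight Vision Transformer architecture designed for resource-constrained medical image classification; structural optimization and new modules substantially cut computational overhead while maintaining high accuracy and interpretability.

Details Motivation: Vision Transformers are currently limited in medical image analysis by their high computational cost and tendency to overfit on small datasets; CoMViT aims to address these issues.

Contribution: Proposes the CoMViT architecture, which integrates a convolutional tokenizer, diagonal masking, dynamic temperature scaling, and pooling-based sequence aggregation, significantly reducing parameter count and computational requirements.

Method: Through architectural optimization, combines convolutional and Transformer modules and introduces dynamic temperature scaling and pooling-based sequence aggregation to improve generalization and efficiency.

Result: Performs strongly on twelve MedMNIST datasets with only ~4.5M parameters, 5-20x fewer than deeper CNN and ViT variants, without sacrificing accuracy. Grad-CAM analyses show that it attends to clinically relevant regions.

Insight: With systematic optimization, a lightweight ViT architecture can be both efficient and interpretable on medical imaging tasks, offering a viable solution for resource-constrained settings.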

Abstract: Vision Transformers (ViTs) have demonstrated strong potential in medical imaging; however, their high computational demands and tendency to overfit on small datasets limit their applicability in real-world clinical scenarios. In this paper, we present CoMViT, a compact and generalizable Vision Transformer architecture optimized for resource-constrained medical image analysis. CoMViT integrates a convolutional tokenizer, diagonal masking, dynamic temperature scaling, and pooling-based sequence aggregation to improve performance and generalization. Through systematic architectural optimization, CoMViT achieves robust performance across twelve MedMNIST datasets while maintaining a lightweight design with only ~4.5M parameters. It matches or outperforms deeper CNN and ViT variants, offering up to 5-20x parameter reduction without sacrificing accuracy. Qualitative Grad-CAM analyses show that CoMViT consistently attends to clinically relevant regions despite its compact size. These results highlight the potential of principled ViT redesign for developing efficient and interpretable models in low-resource medical imaging settings.
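Two of the listed components, diagonal masking and temperature scaling, can be sketched on a matrix of raw attention scores. The version below assumes diagonal masking means excluding each token's attention to itself and that the temperature is a single scalar dividing the logits; in a trained model the temperature would be learned. The abstract does not confirm these details, so treat the function as a hypothetical illustration.

```python
import math


def masked_attention_weights(scores, temperature=1.0):
    """Row-wise softmax over scores / temperature with the diagonal masked out.

    scores is an n x n list of lists with n >= 2.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Self-attention entries are set to -inf so they receive zero weight.
        logits = [scores[i][j] / temperature if i != j else float("-inf")
                  for j in range(n)]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

Lowering the temperature sharpens each row toward its highest-scoring neighbour, while masking the diagonal forces every token to draw on other tokens, a combination that has been used to help small-data ViTs attend less diffusely.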

[64] From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration

Jianwen Sun,Fanrui Zhang,Yukang Feng,Chuanhao Li,Zizhen Li,Jiaxin Ai,Yifan Chang,Yu Dai,Kaipeng Zhang

Main category: cs.CV

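TL;DR: The paper presents VisPainter, a scientific illustration tool built on a multi-agent framework. It addresses the shortcomings of existing generative models in semantic structure and editability, and is validated with the VisBench evaluation framework.

Details Motivation: Scientific illustrations require both high information density and editability, but existing generative models (image-based or code-based) cannot satisfy both requirements.

Contribution: 1. Proposes the VisPainter multi-agent framework, enabling element-level control and interactive editing; 2. Proposes the VisBench evaluation framework, which quantifies scientific illustration quality along multiple dimensions; 3. Quantifies the effect of role division, step control, and related factors on illustration quality.

Method: VisPainter generates vector graphics through three cooperating modules (Manager, Designer, and Toolbox); VisBench evaluates illustration quality with seven-dimensional metrics.

Result: Experiments support the soundness of the architecture and the reliability of the evaluation, and yield a performance ranking of different vision-language models.

Insight: Modularity and role division are key to efficient scientific illustration generation; multi-dimensional evaluation reflects model capability more fully.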

Abstract: Scientific illustrations demand both high information density and post-editability. However, current generative models have two major limitations. First, image generation models output rasterized images lacking semantic structure, making it impossible to access, edit, or rearrange independent visual components in the images. Second, code-based generation methods (TikZ or SVG), although providing element-level control, force users into the cumbersome cycle of "writing-compiling-reviewing" and lack the intuitiveness of direct manipulation. Neither approach adequately meets the needs for efficiency, intuitiveness, and iterative modification in scientific creation. To bridge this gap, we introduce VisPainter, a multi-agent framework for scientific illustration built upon the model context protocol. VisPainter orchestrates three specialized modules (a Manager, a Designer, and a Toolbox) to collaboratively produce diagrams compatible with standard vector graphics software. This modular, role-based design allows each element to be explicitly represented and manipulated, enabling true element-level control, so that any element can be added or modified later. To systematically evaluate the quality of scientific illustrations, we introduce VisBench, a benchmark with seven-dimensional evaluation metrics. It assesses high-information-density scientific illustrations from four aspects: content, layout, visual perception, and interaction cost. We conducted extensive ablation experiments to verify the rationality of our architecture and the reliability of our evaluation methods. Finally, we evaluated various vision-language models, presenting fair and credible model rankings along with detailed comparisons of their respective capabilities. Additionally, we isolated and quantified the impacts of role division, step control, and description on the quality of illustrations.

[65] Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

Zhuoning Guo,Mingxin Li,Yanzhao Zhang,Dingkun Long,Pengjun Xie,Xiaowen Chu

Main category: cs.CV

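TL;DR: The paper proposes a universal video retrieval framework that co-designs evaluation, data, and modeling to overcome the limited generality of the current video retrieval paradigm, and achieves zero-shot generalization through a multimodal pyramid curriculum.

Details Motivation: Existing video retrieval methods have limited general capability because of narrow data and tasks, and there is no diagnostic evaluation of multi-dimensional generalization.

Contribution: 1. Proposes the Universal Video Retrieval Benchmark (UVRB), comprising 16 datasets for diagnosing multi-task and cross-domain capability; 2. Designs a scalable data synthesis workflow that generates 1.55 million high-quality pairs; 3. Proposes the Modality Pyramid curriculum and the General Video Embedder (GVE).

Method: 1. Design the UVRB benchmark; 2. Generate diverse data with the synthesis workflow; 3. Train GVE with the Modality Pyramid curriculum.

Result: GVE achieves state-of-the-art zero-shot generalization on UVRB; the analysis shows that existing benchmarks are poor predictors of general ability.

Insight: Partially relevant retrieval is a dominant but overlooked scenario; the co-designed framework offers a practical path toward universal video retrieval.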

Abstract: The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB’s diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.

[66] NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding

Wei Xu,Cheng Wang,Dingkang Liang,Zongchuang Zhao,Xingyu Jiang,Peng Zhang,Xiang Bai

Main category: cs.CV

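TL;DR: The paper presents NAUTILUS, a large multimodal model for underwater scene understanding. By building the large-scale NautData dataset and introducing a plug-and-play vision feature enhancement (VFE) module, it significantly improves performance on underwater tasks.

Details Motivation: Underwater exploration matters for resource development, national security, and related fields, but the lack of large-scale multi-task instruction-tuning datasets and the problem of image degradation hinder progress in automated underwater scene understanding.

Contribution: 1. Builds NautData, a dataset of 1.45 million image-text pairs supporting eight underwater scene understanding tasks. 2. Proposes VFE, a vision feature enhancement module driven by physical priors that restores clear underwater information. 3. Integrates the VFE module into the LLaVA-1.5 and Qwen2.5-VL baselines to build the underwater LMM NAUTILUS.

Method: 1. Collect and construct the NautData dataset. 2. Design the VFE module, which restores image features using physical priors from underwater imaging models. 3. Integrate the VFE module into existing multimodal models.

Result: Experiments show that the VFE module significantly improves the baseline models, and that NAUTILUS performs strongly on NautData and public datasets, supporting the effectiveness of the approach.

Insight: 1. Large-scale multi-task datasets are key to underwater scene understanding. 2. A vision enhancement module that incorporates physical priors can mitigate underwater image degradation. 3. Multimodal models have broad application potential in underwater tasks.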

Abstract: Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, etc. We study underwater scene understanding methods, which aim to achieve automated underwater exploration. The underwater scene understanding task demands multi-task perception at multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders the progress of this research. To bridge this gap, we construct NautData, a dataset containing 1.45M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into the well-known baselines LLaVA-1.5 and Qwen2.5-VL and build our underwater LMM, NAUTILUS. Experiments conducted on NautData and public underwater datasets demonstrate the effectiveness of the VFE module, which consistently improves the performance of both baselines on the majority of supported tasks, supporting the superiority of NAUTILUS in underwater scene understanding. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS.

[67] ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

Jiawei Gu,Yunzhuo Hao,Huichen Will Wang,Linjie Li,Michael Qizhe Shieh,Yejin Choi,Ranjay Krishna,Yu Cheng

Main category: cs.CV

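TL;DR: ThinkMorph is a unified model fine-tuned on high-quality interleaved text-image reasoning traces. It demonstrates the benefit of complementary modalities and achieves large gains on vision-centric tasks.

Details Motivation: Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. Text and images should act as complementary modalities that jointly advance reasoning.

Contribution: Proposes the ThinkMorph model, fine-tuned on 24K high-quality interleaved reasoning traces, which demonstrates the benefit of complementary modalities and matches or surpasses the base model and larger and proprietary VLMs.

Method: Builds ThinkMorph to generate progressive interleaved text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic.

Result: Achieves an average gain of 34.7% on vision-centric tasks, generalizes to out-of-domain tasks, and exhibits emergent multimodal intelligence such as visual manipulation skills and adaptive switching between reasoning modes.

Insight: The emergent capabilities of unified multimodal models can arise from the synergy of complementary modalities, suggesting directions for future research on multimodal reasoning.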

Abstract: Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.

[68] Context-Gated Cross-Modal Perception with Visual Mamba for PET-CT Lung Tumor Segmentation

Elena Mulero Ayllón,Linlin Shen,Pierangelo Veltri,Fabrizia Gelardi,Arturo Chiti,Paolo Soda,Matteo Tortora

Main category: cs.CV

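TL;DR: The paper proposes vMambaX, a lightweight multimodal framework that integrates PET and CT images and uses a Context-Gated Cross-Modal Perception Module (CGM) for adaptive cross-modal feature interaction, significantly improving lung tumor segmentation accuracy.

Details Motivation: Accurate lung tumor segmentation is critical for diagnosis and treatment planning, but effectively combining the anatomical and functional information from PET and CT remains a challenge.

Contribution: Proposes the vMambaX framework, which combines the Visual Mamba architecture with the CGM module to achieve adaptive feature interaction between PET and CT images while suppressing noise.

Method: Uses the Context-Gated Cross-Modal Perception Module (CGM), whose adaptive gating mechanism enhances cross-modal feature interaction and emphasizes informative regions.

Result: Performs strongly on the PCLT20K dataset, outperforming baseline models while maintaining lower computational complexity.

Insight: Adaptive cross-modal gating is effective for multimodal tumor segmentation, and vMambaX demonstrates efficiency and scalability.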

Abstract: Accurate lung tumor segmentation is vital for improving diagnosis and treatment planning, and effectively combining anatomical and functional information from PET and CT remains a major challenge. In this study, we propose vMambaX, a lightweight multimodal framework integrating PET and CT scan images through a Context-Gated Cross-Modal Perception Module (CGM). Built on the Visual Mamba architecture, vMambaX adaptively enhances inter-modality feature interaction, emphasizing informative regions while suppressing noise. Evaluated on the PCLT20K dataset, the model outperforms baseline models while maintaining lower computational complexity. These results highlight the effectiveness of adaptive cross-modal gating for multimodal tumor segmentation and demonstrate the potential of vMambaX as an efficient and scalable framework for advanced lung cancer analysis. The code is available at https://github.com/arco-group/vMambaX.

Khandoker Ashik Uz Zaman,Mohammad Zahangir Alam,Mohammed N. M. Ali,Mahdi H. Miraz

Main category: cs.CV

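TL;DR: The paper proposes a deep learning-based watermarking framework for 3D point clouds that embeds binary watermarks via SVD and extracts them with PointNet++, significantly outperforming traditional methods.

Details Motivation: 3D point clouds pose unique challenges for copyright protection: traditional watermarks are easily destroyed by geometric and non-geometric attacks, so a robust watermarking method is needed.

Contribution: Proposes a deep learning watermarking framework combining SVD and PointNet++ that achieves robust watermark extraction under severe attacks.

Method: Embeds the watermark into the singular values of 3D point cloud blocks and extracts it with PointNet++, which is trained to withstand attacks.

Result: Validated on the ModelNet40 dataset; the deep learning method reaches a bitwise accuracy of 0.83 and an IoU of 0.80, significantly outperforming the traditional SVD method.

Insight: Deep learning can substantially improve the attack resistance of 3D point cloud watermarks, offering a new approach to copyright protection.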

Abstract: The protection of intellectual property has become critical due to the rapid growth of three-dimensional content in digital media. Unlike traditional images or videos, 3D point clouds present unique challenges for copyright enforcement, as they are especially vulnerable to a range of geometric and non-geometric attacks that can easily degrade or remove conventional watermark signals. In this paper, we address these challenges by proposing a robust deep neural watermarking framework for 3D point cloud copyright protection and ownership verification. Our approach embeds binary watermarks into the singular values of 3D point cloud blocks using Singular Value Decomposition (SVD) and extracts them with a PointNet++ neural network. The network is trained to reliably extract watermarks even after the data undergoes various attacks such as rotation, scaling, noise, cropping and signal distortions. We validated our method on the publicly available ModelNet40 dataset. Deep learning-based extraction significantly outperforms traditional SVD-based extraction under challenging conditions: for the Crop (70%) attack, the most severe geometric distortion in our experiments, it achieves bitwise accuracy up to 0.83 and Intersection over Union (IoU) of 0.80, compared with a bitwise accuracy of 0.58 and IoU of 0.26 for SVD. This demonstrates our method's ability to achieve superior watermark recovery and maintain high fidelity even under severe distortions.
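
The two metrics quoted above can be illustrated with a small sketch. The abstract does not define how IoU is computed for a watermark, so the version below assumes it is taken over the bit positions set to 1; the function names and the sample bit strings are hypothetical and are not taken from the paper.

```python
def bitwise_accuracy(pred, true):
    """Fraction of bit positions where the extracted watermark matches the original."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def bit_iou(pred, true):
    """Assumed definition: intersection over union of the positions set to 1."""
    inter = sum(1 for p, t in zip(pred, true) if p == 1 and t == 1)
    union = sum(1 for p, t in zip(pred, true) if p == 1 or t == 1)
    # Two all-zero watermarks are treated as a perfect match.
    return inter / union if union else 1.0


# Hypothetical 8-bit watermark and an extraction with two bit errors.
original = [1, 0, 1, 1, 0, 0, 1, 0]
extracted = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = bitwise_accuracy(extracted, original)
iou = bit_iou(extracted, original)
```

Under this assumed definition, IoU penalizes errors more heavily than bitwise accuracy when few bits are set, which is one way the two reported numbers can diverge.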

[70] MapSAM2: Adapting SAM2 for Automatic Segmentation of Historical Map Images and Time Series

Xue Xia,Randall Balestriero,Tao Zhang,Yixin Zhou,Andrew Ding,Dev Saini,Lorenz Hurni

Main category: cs.CV

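TL;DR: MapSAM2 is a framework built on a visual foundation model for automatically segmenting historical map images and time series. It treats both map images and time series as videos, improving segmentation accuracy and reducing annotation cost.

Details Motivation: Historical maps are important geographic archives, but their automated analysis is challenging because of wide stylistic variability and scarce annotated data. Building linked spatio-temporal datasets requires extensive manual work, which MapSAM2 aims to reduce.

Contribution: 1) Proposes the MapSAM2 framework, which unifies segmentation of historical map images and time series; 2) Treats map images and time series as videos, using the memory attention mechanism to improve geometric accuracy; 3) Introduces the Siegfried Building Time Series Dataset and proposes a pseudo time series generation method to reduce annotation cost.

Method: MapSAM2 is built on a visual foundation model and adapts to different tasks with few-shot fine-tuning. Map images are tiled and processed as videos, while for time series, pseudo videos are generated by simulating temporal transformations.

Result: Experiments show that MapSAM2 learns temporal associations effectively and can accurately segment and link buildings in time series under limited supervision or using pseudo videos.

Insight: Modeling historical maps and time series as videos is an effective way to incorporate contextual information and reduce manual annotation, offering a new approach to spatio-temporal data analysis.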

Abstract: Historical maps are unique and valuable archives that document geographic features across different time periods. However, automated analysis of historical map images remains a significant challenge due to their wide stylistic variability and the scarcity of annotated training data. Constructing linked spatio-temporal datasets from historical map time series is even more time-consuming and labor-intensive, as it requires synthesizing information from multiple maps. Such datasets are essential for applications such as dating buildings, analyzing the development of road networks and settlements, and studying environmental changes. We present MapSAM2, a unified framework for automatically segmenting both historical map images and time series. Built on a visual foundation model, MapSAM2 adapts to diverse segmentation tasks with few-shot fine-tuning. Our key innovation is to treat both historical map images and time series as videos. For images, we process a set of tiles as a video, enabling the memory attention mechanism to incorporate contextual cues from similar tiles, leading to improved geometric accuracy, particularly for areal features. For time series, we introduce the annotated Siegfried Building Time Series Dataset and, to reduce annotation costs, propose generating pseudo time series from single-year maps by simulating common temporal transformations. Experimental results show that MapSAM2 learns temporal associations effectively and can accurately segment and link buildings in time series under limited supervision or using pseudo videos. We will release both our dataset and code to support future research.

[71] Image Hashing via Cross-View Code Alignment in the Age of Foundation Models

Ilyass Moummad,Kawtar Zaher,Hervé Goëau,Alexis Joly

Main category: cs.CV

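TL;DR: CroVCA is a simple, unified method for learning binary hash codes through cross-view code alignment, delivering efficient retrieval performance.

Details Motivation: The high-dimensional embeddings produced by foundation models are powerful but expensive to search, and existing hashing methods often rely on complex pipelines or long training times. CroVCA aims to address these problems.

Contribution: Proposes CroVCA, a simple and unified principle for hash learning that combines a binary cross-entropy loss with coding-rate maximization as a regularizer, and designs the lightweight HashCoder network.

Method: A binary cross-entropy loss enforces cross-view alignment, while coding-rate maximization prevents code collapse. HashCoder is a lightweight MLP network that can be used on frozen foundation-model embeddings or with LoRA fine-tuning.

Result: Achieves state-of-the-art results within 5 training epochs; at 16 bits, training completes in under 2 minutes on COCO and about 3 minutes on ImageNet100.

Insight: CroVCA shows the potential of a simple method for efficient hash learning, especially in foundation-model and lightweight settings.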

Abstract: Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it performs particularly well: for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.
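
To make the "fast Hamming distance search with binary codes" mentioned above concrete, here is a minimal sketch of brute-force Hamming retrieval over codes packed into Python integers. It only illustrates the search side and says nothing about how CroVCA learns its codes; the helper names and the 8-bit toy codes are hypothetical.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into a single Python int (first bit is most significant)."""
    code = 0
    for b in bits:
        code = (code << 1) | (1 if b else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two packed codes (XOR, then popcount)."""
    return bin(a ^ b).count("1")


def search(query, database, k=1):
    """Return the indices of the k database codes closest to the query; ties go to the lower index."""
    order = sorted(range(len(database)), key=lambda i: (hamming(query, database[i]), i))
    return order[:k]


# Hypothetical 8-bit codes for three database items and one query.
database = [
    pack_bits([0, 0, 0, 0, 1, 1, 1, 1]),
    pack_bits([1, 1, 1, 1, 0, 0, 0, 0]),
    pack_bits([0, 0, 0, 1, 1, 1, 1, 1]),
]
query = pack_bits([0, 0, 0, 0, 1, 1, 1, 0])
nearest = search(query, database, k=2)
```

Because each comparison is an XOR followed by a bit count, the cost per pair is far lower than a floating-point distance in the original embedding space, which is the efficiency argument the abstract makes.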

[72] ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning

Samarup Bhattacharya,Anubhab Bhattacharya,Abir Chakraborty

Main category: cs.CV

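TL;DR: ANCHOR combines adversarial training with hard-mined supervised contrastive learning in a new representation learning framework that improves both robustness and accuracy on clean and adversarial examples.

Details Motivation: Neural networks are vulnerable to adversarial attacks: even tiny perturbations can cause wrong predictions. To improve robustness while maintaining high accuracy, the paper combines adversarial training with contrastive learning.

Contribution: Proposes the ANCHOR framework, which combines adversarial training with hard-mined supervised contrastive learning and uses adversarial examples and hard positives to optimize representation learning, improving robustness and the structure of the learned representations.

Method: Adversarial training generates perturbed samples, and supervised contrastive learning with a hard-mining strategy pulls the embeddings of same-class samples and their perturbed versions together while pushing them away from samples of other classes.

Result: On CIFAR-10, ANCHOR outperforms standard adversarial training methods in both clean and adversarial (PGD-20, ε=0.031) accuracy, narrowing the gap between accuracy and robustness.

Insight: Combining adversarial training with contrastive learning guides the model toward more stable and meaningful features rather than fragile gradient cues, which improves robustness.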

Abstract: Neural networks have changed the way machines interpret the world. At their core, they learn by following gradients, adjusting their parameters step by step until they identify the most discriminant patterns in the data. This process gives them their strength, yet it also opens the door to a hidden flaw. The very gradients that help a model learn can also be used to produce small, imperceptible tweaks that cause the model to completely alter its decision. Such tweaks are called adversarial attacks. These attacks exploit this vulnerability by adding tiny, imperceptible changes to images that, while leaving them identical to the human eye, cause the model to make wrong predictions. In this work, we propose Adversarially-trained Contrastive Hard-mining for Optimized Robustness (ANCHOR), a framework that leverages the power of supervised contrastive learning with explicit hard positive mining to enable the model to learn representations for images such that the embeddings for the images, their augmentations, and their perturbed versions cluster together in the embedding space along with those for other images of the same class while being separated from images of other classes. This alignment helps the model focus on stable, meaningful patterns rather than fragile gradient cues. On CIFAR-10, our approach achieves impressive results for both clean and robust accuracy under PGD-20 (epsilon = 0.031), outperforming standard adversarial training methods. Our results indicate that combining adversarial guidance with hard-mined contrastive supervision helps models learn more structured and robust representations, narrowing the gap between accuracy and robustness.
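
The evaluation setting "PGD-20 (epsilon = 0.031)" refers to a 20-step projected gradient descent attack within an L-infinity ball of radius 0.031. The sketch below shows that attack on a toy logistic-regression model rather than on the paper's network; the step size `alpha`, the weights, and the input are assumptions chosen for illustration, and the random start that PGD implementations often use is omitted.

```python
import math


def logit(w, b, x):
    """Linear score of a toy logistic-regression model."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def loss_grad_x(w, b, x, y):
    """Gradient of the binary cross-entropy loss with respect to the input x (y is 0 or 1)."""
    p = 1.0 / (1.0 + math.exp(-logit(w, b, x)))
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def pgd(w, b, x, y, eps=0.031, alpha=0.007, steps=20):
    """Untargeted L-infinity PGD: ascend the loss, then project back into the eps-ball."""
    x_adv = list(x)
    for _ in range(steps):
        g = loss_grad_x(w, b, x_adv, y)
        # Signed-gradient ascent step on the loss.
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        # Project onto the eps-ball around the clean input.
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
        # Keep values in the valid pixel range [0, 1].
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv


# Hypothetical model and input with true label 1.
w, b = [2.0, -1.0, 0.5], -0.5
x, y = [0.5, 0.5, 0.5], 1
x_adv = pgd(w, b, x, y)
```

Adversarial training, as used in frameworks of this kind, generates such perturbed inputs during training and includes them in the loss; here the perturbation lowers the model's score for the true class while every coordinate stays within 0.031 of the clean input.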

[73] Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

Yuhong Liu,Beichen Zhang,Yuhang Zang,Yuhang Cao,Long Xing,Xiaoyi Dong,Haodong Duan,Dahua Lin,Jiaqi Wang

Main category: cs.CV

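TL;DR: Spatial-SSRL is a self-supervised reinforcement learning paradigm that derives verifiable signals from ordinary RGB or RGB-D images to improve the spatial understanding of Large Vision-Language Models (LVLMs).

Details Motivation: Existing LVLMs are weak at spatial understanding, and conventional supervised fine-tuning (SFT) and reinforcement learning pipelines depend on costly supervision or constrained environments that limit scale. Spatial-SSRL aims to solve this with self-supervised tasks.

Contribution: Proposes Spatial-SSRL, a self-supervised RL paradigm that requires no human or LVLM annotation and captures 2D and 3D spatial structure through five self-supervised tasks.

Method: Designs five pretext tasks: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide answers that are easy to verify.

Result: On seven spatial understanding benchmarks, Spatial-SSRL delivers average accuracy gains of 4.63% and 3.89% for the 3B and 7B models, respectively.

Insight: Simple, intrinsic supervision tasks can strengthen the spatial reasoning of LVLMs at scale while preserving their general visual capabilities.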

Abstract: Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
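
As an illustration of how a pretext task can yield a ground-truth answer with no annotation, the sketch below builds a shuffled-patch-reordering example from a list of patches. The answer format (for each original position, the index of that patch in the shuffled list) and the exact-match reward are assumptions made for this sketch; the abstract does not specify either.

```python
import random


def make_shuffle_task(patches, rng):
    """Shuffle the patches and return them with the permutation that restores the original order."""
    perm = list(range(len(patches)))
    rng.shuffle(perm)
    # shuffled[j] holds the patch that was originally at position perm[j].
    shuffled = [patches[i] for i in perm]
    # answer[i] is the index in `shuffled` of the patch originally at position i.
    answer = [perm.index(i) for i in range(len(patches))]
    return shuffled, answer


def reward(prediction, answer):
    """Verifiable reward (assumed binary): 1.0 only for an exact match."""
    return 1.0 if list(prediction) == list(answer) else 0.0


# Hypothetical 2x2 grid of patches, represented by labels instead of pixel arrays.
rng = random.Random(0)
patches = ["top-left", "top-right", "bottom-left", "bottom-right"]
shuffled, answer = make_shuffle_task(patches, rng)
restored = [shuffled[j] for j in answer]
```

The answer comes from the shuffling procedure itself, so the reward can be checked by a simple comparison, which is what makes such tasks usable for reinforcement learning with verifiable rewards.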

[74] Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

John Won,Kyungmin Lee,Huiwon Jang,Dongyoung Kim,Jinwoo Shin

Main category: cs.CV

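TL;DR: The paper proposes DUST, a dual-stream diffusion framework that augments Vision-Language-Action models (VLAs) with world modeling. A multimodal diffusion transformer enables cross-modal knowledge sharing and significantly improves performance.

Details Motivation: Jointly predicting next-state observations and action sequences is challenging for current VLAs because of the inherent difference between the two modalities. DUST aims to resolve this modality conflict and improve VLA performance across tasks.

Contribution: 1. Proposes the DUST framework, which uses dual-stream diffusion and a multimodal diffusion transformer to separate modalities while sharing knowledge; 2. Introduces independent noise perturbations and a decoupled flow-matching loss; 3. Proposes a joint sampling method that supports test-time scaling.

Method: 1. Separate the modalities with a dual-stream design; 2. Enable cross-modal interaction in the multimodal diffusion transformer; 3. Use independent noise and a decoupled loss; 4. Support asynchronous sampling at different rates.

Result: Gains of up to 6% on the RoboCasa and GR-1 benchmarks, with test-time scaling adding a further 2-5%; success rates on real-world Franka Research 3 tasks improve by 13%; pre-training on BridgeV2 yields significant transfer gains.

Insight: By decoupling the modalities during training and sampling them asynchronously at test time, DUST resolves the conflict in joint multimodal modeling and shows potential for large-scale VLA pretraining.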

Abstract: Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Based on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, where action and vision tokens evolve asynchronously at different rates. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, while our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST’s potential for large-scale VLA pretraining.
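
The idea of "independent noise perturbations for each modality and a decoupled flow-matching loss" can be sketched as follows. This is not the authors' implementation: the linear noise-to-data path, the velocity target, the unweighted sum of the two losses, and all names and values are assumptions for illustration, and no actual network is included.

```python
import random


def interpolate(data, noise, t):
    """Point on an assumed linear path from noise (t = 0) to data (t = 1)."""
    return [(1 - t) * n + t * d for d, n in zip(data, noise)]


def velocity_target(data, noise):
    """Flow-matching regression target for the linear path: data minus noise."""
    return [d - n for d, n in zip(data, noise)]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def decoupled_loss(pred_action_v, pred_vision_v, action, vision, noise_a, noise_v):
    """Sum of per-modality losses; each stream is supervised only by its own target."""
    return (mse(pred_action_v, velocity_target(action, noise_a))
            + mse(pred_vision_v, velocity_target(vision, noise_v)))


rng = random.Random(0)
action = [0.2, -0.1]        # hypothetical action chunk
vision = [1.0, 0.5, -0.5]   # hypothetical future-observation latent
noise_a = [rng.gauss(0, 1) for _ in action]
noise_v = [rng.gauss(0, 1) for _ in vision]

# Each modality draws its own timestep, so one can be nearly clean while the other is mostly noise.
t_a, t_v = rng.random(), rng.random()
noisy_action = interpolate(action, noise_a, t_a)
noisy_vision = interpolate(vision, noise_v, t_v)

# A model would map (noisy_action, noisy_vision, t_a, t_v) to two predicted velocities;
# the targets stand in for a perfect prediction here.
loss = decoupled_loss(velocity_target(action, noise_a), velocity_target(vision, noise_v),
                      action, vision, noise_a, noise_v)
```

Sampling `t_a` and `t_v` independently exposes the model to every combination of noise levels, which is consistent with the abstract's description of asynchronous test-time sampling where action and vision tokens evolve at different rates.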

[75] Sketch-to-Layout: Sketch-Guided Multimodal Layout Generation

Riccardo Brioschi,Aleksandr Alekseev,Emanuele Nevali,Berkay Döner,Omar El Malki,Blagoj Mitrevski,Leandro Kieliger,Mark Collier,Andrii Maksai,Jesse Berent,Claudiu Musat,Efi Kokiopoulou

Main category: cs.CV

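TL;DR: The paper proposes Sketch-to-Layout, which uses user sketches as intuitive constraints, generates high-quality layouts with a multimodal Transformer model, and introduces an efficient method for synthesizing sketch data.

Details Motivation: Existing layout generation methods require complex user constraints, which reduces usability. Sketches, as an intuitive form of constraint, remain under-explored.

Contribution: 1. Formulates the sketch-guided multimodal layout generation problem (Sketch-to-Layout); 2. Designs a Transformer-based multimodal solution; 3. Proposes an efficient method for synthesizing sketch data; 4. Demonstrates the method's advantage on several public datasets.

Method: A multimodal Transformer model takes the user sketch and the content assets as inputs and generates a layout. To address the shortage of training data, the paper proposes a method for synthesizing sketch data.

Result: Outperforms existing constraint-based methods on the PubLayNet, DocLayNet, and SlidesVQA datasets while offering a more intuitive design experience.

Insight: Sketches are an efficient form of layout constraint; synthetic data can ease annotation cost and provides a new tool for related research.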

Abstract: Graphic layout generation is a growing research area focusing on generating aesthetically pleasing layouts ranging from poster designs to documents. While recent research has explored ways to incorporate user constraints to guide the layout generation, these constraints often require complex specifications which reduce usability. We introduce an innovative approach exploiting user-provided sketches as intuitive constraints and we demonstrate empirically the effectiveness of this new guidance method, establishing the sketch-to-layout problem as a promising research direction, which is currently under-explored. To tackle the sketch-to-layout problem, we propose a multimodal transformer-based solution using the sketch and the content assets as inputs to produce high quality layouts. Since collecting sketch training data from human annotators to train our model is very costly, we introduce a novel and efficient method to synthetically generate training sketches at scale. We train and evaluate our model on three publicly available datasets: PubLayNet, DocLayNet and SlidesVQA, demonstrating that it outperforms state-of-the-art constraint-based methods, while offering a more intuitive design experience. In order to facilitate future sketch-to-layout research, we release O(200k) synthetically-generated sketches for the public datasets above. The datasets are available at https://github.com/google-deepmind/sketch_to_layout.

[76] NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception

Congzhang Shao,Quan Yuan,Guiyang Luo,Yue Hu,Danni Wang,Yilin Liu,Rui Pan,Bo Chen,Jinglin Li

Main category: cs.CV

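TL;DR: NegoCollab is a heterogeneous collaborative perception method based on a negotiated common representation. It introduces a negotiator to reduce the inherent domain gap between agents of different modalities and performs feature transformation supervised by several alignment losses.

Details Motivation: In heterogeneous collaborative perception, agents use different, fixed perception models, which creates domain gaps in shared features and degrades collaborative performance. Existing methods usually designate one agent's representation as the common representation, which handles large domain discrepancies poorly.

Contribution: Proposes NegoCollab, which reduces the domain gap between heterogeneous agents and improves collaborative performance through a negotiated common representation and a feature transformation mechanism.

Method: Introduces a negotiator to derive the common representation, uses a sender-receiver pair to transform features between spaces, and supervises training with structural, pragmatic, and distribution alignment losses.

Result: NegoCollab effectively reduces the domain gap, allows the knowledge in the common representation to be fully distilled into the sender, and improves collaborative perception performance.

Insight: Negotiating rather than designating the common representation accommodates the differences among heterogeneous agents better, and supervision with multiple losses supports more comprehensive feature alignment.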

Abstract: Collaborative perception improves task performance by expanding the perception range through information sharing among agents. Immutable heterogeneity poses a significant challenge in collaborative perception, as participating agents may employ different and fixed perception models. This leads to domain gaps in the intermediate features shared among agents, consequently degrading collaborative performance. Aligning the features of all agents to a common representation can eliminate domain gaps with low training cost. However, in existing methods, the common representation is designated as the representation of a specific agent, making it difficult for agents with significant domain discrepancies from this specific agent to achieve proper alignment. This paper proposes NegoCollab, a heterogeneous collaboration method based on the negotiated common representation. It introduces a negotiator during training to derive the common representation from the local representations of each modality's agent, effectively reducing the inherent domain gap with the various local representations. In NegoCollab, the mutual transformation of features between the local representation space and the common representation space is achieved by a pair of sender and receiver. To better align local representations to the common representation containing multimodal information, we introduce structural alignment loss and pragmatic alignment loss in addition to the distribution alignment loss to supervise the training. This enables the knowledge in the common representation to be fully distilled into the sender.

[77] Deep learning denoising unlocks quantitative insights in operando materials microscopy

Samuel Degnan-Morgenstern,Alexander E. Cohen,Rajeev Gopal,Megan Gober,George J. Nelson,Peng Bai,Martin Z. Bazant

Main category: cs.CV

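TL;DR: The paper presents an unsupervised deep learning-based denoising framework that improves the quantitative analysis capability of operando materials microscopy, and demonstrates its effectiveness across several microscopy techniques.

Details Motivation: Operando microscopy directly observes the dynamic chemical and physical processes in functional materials, but noise limits its resolution and quantitative analysis, so a general denoising approach is needed.

Contribution: Proposes a general unsupervised deep learning denoising framework that applies to multiple microscopy techniques and substantially reduces noise while preserving physical fidelity.

Method: Uses simulated data to verify that deep denoising preserves physical fidelity and introduces minimal bias, and shows that it reduces uncertainty in model learning with PDE-constrained optimization. In experiments, the method is applied to STXM, optical microscopy, and neutron radiography.

Result: The method reveals nanoscale chemical and structural heterogeneity in LFP, enables automated segmentation and phase classification in optical microscopy of graphite electrodes, and reduces noise-induced variability in neutron radiography by nearly 80%.

Insight: Deep denoising is a modality-agnostic enhancement that substantially improves quantitative operando imaging and extends the reach of previously noise-limited techniques.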

Abstract: Operando microscopy provides direct insight into the dynamic chemical and physical processes that govern functional materials, yet measurement noise limits the effective resolution and undermines quantitative analysis. Here, we present a general framework for integrating unsupervised deep learning-based denoising into quantitative microscopy workflows across modalities and length scales. Using simulated data, we demonstrate that deep denoising preserves physical fidelity, introduces minimal bias, and reduces uncertainty in model learning with partial differential equation (PDE)-constrained optimization. Applied to experiments, denoising reveals nanoscale chemical and structural heterogeneity in scanning transmission X-ray microscopy (STXM) of lithium iron phosphate (LFP), enables automated particle segmentation and phase classification in optical microscopy of graphite electrodes, and reduces noise-induced variability by nearly 80% in neutron radiography to resolve heterogeneous lithium transport. Collectively, these results establish deep denoising as a powerful, modality-agnostic enhancement that advances quantitative operando imaging and extends the reach of previously noise-limited techniques.

[78] Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes

Bo Li,Duyuan Zheng,Xinyang Liu,Qingwen Li,Hong Li,Hongyan Cui,Ge Gao,Chen Liu

Main category: cs.CV

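TL;DR: The paper proposes Sh-ViT, a lightweight and robust model for occluded person re-identification in surveillance scenes. With a Shuffle module, scenario-adapted augmentation, and knowledge distillation, Sh-ViT performs well under occlusion and blur.

Details Motivation: Person re-identification in surveillance is hampered by occlusion, viewpoint distortion, and poor image quality; existing methods usually rely on complex modules or work well only on clear frontal images.

Contribution: 1. Proposes the Sh-ViT model, which introduces a Shuffle module to improve robustness to occlusion; 2. Designs scenario-adapted augmentation to simulate real surveillance conditions; 3. Uses DeiT-based knowledge distillation to improve learning with limited labels; 4. Constructs the MyTT dataset to support real-world evaluation.

Method: 1. Embed a Shuffle module in ViT-Base to break spatial correlations; 2. Augment the data with geometric transforms, erasing, blur, and color adjustment; 3. Optimize training with DeiT-based knowledge distillation.

Result: On the MyTT dataset, Sh-ViT reaches 83.2% Rank-1 and 80.1% mAP; on Market1501, it reaches 94.6% Rank-1 and 87.5% mAP, surpassing existing methods.

Insight: With a lightweight design (the Shuffle module and knowledge distillation), the model significantly improves performance under occlusion without adding external modules, making it suitable for practical surveillance applications.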

Abstract: Person re-identification (ReID) in surveillance is challenged by occlusion, viewpoint distortion, and poor image quality. Most existing methods rely on complex modules or perform well only on clear frontal images. We propose Sh-ViT (Shuffling Vision Transformer), a lightweight and robust model for occluded person ReID. Built on ViT-Base, Sh-ViT introduces three components: first, a Shuffle module in the final Transformer layer to break spatial correlations and enhance robustness to occlusion and blur; second, scenario-adapted augmentation (geometric transforms, erasing, blur, and color adjustment) to simulate surveillance conditions; third, DeiT-based knowledge distillation to improve learning with limited labels. To support real-world evaluation, we construct the MyTT dataset, containing over 10,000 pedestrians and 30,000+ images from base station inspections, with frequent equipment occlusion and camera variations. Experiments show that Sh-ViT achieves 83.2% Rank-1 and 80.1% mAP on MyTT, outperforming CNN and ViT baselines, and 94.6% Rank-1 and 87.5% mAP on Market1501, surpassing state-of-the-art methods. In summary, Sh-ViT improves robustness to occlusion and blur without external modules, offering a practical solution for surveillance-based personnel monitoring.

[79] PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting

Danyal Maqbool,Changhee Lee,Zachary Huemann,Samuel D. Church,Matthew E. Larson,Scott B. Perlman,Tomas A. Romero,Joshua D. Warner,Meghan Lubner,Xin Tie,Jameson Merkow,Junjie Hu,Steve Y. Cho,Tyler J. Bradshaw

Main category: cs.CV

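TL;DR: PETAR-4B is a vision-language model for 3D PET/CT imaging. It integrates PET, CT, and lesion contours for spatially grounded report generation and substantially improves the quality of generated PET/CT reports.

Details Motivation: Existing vision-language models (VLMs) mostly target 2D imaging, whereas 3D PET/CT involves large volumetric data, small and dispersed lesions, and lengthy reports, which calls for more capable models.

Contribution: Proposes PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours to generate spatially grounded PET/CT reports, and introduces a large-scale dataset (over 11,000 lesion-level descriptions from more than 5,000 PET/CT exams).

Method: Extracts the data with a hybrid rule-based and large language model (LLM) pipeline, then trains a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours, combining global contextual reasoning with fine-grained lesion awareness.

Result: Automated and human evaluations show that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.

Insight: Combining 3D masks with lesion contours shows that fine-grained spatial localization and report generation are feasible in complex medical imaging; the large-scale dataset is also a useful resource for future research.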

Abstract: Recent advances in vision-language models (VLMs) have enabled impressive multimodal reasoning, yet most medical applications remain limited to 2D imaging. In this work, we extend VLMs to 3D positron emission tomography and computed tomography (PET/CT), a domain characterized by large volumetric data, small and dispersed lesions, and lengthy radiology reports. We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams, extracted via a hybrid rule-based and large language model (LLM) pipeline. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. PETAR bridges global contextual reasoning with fine-grained lesion awareness, producing clinically coherent and localized findings. Comprehensive automated and human evaluations demonstrate that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.

[80] Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals

Xiangyu Fan,Zesong Qiu,Zhuguanyu Wu,Fanzhou Wang,Zhiqian Lin,Tianxiang Ren,Dahua Lin,Ruihao Gong,Lei Yang

Main category: cs.CV

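TL;DR: Phased DMD is a multi-step distillation framework. It uses phase-wise distillation and score matching within subintervals to address the weak performance of one-step distilled models on complex generative tasks, while avoiding the compute and memory overhead of naive multi-step distillation.

Details Motivation: One-step Distribution Matching Distillation (DMD) underperforms on complex generative tasks such as text-to-video generation, while directly extending it to multiple steps increases memory usage and computational depth. A method is needed that preserves diversity and still trains efficiently.

Contribution: 1. Introduces the Phased DMD framework, which combines phase-wise distillation with the Mixture-of-Experts (MoE) idea; 2. Proposes progressive distribution matching and score matching within subintervals; 3. Validates the method on large image and video generation models.

Method: 1. Divides the SNR range into subintervals and progressively refines the model toward higher SNR levels; 2. Derives the training objective within each subinterval mathematically so that it is accurate; 3. Uses MoE to increase model capacity.

Result: Phased DMD preserves output diversity better than DMD while retaining key generative capabilities; it is validated by distilling Qwen-Image (20B parameters) and Wan2.2 (28B parameters).

Insight: Phase-wise training and per-subinterval optimization are an effective strategy for complex generative tasks, balancing model performance against compute cost.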

Abstract: Distribution Matching Distillation (DMD) distills score-based generative models into efficient one-step generators, without requiring a one-to-one correspondence with the sampling trajectories of their teachers. However, limited model capacity causes one-step distilled models to underperform on complex generative tasks, e.g., synthesizing intricate object motions in text-to-video generation. Directly extending DMD to multi-step distillation increases memory usage and computational depth, leading to instability and reduced efficiency. While prior works propose stochastic gradient truncation as a potential solution, we observe that it substantially reduces the generation diversity of multi-step distilled models, bringing it down to the level of their one-step counterparts. To address these limitations, we propose Phased DMD, a multi-step distillation framework that bridges the idea of phase-wise distillation with Mixture-of-Experts (MoE), reducing learning difficulty while enhancing model capacity. Phased DMD is built upon two key ideas: progressive distribution matching and score matching within subintervals. First, our model divides the SNR range into subintervals, progressively refining the model to higher SNR levels, to better capture complex distributions. Next, to ensure the training objective within each subinterval is accurate, we have conducted rigorous mathematical derivations. We validate Phased DMD by distilling state-of-the-art image and video generation models, including Qwen-Image (20B parameters) and Wan2.2 (28B parameters). Experimental results demonstrate that Phased DMD preserves output diversity better than DMD while retaining key generative capabilities. We will release our code and models.
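
To make the "divide the SNR range into subintervals" step concrete, here is a minimal sketch that splits a log-SNR range into consecutive phases ordered from low to high SNR. The equal-width split, the log-SNR parameterization, and the name `snr_phases` are assumptions for illustration; the abstract does not state how the subintervals are chosen, and the per-phase training objective is not reproduced here.

```python
def snr_phases(log_snr_min, log_snr_max, num_phases):
    """Split [log_snr_min, log_snr_max] into equal-width subintervals.

    Phases are returned from low SNR (noisy) to high SNR (clean), the
    order in which a phase-wise scheme would refine the model.
    Equal widths are an illustrative assumption.
    """
    if num_phases < 1 or log_snr_max <= log_snr_min:
        raise ValueError("need num_phases >= 1 and log_snr_max > log_snr_min")
    width = (log_snr_max - log_snr_min) / num_phases
    edges = [log_snr_min + i * width for i in range(num_phases)]
    edges.append(log_snr_max)  # pin the last edge to avoid float drift
    return list(zip(edges[:-1], edges[1:]))


phases = snr_phases(-6.0, 6.0, 4)
for index, (low, high) in enumerate(phases):
    print(f"phase {index}: log-SNR in [{low}, {high}]")
```

In a phase-wise setup each subinterval would be handled by its own training stage (or expert), with later phases starting from the result of earlier ones.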

eess.AS [Back]

[81] See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement

Jinting Wang,Jun Wang,Hei Victor Cheng,Li Liu

Main category: eess.AS

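TL;DR: The paper proposes generating high-resolution talking-face videos directly from speech, without a source image as an appearance reference. The method has a speech-to-face portrait generation stage and a speech-driven talking-face generation stage, and combines a diffusion model with a region-enhancement module to achieve accurate lip synchronization and enhanced detail.

Details Motivation: Existing methods rely on source images as appearance references and use source speech to generate motion, which limits the flexibility and quality of generation. To overcome this, the paper extracts the needed information directly from speech to generate high-quality, high-resolution talking-face videos.

Contribution: 1. Proposes a new method that generates high-resolution talking-face videos directly from speech, with no source image. 2. Combines a diffusion model with a region-enhancement module to improve lip synchronization and detail. 3. Is the first method able to generate high-quality talking-face videos from a single speech input alone.

Method: Two stages: (1) speech-to-face portrait generation, which uses a speech-conditioned diffusion model with a statistical facial prior and a sample-adaptive weighting module to produce a high-quality portrait; (2) speech-driven talking-face generation, which refines lip synchronization with a region-enhancement module and enhances frame detail with a Transformer-based discrete codebook and an image rendering network.

Result: Experiments show that the method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets and generates high-resolution, high-quality talking-face videos.

Insight: Generating talking-face videos directly from speech is an important and challenging task. By combining a diffusion model with region enhancement, the method shows that high-quality results are possible even without a source image.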

Abstract: Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking face. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking face generation stage, we embed expressive dynamics such as lip movement, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization using a region-enhancement module. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input.

cs.IR [Back]

[82] Evaluating Perspectival Biases in Cross-Modal Retrieval

Teerapol Saengsukhiran,Peerawat Chomphooyod,Narabodee Rodjananant,Chompakorn Chaksangchaichot,Patawee Prakrankamanant,Witthawin Sripheanpol,Pak Lovichit,SarChaksaana Nutanong,Ekapol Chuangsuwanich

Main category: cs.IR

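TL;DR: The paper studies perspectival biases in cross-modal retrieval, shows that linguistic prevalence and cultural association systematically affect retrieval results, and examines targeted mitigation strategies.

Details Motivation: Multimodal retrieval systems are expected to operate in a semantic space that is agnostic to language and culture, but in practice they exhibit perspectival biases. The paper studies two of them: bias from linguistic prevalence and bias from cultural association.

Contribution: Identifies and analyzes two perspectival biases in cross-modal retrieval, prevalence bias and association bias, and examines different mitigation strategies.

Method: Studies experimentally how the two biases manifest and compares how well different strategies, such as explicit alignment, mitigate them.

Result: Explicit alignment is the more effective strategy for mitigating prevalence bias, while association bias remains a distinct and more challenging problem.

Insight: Achieving truly equitable multimodal systems requires strategies targeted at each bias; bias arising from cultural association is harder to address than bias arising from linguistic prevalence.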

Abstract: Multimodal retrieval systems are expected to operate in a semantic space, agnostic to the language or cultural origin of the query. In practice, however, retrieval outcomes systematically reflect perspectival biases: deviations shaped by linguistic prevalence and cultural associations. We study two such biases. First, prevalence bias refers to the tendency to favor entries from prevalent languages over semantically faithful entries in image-to-text retrieval. Second, association bias refers to the tendency to favor images culturally associated with the query over semantically correct ones in text-to-image retrieval. Results show that explicit alignment is a more effective strategy for mitigating prevalence bias. However, association bias remains a distinct and more challenging problem. These findings suggest that achieving truly equitable multimodal systems requires targeted strategies beyond simple data scaling and that bias arising from cultural association may be treated as a more challenging problem than one arising from linguistic prevalence.

cs.DB [Back]

[83] DRAMA: Unifying Data Retrieval and Analysis for Open-Domain Analytic Queries

Chuxuan Hu,Maxwell Yang,James Weiland,Yeji Lim,Suhas Palawala,Daniel Kang

Main category: cs.DB

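TL;DR: DRAMA is an end-to-end paradigm that answers natural-language analytic queries over large-scale open-domain data by unifying data collection, transformation, and analysis in one pipeline. Its implementation performs strongly on the DRAMA-Bench benchmark.

Details Motivation: Manual real-world data analysis is labor-intensive and inefficient, and no existing system supports all three key capabilities at once: open-domain data collection, structured data transformation, and analytic reasoning.

Contribution: DRAMA unifies data collection, transformation, and analysis as a single pipeline; the paper also constructs the DRAMA-Bench benchmark and develops DRAMA-Bot, an accurate and low-cost multi-agent system.

Method: DRAMA is realized as a multi-agent system with a data retriever, which collects and transforms data by coordinating sub-agents, and a data analyzer, which performs structured reasoning over the retrieved data.

Result: On DRAMA-Bench, DRAMA-Bot achieves 86.5% task accuracy at a cost of $0.05, outperforming all baselines in accuracy and at lower cost.

Insight: An end-to-end paradigm that unifies data collection and analysis can markedly improve the efficiency and accuracy of open-domain data analysis.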

Abstract: Manually conducting real-world data analyses is labor-intensive and inefficient. Despite numerous attempts to automate data science workflows, none of the existing paradigms or systems fully demonstrate all three key capabilities required to support them effectively: (1) open-domain data collection, (2) structured data transformation, and (3) analytic reasoning. To overcome these limitations, we propose DRAMA, an end-to-end paradigm that answers users’ analytic queries in natural language on large-scale open-domain data. DRAMA unifies data collection, transformation, and analysis as a single pipeline. To quantitatively evaluate system performance on tasks representative of DRAMA, we construct a benchmark, DRAMA-Bench, consisting of two categories of tasks: claim verification and question answering, each comprising 100 instances. These tasks are derived from real-world applications that have gained significant public attention and require the retrieval and analysis of open-domain data. We develop DRAMA-Bot, a multi-agent system designed following DRAMA. It comprises a data retriever that collects and transforms data by coordinating the execution of sub-agents, and a data analyzer that performs structured reasoning over the retrieved data. We evaluate DRAMA-Bot on DRAMA-Bench together with five state-of-the-art baseline agents. DRAMA-Bot achieves 86.5% task accuracy at a cost of $0.05, outperforming all baselines with up to 6.9 times the accuracy and less than 1/6 of the cost. DRAMA is publicly available at https://github.com/uiuc-kang-lab/drama.

cs.CY [Back]

Riley Grossman,Michael Smith,Cristian Borcea,Yi Chen

Main category: cs.CY

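TL;DR: The paper uses salient object detection to measure how often GDPR-compliant cookie banners contain aesthetic manipulation (e.g., button designs that draw the user's attention). It finds manipulation on 38% of compliant banners and finds that EU websites are more likely to use such designs.

Details Motivation: To study how prevalent aesthetic manipulation is in GDPR-compliant cookie banners, and to evaluate banner compliance and the potential influence on user decisions.

Contribution: 1) Uses salient object detection to quantify the visual salience of banner elements, unlike prior studies; 2) Finds that aesthetic manipulation is more common than previously reported; 3) Reveals design differences between EU and non-EU websites.

Method: 1) Visits 2,579 websites and classifies their cookie banners; 2) Uses a computer vision model to detect salient elements; 3) Compares the designs of EU and non-EU websites.

Result: 45% of the relevant websites have fully compliant banners, and 38% of compliant banners contain aesthetic manipulation; EU websites are roughly 48.3% more likely than non-EU websites to use aesthetic manipulation.

Insight: Aesthetic manipulation is widespread even among compliant banners, and automated measurement detects it more objectively; EU websites show innovative responses to privacy regulation.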

Abstract: The main goal of this paper is to study how often cookie banners that comply with the General Data Protection Regulation (GDPR) contain aesthetic manipulation, a design tactic to draw users’ attention to the button that permits personal data sharing. As a byproduct of this goal, we also evaluate how frequently the banners comply with GDPR and the recommendations of national data protection authorities regarding banner designs. We visited 2,579 websites and identified the type of cookie banner implemented. Although 45% of the relevant websites have fully compliant banners, we found aesthetic manipulation on 38% of the compliant banners. Unlike prior studies of aesthetic manipulation, we use a computer vision model for salient object detection to measure how salient (i.e., attention-drawing) each banner element is. This enables the discovery of new types of aesthetic manipulation (e.g., button placement), and leads us to conclude that aesthetic manipulation is more common than previously reported (38% vs 27% of banners). To study the effects of user and/or website location on cookie banner design, we include websites within the European Union (EU), where privacy regulation enforcement is more stringent, and websites outside the EU. We visited websites from IP addresses in the EU and from IP addresses in the United States (US). We find that 13.9% of EU websites change their banner design when the user is from the US, and EU websites are roughly 48.3% more likely to use aesthetic manipulation than non-EU websites, highlighting their innovative responses to privacy regulation.

cs.RO [Back]

[85] A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics

Simindokht Jahangard,Mehrzad Mohammadi,Abhinav Dhall,Hamid Rezatofighi

Main category: cs.RO

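TL;DR: The paper proposes a neuro-symbolic framework that combines panoramic images with 3D point clouds and fuses neural perception with symbolic reasoning, improving robots' spatial reasoning in complex environments.

Details Motivation: Existing vision-language models (VLMs) excel at perception tasks but fall short on fine-grained spatial reasoning, especially in robotics tasks that require understanding complex relationships between objects.

Contribution: Proposes a multi-modal neuro-symbolic framework that integrates neural perception with symbolic reasoning, explicitly models spatial and logical relationships, and supports precise, interpretable queries.

Method: The framework has a perception module that detects entities and extracts attributes, and a reasoning module that constructs a structured scene graph; it uses multi-modal input from panoramic images and 3D point clouds.

Result: Experiments on the JRDB-Reasoning dataset show superior performance and reliability in crowded, human-built environments while keeping a lightweight design.

Insight: Explicitly modeling spatial relationships and fusing multi-modal data improves visual reasoning in complex environments and offers an efficient solution for robotics applications.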

Abstract: Visual reasoning, particularly spatial reasoning, is a challenging cognitive task that requires understanding object relationships and their interactions within complex environments, especially in the robotics domain. Existing vision-language models (VLMs) excel at perception tasks but struggle with fine-grained spatial reasoning due to their implicit, correlation-driven reasoning and reliance solely on images. We propose a novel neuro-symbolic framework that integrates both panoramic-image and 3D point cloud information, combining neural perception with symbolic reasoning to explicitly model spatial and logical relationships. Our framework consists of a perception module for detecting entities and extracting attributes, and a reasoning module that constructs a structured scene graph to support precise, interpretable queries. Evaluated on the JRDB-Reasoning dataset, our approach demonstrates superior performance and reliability in crowded, human-built environments while maintaining a lightweight design suitable for robotics and embodied AI applications.

physics.med-ph [Back]

[86] Dark-Field X-Ray Imaging Significantly Improves Deep-Learning based Detection of Synthetic Early-Stage Lung Tumors in Preclinical Models

Joyoni Dey,Hunter C. Meyer,Murtuza S. Taqi

Main category: physics.med-ph

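TL;DR: The paper shows that X-ray dark-field imaging (DFI) combined with deep-learning segmentation significantly improves detection of synthetic early-stage tumors in mouse lungs.

Details Motivation: Low-dose CT (LDCT) is the current standard for lung cancer screening, but its adoption and accessibility are limited and its false-positive rate is high. The authors investigate whether DFI can improve on this.

Contribution: Demonstrates that DFI combined with a deep-learning model (U-Net) substantially improves the detection rate of early-stage lung tumors, and proposes DFI as a potential low-cost, low-dose screening alternative.

Method: Uses paired attenuation (ATTN) and DFI images of mouse lungs to generate synthetic tumor data, then trains and compares a U-Net segmentation network under three input conditions: ATTN only, DFI only, and ATTN plus DFI.

Result: The DFI-only model reaches a sensitivity of 83.7%, far above the ATTN-only model (51%), with comparable specificity (90.5% vs. 92.9%). The combined ATTN and DFI input reaches 79.6% sensitivity and 97.6% specificity.

Insight: DFI is sensitive to small-angle scatter and less susceptible to organ shadowing, which makes it better suited to early tumor detection. This points to a feasible low-cost screening option for resource-limited regions.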

Abstract: Low-dose computed tomography (LDCT) is the current standard for lung cancer screening, yet its adoption and accessibility remain limited. Many regions lack LDCT infrastructure, and even among those screened, early-stage cancer detection often yields false positives, as shown in the National Lung Screening Trial (NLST) with a sensitivity of 93.8 percent and a false-positive rate of 26.6 percent. We aim to investigate whether X-ray dark-field imaging (DFI) radiograph, a technique sensitive to small-angle scatter from alveolar microstructure and less susceptible to organ shadowing, can significantly improve early-stage lung tumor detection when coupled with deep-learning segmentation. Using paired attenuation (ATTN) and DFI radiograph images of euthanized mouse lungs, we generated realistic synthetic tumors with irregular boundaries and intensity profiles consistent with physical lung contrast. A U-Net segmentation network was trained on small patches using either ATTN, DFI, or a combination of ATTN and DFI channels. Results show that the DFI-only model achieved a true-positive detection rate of 83.7 percent, compared with 51 percent for ATTN-only, while maintaining comparable specificity (90.5 versus 92.9 percent). The combined ATTN and DFI input achieved 79.6 percent sensitivity and 97.6 percent specificity. In conclusion, DFI substantially improves early-tumor detectability in comparison to standard attenuation radiography and shows potential as an accessible, low-cost, low-dose alternative for pre-clinical or limited-resource screening where LDCT is unavailable.
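
For readers less familiar with the two metrics reported above, the following sketch shows how sensitivity (true-positive rate) and specificity (true-negative rate) are computed from detection counts. The counts in the example are hypothetical: they were picked only so that the ratios round to the DFI-only percentages quoted in the abstract, which does not report raw counts.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual tumors that were detected (true-positive rate)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of tumor-free cases correctly left unflagged (true-negative rate)."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, NOT from the paper: 49 tumor cases, 42 tumor-free cases.
dfi_sensitivity = sensitivity(true_pos=41, false_neg=8)
dfi_specificity = specificity(true_neg=38, false_pos=4)
print(f"sensitivity = {dfi_sensitivity:.1%}, specificity = {dfi_specificity:.1%}")
```

The false-positive rate is simply one minus the specificity, which is how a screening method can pair high sensitivity with a substantial false-positive rate.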

cs.LG [Back]

[87] Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity

Austin Meek,Eitan Sprejer,Iván Arcuschin,Austin J. Brockmeier,Steven Basart

Main category: cs.LG

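TL;DR: The paper proposes a way to measure chain-of-thought (CoT) monitorability by combining faithfulness and verbosity, and shows how models differ across tasks.

Details Motivation: CoT monitorability matters for spotting unsafe or misaligned model behavior, but existing methods only examine cases where the model changes its answer, overlooking other key factors.

Contribution: Proposes a monitorability score that combines faithfulness and verbosity, and releases the evaluation code.

Method: Quantifies the faithfulness and verbosity of the CoT, combines them into a single score, and evaluates models on BBH, GPQA, and MMLU.

Result: Some models appear faithful yet remain hard to monitor, and monitorability differs sharply across model families.

Insight: Verbosity is an important component of CoT monitorability; future work can build on it to improve model transparency and safety.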

Abstract: Chain-of-thought (CoT) outputs let us read a model’s step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model’s external “working memory”, a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.

[88] Atlas-Alignment: Making Interpretability Transferable Across Language Models

Bruno Puri,Jim Berend,Sebastian Lapuschkin,Wojciech Samek

Main category: cs.LG

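TL;DR: Atlas-Alignment is a framework that transfers interpretability across language models by aligning an unknown latent space to a labeled “Concept Atlas”, lowering cost and improving scalability.

Details Motivation: Existing interpretability pipelines are costly and hard to scale: each new model needs its own sparse autoencoders and component labeling. The work aims to cut this cost by aligning latent spaces to a shared, labeled one.

Contribution: 1. Introduces the Atlas-Alignment framework, which transfers interpretability through latent space alignment; 2. Enables semantic feature search and steerable generation without labeled concept data.

Method: Uses lightweight representational alignment to map the target model's latent space onto a pre-built, labeled Concept Atlas, so that the atlas's existing labels can be reused.

Result: Quantitative and qualitative evaluations show that the method supports robust semantic retrieval and steerable generation without labeled concept data.

Insight: Building one high-quality Concept Atlas can make many new models transparent and controllable at minimal marginal cost.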

Abstract: Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires costly training of model-specific sparse autoencoders, manual or semi-automated labeling of SAE components, and their subsequent validation. We introduce Atlas-Alignment, a framework for transferring interpretability across language models by aligning unknown latent spaces to a Concept Atlas - a labeled, human-interpretable latent space - using only shared inputs and lightweight representational alignment techniques. Once aligned, this enables two key capabilities in previously opaque models: (1) semantic feature search and retrieval, and (2) steering generation along human-interpretable atlas concepts. Through quantitative and qualitative evaluations, we show that simple representational alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept data. Atlas-Alignment thus amortizes the cost of explainable AI and mechanistic interpretability: by investing in one high-quality Concept Atlas, we can make many new models transparent and controllable at minimal marginal cost.
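
The abstract above mentions "lightweight representational alignment techniques" without naming one. As a toy sketch only, the following code assumes the simplest possible case: 2D latent vectors and an alignment restricted to a rotation (a 2D orthogonal Procrustes fit), learned from vectors produced by shared inputs. All names and the two-concept atlas are hypothetical; real latent spaces are high-dimensional and the paper's method may differ.

```python
import math


def fit_rotation(model_vecs, atlas_vecs):
    """Angle of the 2D rotation that best maps model_vecs onto atlas_vecs.

    Closed-form least-squares solution for paired 2D points
    (orthogonal Procrustes restricted to rotations).
    """
    cross = sum(x0 * y1 - x1 * y0
                for (x0, x1), (y0, y1) in zip(model_vecs, atlas_vecs))
    dot = sum(x0 * y0 + x1 * y1
              for (x0, x1), (y0, y1) in zip(model_vecs, atlas_vecs))
    return math.atan2(cross, dot)


def rotate(vec, theta):
    """Rotate a 2D vector counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])


def nearest_concept(vec, atlas):
    """Label of the atlas concept with the highest cosine similarity to vec."""
    def cosine(a, b):
        return (a[0] * b[0] + a[1] * b[1]) / (math.hypot(*a) * math.hypot(*b))
    return max(atlas, key=lambda name: cosine(vec, atlas[name]))


# Hypothetical labeled atlas and the same shared inputs seen by a new model,
# whose latent space happens to be the atlas rotated by -90 degrees.
atlas = {"animal": (1.0, 0.0), "vehicle": (0.0, 1.0)}
model_side = [(0.0, -1.0), (1.0, 0.0)]
theta = fit_rotation(model_side, list(atlas.values()))

# An unlabeled model activation is mapped into the atlas and then labeled.
label = nearest_concept(rotate((0.9, 0.1), theta), atlas)
```

The point of the sketch is the workflow: fit a cheap map on shared inputs once, then reuse the atlas labels for any activation of the new model.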

[89] Thought Branches: Interpreting LLM Reasoning Requires Resampling

Uzay Macar,Paul C. Bogdan,Senthooran Rajamanoharan,Neel Nanda

Main category: cs.LG

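TL;DR: The paper argues for studying the distribution of a large language model's (LLM's) reasoning through resampling, rather than a single chain-of-thought (CoT), to analyze causal influence more reliably, guide interventions, and explain model behavior.

Details Motivation: Existing work usually analyzes a single chain-of-thought, although a model defines a distribution over many possible reasoning paths. A single sample is inadequate for understanding causal influence and the underlying computation, so a more systematic approach is needed.

Contribution: 1. Proposes studying LLM reasoning distributions via resampling; 2. Shows through case studies that resampling works for causal analysis, intervention, and model explanation; 3. Introduces a resilience metric for the effect of removing a reasoning step.

Method: 1. Resamples specific sentences in “agentic misalignment” scenarios to measure their causal effect; 2. Compares artificial edits to the CoT with resampling-based interventions; 3. Introduces a resilience metric that repeatedly resamples to prevent removed content from reappearing; 4. Adapts causal mediation analysis to study the influence of unmentioned hints.

Result: 1. Self-preservation sentences have small causal impact on blackmail behavior; 2. Artificial (off-policy) edits yield small and unstable effects, whereas resampling is more reliable; 3. Critical planning statements resist removal but have large effects when eliminated; 4. Unmentioned hints exert a subtle, cumulative influence on the CoT.

Insight: Resampling offers a reliable way to understand LLM reasoning, exposes differences between intervention strategies and the role of implicit factors, and helps improve model explanation and steering.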

Abstract: Most work interpreting reasoning models studies only a single chain-of-thought (CoT), yet these models define distributions over many possible CoTs. We argue that studying a single sample is inadequate for understanding causal influence and the underlying computation. Though fully specifying this distribution is intractable, it can be understood by sampling. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In “agentic misalignment” scenarios, we resample specific sentences to measure their downstream effects. Self-preservation sentences have small causal impact, suggesting they do not meaningfully drive blackmail. Second, are artificial edits to CoT sufficient for steering reasoning? These are common in literature, yet take the model off-policy. Resampling and selecting a completion with the desired property is a principled on-policy alternative. We find off-policy interventions yield small and unstable effects compared to resampling in decision-making tasks. Third, how do we understand the effect of removing a reasoning step when the model may repeat it post-edit? We introduce a resilience metric that repeatedly resamples to prevent similar content from reappearing downstream. Critical planning statements resist removal but have large effects when eliminated. Fourth, since CoT is sometimes “unfaithful”, can our methods teach us anything in these settings? Adapting causal mediation analysis, we find that hints that have a causal effect on the output without being explicitly mentioned exert a subtle and cumulative influence on the CoT that persists even if the hint is removed. Overall, studying distributions via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled CoT interventions.
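
The core idea of the abstract above, estimating a sentence's causal effect from many sampled continuations instead of one, can be sketched with a stand-in sampler. Everything here is hypothetical: `toy_sampler` replaces a real model call, and the probabilities 0.6 and 0.2 are invented so the example has a known answer.

```python
import random


def outcome_rate(sample_fn, prefix, n, rng):
    """Fraction of n sampled continuations of `prefix` that show the outcome."""
    return sum(sample_fn(prefix, rng) for _ in range(n)) / n


def sentence_effect(sample_fn, prefix_with, prefix_without, n=2000, seed=0):
    """Estimated change in outcome rate attributable to one sentence.

    Compares continuations sampled after a prefix containing the sentence
    with continuations sampled after a prefix where it was replaced.
    """
    rng = random.Random(seed)
    with_rate = outcome_rate(sample_fn, prefix_with, n, rng)
    without_rate = outcome_rate(sample_fn, prefix_without, n, rng)
    return with_rate - without_rate


def toy_sampler(prefix, rng):
    """Stand-in for a model rollout: returns True if the outcome occurs.

    Invented behavior: the outcome is more likely after a planning sentence.
    """
    probability = 0.6 if "plan:" in prefix else 0.2
    return rng.random() < probability


effect = sentence_effect(
    toy_sampler,
    prefix_with="context. plan: do X.",
    prefix_without="context. some other sentence.",
)
print(f"estimated effect on outcome rate: {effect:+.2f}")
```

With a real model, `sample_fn` would generate a continuation and classify it, and the replacement sentence would itself be resampled from the model so that the comparison stays on-policy.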

cs.SD [Back]

[90] Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling

Jiarong Du,Zhan Jin,Peijun Yang,Juan Liu,Zhuo Li,Xin Liu,Ming Li

Main category: cs.SD

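TL;DR: The paper proposes an audio-visual speech enhancement (AVSE) system that performs well in complex acoustic environments. It uses a “separation before dereverberation” pipeline that jointly models separation and dereverberation, improving speech quality.

Details Motivation: In real-world scenarios, complex acoustic conditions such as interfering sounds and reverberation make earlier AVSE methods perform poorly, so an effective multimodal speech enhancement solution is needed.

Contribution: Proposes a “separation before dereverberation” pipeline that can be extended to other AVSE networks and jointly models separation and dereverberation.

Method: Jointly models separation and dereverberation and uses audio-visual information to extract the target speaker's speech in complex environments.

Result: In AVSEC-4, the system achieved excellent results on the three objective metrics on the leaderboard and took first place in the human subjective listening test.

Insight: Combining multimodal information (visual cues) with joint modeling (separation plus dereverberation) is key to better speech enhancement in complex acoustic environments.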

Abstract: Audio-visual speech enhancement (AVSE) is a task that uses visual auxiliary information to extract a target speaker’s speech from mixed audio. In real-world scenarios, there often exist complex acoustic environments, accompanied by various interfering sounds and reverberation. Most previous methods struggle to cope with such complex conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a “separation before dereverberation” pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in multimodal complex environments. We validated the performance of our system in AVSEC-4: we achieved excellent results in the three objective metrics on the competition leaderboard, and ultimately secured first place in the human subjective listening test.

eess.IV [Back]

[91] A fragile zero-watermarking method based on dual quaternion matrix decomposition

Mingcui Zhang,Zhigang Jia

Main category: eess.IV

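TL;DR: The paper proposes a fragile zero-watermarking method based on dual quaternion matrix decomposition for protecting the copyright of medical images and detecting content tampering.

Details Motivation: Medical images are crucial for diagnosis and remote consultation, but during transmission and sharing they are exposed to copyright disputes and content tampering, so they need effective protection.

Contribution: Proposes a fragile zero-watermarking technique that combines the operational relationship between the parts of dual quaternions with the properties of their matrix decomposition, achieving copyright protection and tamper detection for medical images.

Method: Uses the operational relationship between the standard part and the dual part of dual quaternions to correlate the carrier image with the watermark image, and generates the zero-watermarking information from the characteristics of the dual quaternion matrix decomposition.

Result: Achieves copyright protection and tamper detection for medical images without modifying the original image.

Insight: Dual quaternion matrix decomposition offers a new technical route for medical image protection that covers both copyright protection and content integrity verification.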

Abstract: Medical images play a crucial role in assisting diagnosis, remote consultation, and academic research. However, during transmission and sharing, they face serious risks of copyright infringement and content tampering. Therefore, protecting medical images is of great importance. As an effective means of image copyright protection, zero-watermarking technology constructs watermarks without modifying the original carrier by extracting its stable features, which provides an ideal approach for protecting medical images. This paper proposes a fragile zero-watermarking model based on dual quaternion matrix decomposition, which utilizes the operational relationship between the standard part and the dual part of dual quaternions to correlate the original carrier image with the watermark image, and generates zero-watermarking information based on the characteristics of dual quaternion matrix decomposition, ultimately achieving copyright protection and content tampering detection for medical images.

cs.AI [Back]

[92] The Denario project: Deep knowledge AI agents for scientific discovery

Francisco Villaescusa-Navarro,Boris Bolliet,Pablo Villanueva-Domingo,Adrian E. Bayer,Aidan Acquah,Chetana Amancharla,Almog Barzilay-Siegal,Pablo Bermejo,Camille Bilodeau,Pablo Cárdenas Ramírez,Miles Cranmer,Urbano L. França,ChangHoon Hahn,Yan-Fei Jiang,Raul Jimenez,Jun-Young Lee,Antonio Lerario,Osman Mamun,Thomas Meier,Anupam A. Ojha,Pavlos Protopapas,Shimanto Roy,David N. Spergel,Pedro Tarancón-Álvarez,Ujjwal Tiwari,Matteo Viel,Digvijay Wadekar,Chi Wang,Bonny Y. Wang,Licong Xu,Yossi Yovel,Shuwen Yue,Wen-Han Zhou,Qiyao Zhu,Jiajun Zou,Íñigo Zubeldia

Main category: cs.AI

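TL;DR: Denario is a modular AI multi-agent system designed as a scientific research assistant that can carry out the whole process from generating ideas to drafting a scientific paper.

Details Motivation: To cope with the complexity and cross-disciplinary demands of scientific research, Denario uses AI to help scientists with a wide range of research tasks and improve research efficiency.

Contribution: Denario's core contributions are its modular architecture and versatility: it generates research output across many scientific fields, and its capabilities are assessed through expert evaluation.

Method: The system is modular and uses Cmbagent as a deep-research backend, supporting both individual tasks and end-to-end scientific analysis from idea generation to paper writing.

Result: Denario generated papers in many disciplines (e.g., astrophysics and biology), which domain experts evaluated with numerical scores and review-like feedback. It also showed potential for combining ideas across disciplines.

Insight: Denario shows the potential of AI in scientific research, while also exposing the current system's limitations and ethical issues and prompting reflection on how AI-driven research relates to the philosophy of science.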

Abstract: We present Denario, an AI multi-agent system designed to serve as a scientific research assistant. Denario can perform many different tasks, such as generating ideas, checking the literature, developing research plans, writing and executing code, making plots, and drafting and reviewing a scientific paper. The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or carrying out end-to-end scientific analysis using Cmbagent as a deep-research backend. In this work, we describe Denario and its modules in detail, and illustrate its capabilities by presenting multiple papers generated by it in many different scientific disciplines such as astrophysics, biology, biophysics, biomedical informatics, chemistry, material science, mathematical physics, medicine, neuroscience and planetary science. Denario also excels at combining ideas from different disciplines, and we illustrate this by showing a paper that applies methods from quantum physics and machine learning to astrophysical data. We report the evaluations performed on these papers by domain experts, who provided both numerical scores and review-like feedback. We then highlight the strengths, weaknesses, and limitations of the current system. Finally, we discuss the ethical implications of AI-driven research and reflect on how such technology relates to the philosophy of science. We publicly release the code at https://github.com/AstroPilot-AI/Denario. A Denario demo can also be run directly on the web at https://huggingface.co/spaces/astropilot-ai/Denario, and the full app will be deployed on the cloud.

[93] Glia: A Human-Inspired AI for Automated Systems Design and Optimization

Pouya Hamadanian,Pantea Karimi,Arash Nasr-Esfahany,Kimia Noorbakhsh,Joseph Chandler,Ali ParandehGheibi,Mohammad Alizadeh,Hari Balakrishnan

Main category: cs.AI

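TL;DR: Glia is a human-inspired AI architecture for automated systems design and optimization. It uses LLMs in a multi-agent workflow to produce interpretable system designs.

Details Motivation: The work asks whether an AI can autonomously design mechanisms for computer systems with creativity and reasoning on par with human experts.

Contribution: Proposes the Glia architecture, a multi-agent collaboration framework built on LLMs whose designs perform well and come with an interpretable design process and exposed reasoning.

Method: Glia uses a multi-agent workflow in which agents specialize in reasoning, experimentation, and analysis, and collaborate through an evaluation framework that grounds abstract reasoning in empirical feedback.

Result: Applied to a distributed GPU cluster for LLM inference, Glia produced new algorithms for request routing, scheduling, and auto-scaling that perform at human-expert levels in significantly less time, and yielded novel insights into workload behavior.

Insight: Combining reasoning LLMs with structured experimentation lets an AI produce creative and understandable designs for complex systems problems.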

Abstract: Can an AI autonomously design mechanisms for computer systems on par with the creativity and reasoning of human experts? We present Glia, an AI architecture for networked systems design that uses large language models (LLMs) in a human-inspired, multi-agent workflow. Each agent specializes in reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback. Unlike prior ML-for-systems methods that optimize black-box policies, Glia generates interpretable designs and exposes its reasoning process. When applied to a distributed GPU cluster for LLM inference, it produces new algorithms for request routing, scheduling, and auto-scaling that perform at human-expert levels in significantly less time, while yielding novel insights into workload behavior. Our results suggest that by combining reasoning LLMs with structured experimentation, an AI can produce creative and understandable designs for complex systems problems.

[94] DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains

Tian Liang,Wenxiang Jiao,Zhiwei He,Jiahao Xu,Haitao Mi,Dong Yu

Main category: cs.AI

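TL;DR: DeepCompress is a dual-reward framework that dynamically adjusts reasoning-chain length: it compresses reasoning on simple problems and extends exploration on hard ones, improving both accuracy and efficiency.

Details Motivation: Large reasoning models suffer from cognitive inefficiency (overthinking simple problems, underthinking complex ones), and existing methods that improve efficiency often sacrifice accuracy.

Contribution: Proposes a dual-reward mechanism that dynamically classifies problems as Simple or Hard and adapts reasoning-chain length accordingly, improving accuracy and efficiency together.

Method: An adaptive length reward classifies each problem dynamically; Simple problems are rewarded for shorter, compressed reasoning, and Hard problems are rewarded for longer chains that explore more candidate solutions.

Result: On challenging mathematical benchmarks, DeepCompress consistently outperforms baseline methods, improving both accuracy and token efficiency.

Insight: Reasoning-chain length should adapt to problem difficulty rather than always favoring shorter paths; longer chains are more effective on hard problems.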

Abstract: Large Reasoning Models (LRMs) have demonstrated impressive capabilities but suffer from cognitive inefficiencies like “overthinking” simple problems and “underthinking” complex ones. While existing methods that use supervised fine-tuning (SFT) or reinforcement learning (RL) with token-length rewards can improve efficiency, they often do so at the cost of accuracy. This paper introduces DeepCompress, a novel framework that simultaneously enhances both the accuracy and efficiency of LRMs. We challenge the prevailing approach of consistently favoring shorter reasoning paths, showing that longer responses can contain a broader range of correct solutions for difficult problems. DeepCompress employs an adaptive length reward mechanism that dynamically classifies problems as “Simple” or “Hard” in real time based on the model’s evolving capability. It encourages shorter, more efficient reasoning for “Simple” problems while promoting longer, more exploratory thought chains for “Hard” problems. This dual-reward strategy enables the model to autonomously adjust its Chain-of-Thought (CoT) length, compressing reasoning for well-mastered problems and extending it for those it finds challenging. Experimental results on challenging mathematical benchmarks show that DeepCompress consistently outperforms baseline methods, achieving superior accuracy while significantly improving token efficiency.
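
The abstract above describes the adaptive length reward only in words. The sketch below is one possible reading, under explicit assumptions that are not from the paper: a problem counts as Simple when the accuracy of its sampled group is at least a running capability estimate, response length is normalized within the group, and the reward is linear with scale `beta`.

```python
def length_reward(length, group_lengths, group_accuracy, capability, beta=0.5):
    """Length-based reward for one sampled response (illustrative only).

    Assumptions, not the paper's formula:
      - Simple problem: group_accuracy >= capability (a running estimate
        of how well the model currently does overall); otherwise Hard.
      - Simple problems reward shorter responses, Hard problems longer ones.
    """
    shortest, longest = min(group_lengths), max(group_lengths)
    if longest == shortest:
        return 0.0  # no length signal when all responses are equally long
    relative = (length - shortest) / (longest - shortest)  # 0 = shortest, 1 = longest
    is_simple = group_accuracy >= capability
    if is_simple:
        return beta * (1.0 - 2.0 * relative)  # +beta for shortest, -beta for longest
    return beta * (2.0 * relative - 1.0)      # +beta for longest, -beta for shortest


lengths = [100, 300, 500]  # token counts of three sampled responses
simple_short = length_reward(100, lengths, group_accuracy=0.9, capability=0.6)
simple_long = length_reward(500, lengths, group_accuracy=0.9, capability=0.6)
hard_long = length_reward(500, lengths, group_accuracy=0.2, capability=0.6)
```

Because the Simple or Hard decision depends on a capability estimate that moves during training, the same problem can switch from rewarding long chains to rewarding short ones once the model masters it.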

[95] SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning

Ali Asgarov,Umid Suleymanov,Aadyant Khatri

Main category: cs.AI

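TL;DR: SIGMA is a multi-agent framework that integrates knowledge through independent reasoning, targeted search, and a moderator mechanism, substantially improving performance on mathematical reasoning problems.

Details Motivation: When solving mathematical reasoning problems, current retrieval-augmented models rely on a single perspective, follow inflexible search strategies, and struggle to combine information from multiple sources.

Contribution: Proposes SIGMA, a framework in which coordinated agents integrate knowledge in a context-sensitive and computation-efficient way, improving mathematical reasoning performance.

Method: Specialized agents divide the work; each generates hypothetical passages to optimize retrieval, and a moderator mechanism synthesizes their findings.

Result: On benchmarks including MATH500, AIME, and GPQA, SIGMA achieves an absolute improvement of 7.4% and outperforms both open- and closed-source systems.

Insight: Multi-agent, on-demand knowledge integration is an effective approach to complex reasoning tasks, balancing accuracy and efficiency.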

Abstract: Solving mathematical reasoning problems requires not only accurate access to relevant knowledge but also careful, multi-step thinking. However, current retrieval-augmented models often rely on a single perspective, follow inflexible search strategies, and struggle to effectively combine information from multiple sources. We introduce SIGMA (Search-Augmented On-Demand Knowledge Integration for AGentic Mathematical reAsoning), a unified framework that orchestrates specialized agents to independently reason, perform targeted searches, and synthesize findings through a moderator mechanism. Each agent generates hypothetical passages to optimize retrieval for its analytic perspective, ensuring knowledge integration is both context-sensitive and computation-efficient. When evaluated on challenging benchmarks such as MATH500, AIME, and PhD-level science QA GPQA, SIGMA consistently outperforms both open- and closed-source systems, achieving an absolute performance improvement of 7.4%. Our results demonstrate that multi-agent, on-demand knowledge integration significantly enhances both reasoning accuracy and efficiency, offering a scalable approach for complex, knowledge-intensive problem-solving. We will release the code upon publication.
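
The retrieval step described above, where each agent writes a hypothetical passage and uses it as the search query, can be illustrated with a toy sketch. Everything here is an assumption for illustration: word-overlap (Jaccard) scoring stands in for a real retriever, the hypothetical passages are hard-coded rather than generated by an LLM, and `moderate` simply deduplicates findings instead of implementing the paper's moderator mechanism.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve(hypothetical_passage, corpus):
    """Return the corpus document most similar to the hypothetical passage."""
    return max(corpus, key=lambda doc: jaccard(hypothetical_passage, doc))


def moderate(findings):
    """Toy moderator: merge agent findings, dropping duplicates in order."""
    return list(dict.fromkeys(findings))


if __name__ == "__main__":
    corpus = [
        "the derivative of x squared is 2x",
        "a triangle has three sides",
    ]
    # In a real system each passage would be written by a different agent.
    agent_passages = [
        "derivative of x squared",
        "the derivative of a squared term",
    ]
    findings = [retrieve(p, corpus) for p in agent_passages]
    print(moderate(findings))
```

Querying with a passage that resembles the desired answer, rather than with the raw question, is what lets each agent steer retrieval toward its own analytic perspective.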

[96] Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning

Qiusi Zhan,Hyeonjeong Ha,Rui Yang,Sirui Xu,Hanyang Chen,Liang-Yan Gui,Yu-Xiong Wang,Huan Zhang,Heng Ji,Daniel Kang

Main category: cs.AI

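TL;DR: The paper proposes BEAT, the first framework to implant visually triggered backdoors in embodied agents driven by multimodal large language models (MLLMs); its Contrastive Trigger Learning (CTL) stage makes backdoor activation markedly more precise.

Details Motivation: MLLM-driven embodied agents gain direct perception for task planning and decision making, but this also opens a new attack surface: visual backdoor attacks. When an attacker places a trigger object in the environment, the agent executes a malicious multi-step policy. The paper explores this under-studied security risk.

Contribution: 1) Proposes BEAT, the first visual backdoor attack framework targeting MLLM-based embodied agents; 2) introduces Contrastive Trigger Learning (CTL), which uses preference learning to explicitly distinguish trigger-present from trigger-free inputs and make backdoor activation more precise; 3) demonstrates the effectiveness of the attack across multiple benchmarks (success rates up to 80%).

Method: 1) Builds a training set spanning diverse scenes, tasks, and trigger placements; 2) uses a two-stage training scheme: supervised fine-tuning (SFT) followed by CTL, which sharpens trigger discrimination through preference learning between trigger-present and trigger-free inputs.

Result: Across several MLLMs and embodied-agent benchmarks, BEAT reaches attack success rates of up to 80% while maintaining benign task performance. With limited backdoor data, CTL improves activation accuracy by up to 39% over naive SFT.

Insight: Visual backdoor attacks pose a serious and largely unexplored threat to MLLM-based embodied agents, underscoring the need for robust defenses before real-world deployment.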

Abstract: Multimodal large language models (MLLMs) have advanced embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs. However, such vision-driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy. We introduce BEAT, the first framework to inject such visual backdoors into MLLM-based embodied agents using objects in the environments as triggers. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation. Across various embodied agent benchmarks and MLLMs, BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance, and generalizes reliably to out-of-distribution trigger placements. Notably, compared to naive SFT, CTL boosts backdoor activation accuracy by up to 39% under limited backdoor data. These findings expose a critical yet unexplored security risk in MLLM-based embodied agents, underscoring the need for robust defenses before real-world deployment.

[97] GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation

Tao Liu,Chongyu Wang,Rongjie Li,Yingchen Yu,Xuming He,Bai Song

Main category: cs.AI

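TL;DR: GUI-Rise is a framework for strengthening GUI navigation agents that combines structured reasoning, action prediction, and history summarization; trained with supervised fine-tuning and reinforcement learning, it performs especially well in out-of-domain scenarios.

Details Motivation: Current multimodal large language models (MLLMs) used as GUI navigation agents are limited in cross-domain generalization and in how effectively they use interaction history. The paper addresses both through structured reasoning and history summarization.

Contribution: 1. Proposes a reasoning-enhanced framework that combines structured reasoning with compact history summaries; 2. trains the GUI-Rise agent with supervised fine-tuning and GRPO reinforcement learning; 3. achieves state-of-the-art results on standard benchmarks.

Method: 1. Generate coherent Chain-of-Thought analyses; 2. combine action prediction with history summarization; 3. train the model with supervised fine-tuning and GRPO reinforcement learning.

Result: Achieves state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios.

Insight: Structured reasoning and history summarization improve the generalization and task understanding of GUI navigation agents.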

Abstract: While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization. The structured reasoning component generates coherent Chain-of-Thought analyses combining progress estimation and decision reasoning, which inform both immediate action predictions and compact history summaries for future steps. Based on this framework, we train a GUI agent, GUI-Rise, through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO). This framework employs specialized rewards, including a history-aware objective, directly linking summary quality to subsequent action performance. Comprehensive evaluations on standard benchmarks demonstrate state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios. These findings validate our framework’s ability to maintain robust reasoning and generalization across diverse GUI navigation tasks. Code is available at https://leon022.github.io/GUI-Rise.
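
For readers unfamiliar with GRPO, the core of the method is that each sampled response is scored relative to the other responses drawn for the same prompt, so no learned value function is needed. The sketch below shows only that group-relative normalization; the function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions, and the paper's specialized rewards (including the history-aware objective) are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is credited only for being better than its siblings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical rollouts for one GUI step, with scalar rewards.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is why reward designs with finer-grained terms tend to matter in this setting.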