cs.CL

[1] ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing

Jinzhi Wang,Qingke Peng,Haozhou Li,Zeyuan Zeng,Qinfeng Song,Kaixuan Yang,Jiangbo Zhang,Yaoying Wang,Ruimeng Li,Biyi Zhou

Main category: cs.CL

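TL;DR: The paper introduces ElectriQ, a benchmark for evaluating how well large language models respond in electric power marketing scenarios. It builds a dialogue dataset covering six service categories, defines four evaluation metrics, and adds a domain knowledge base with a knowledge augmentation method. Experiments show that fine-tuned, augmented small models such as LLama3-8B can surpass GPT-4o.

Motivation: Current power marketing customer service systems, such as China's 95598 hotline, suffer from slow responses and rigid procedures, while general-purpose LLMs lack the domain expertise and empathy the field requires and need targeted optimization.

Contribution: Proposes ElectriQ, the first benchmark for power marketing scenarios, comprising a dialogue dataset, evaluation metrics, and a knowledge augmentation method, as a foundation for developing domain-tailored models.

Method: Builds a dialogue dataset spanning six service categories, defines four evaluation metrics (professionalism, popularity, readability, and user-friendliness), and augments models with a domain-specific knowledge base.

Result: Experiments on 13 LLMs show that fine-tuned, knowledge-augmented small models such as LLama3-8B surpass GPT-4o in professionalism and user-friendliness.

Insight: With domain knowledge augmentation and fine-tuning, small models can outperform general-purpose large models on specific tasks, which underlines the value of domain customization.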

Abstract: Electric power marketing customer service plays a critical role in addressing inquiries, complaints, and service requests. However, current systems, such as China’s 95598 hotline, often struggle with slow response times, inflexible procedures, and limited accuracy in domain-specific tasks. While large language models (LLMs) like GPT-4o and Claude 3 demonstrate strong general capabilities, they lack the domain expertise and empathy required in this field. To bridge this gap, we introduce ElectriQ, the first benchmark designed to evaluate and enhance LLMs in electric power marketing scenarios. ElectriQ consists of a dialogue dataset covering six key service categories and introduces four evaluation metrics: professionalism, popularity, readability, and user-friendliness. We further incorporate a domain-specific knowledge base and propose a knowledge augmentation method to boost model performance. Experiments on 13 LLMs reveal that smaller models such as LLama3-8B, when fine-tuned and augmented, can surpass GPT-4o in terms of professionalism and user-friendliness. ElectriQ establishes a comprehensive foundation for developing LLMs tailored to the needs of power marketing services.

[2] A Language Model-Driven Semi-Supervised Ensemble Framework for Illicit Market Detection Across Deep/Dark Web and Social Platforms

Navid Yazdanjue,Morteza Rakhshaninejad,Hossein Yazdanjouei,Mohammad Sadegh Khorshidi,Mikko S. Niemela,Fang Chen,Amir H. Gandomi

Main category: cs.CL

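TL;DR: The paper proposes a framework that combines fine-tuned language models with semi-supervised ensemble learning to detect and classify illicit marketplace content on the deep web, the dark web, and social platforms, using two-stage classification and several feature extraction methods.

Motivation: Illicit market activity is growing across the deep web, the dark web, and social platforms. Detecting and classifying it is hard because labeled data is scarce, the language is complex and evolving, and the platforms are structurally heterogeneous.

Contribution: 1) A hierarchical classification framework that combines language models with semi-supervised ensemble learning; 2) semantic representations extracted with ModernBERT and complemented by manually engineered features; 3) strong results on the two-stage classification task, outperforming several baselines.

Method: 1) Extract semantic representations with ModernBERT; 2) add manually engineered features (document structure, embedded patterns such as Bitcoin addresses, emails, and IPs, and metadata); 3) classify in two stages: a semi-supervised ensemble of XGBoost, Random Forest, and SVM with entropy-based weighted voting first detects sales-related documents, and a second stage classifies them into drug, weapon, or credential sales.

Result: On three datasets (the authors' multi-source corpus, DUTA, and CoDA), the model reaches an accuracy of 0.96489, an F1-score of 0.93467, and a TMCC of 0.95388, outperforming baselines including BERT, ModernBERT, DarkBERT, ALBERT, Longformer, and BigBird.

Insight: Combining language model embeddings with manually engineered features improves performance; semi-supervised ensemble learning is robust when labels are scarce; a hierarchical classification strategy suits complex tasks.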

Abstract: Illegal marketplaces have increasingly shifted to concealed parts of the internet, including the deep and dark web, as well as platforms such as Telegram, Reddit, and Pastebin. These channels enable the anonymous trade of illicit goods including drugs, weapons, and stolen credentials. Detecting and categorizing such content remains challenging due to limited labeled data, the evolving nature of illicit language, and the structural heterogeneity of online sources. This paper presents a hierarchical classification framework that combines fine-tuned language models with a semi-supervised ensemble learning strategy to detect and classify illicit marketplace content across diverse platforms. We extract semantic representations using ModernBERT, a transformer model for long documents, finetuned on domain-specific data from deep and dark web pages, Telegram channels, Subreddits, and Pastebin pastes to capture specialized jargon and ambiguous linguistic patterns. In addition, we incorporate manually engineered features such as document structure, embedded patterns including Bitcoin addresses, emails, and IPs, and metadata, which complement language model embeddings. The classification pipeline operates in two stages. The first stage uses a semi-supervised ensemble of XGBoost, Random Forest, and SVM with entropy-based weighted voting to detect sales-related documents. The second stage further classifies these into drug, weapon, or credential sales. Experiments on three datasets, including our multi-source corpus, DUTA, and CoDA, show that our model outperforms several baselines, including BERT, ModernBERT, DarkBERT, ALBERT, Longformer, and BigBird. The model achieves an accuracy of 0.96489, an F1-score of 0.93467, and a TMCC of 0.95388, demonstrating strong generalization, robustness under limited supervision, and effectiveness in real-world illicit content detection.
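
The entropy-based weighted voting used in the first stage can be illustrated with a small sketch. The abstract does not give the weighting formula, so the rule below (weight equals one minus the normalized entropy of each classifier's predicted distribution) is an assumption, as are the function names and the example probabilities.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_weighted_vote(prob_vectors):
    """Combine per-classifier class probabilities for one document.

    Assumed rule: weight = 1 - H(p) / log(K), so confident (low-entropy)
    classifiers count more. The paper's exact weighting may differ.
    """
    n_classes = len(prob_vectors[0])
    max_entropy = math.log(n_classes)
    weights = [max(0.0, 1.0 - entropy(p) / max_entropy) for p in prob_vectors]
    total = sum(weights)
    if total < 1e-12:  # every classifier is maximally uncertain: fall back to equal weights
        weights = [1.0] * len(prob_vectors)
        total = float(len(prob_vectors))
    combined = [
        sum(w * p[k] for w, p in zip(weights, prob_vectors)) / total
        for k in range(n_classes)
    ]
    return combined.index(max(combined)), combined


# Hypothetical outputs of three classifiers for the classes [not-sale, sale].
votes = [[0.10, 0.90], [0.45, 0.55], [0.60, 0.40]]
label, scores = entropy_weighted_vote(votes)
```

Here the first classifier is far more confident than the other two, so it dominates the combined score and the document is labeled as sales-related (class index 1).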

[3] Full Triple Matcher: Integrating all triple elements between heterogeneous Knowledge Graphs

Victor Eiti Yamamoto,Hideaki Takeda

Main category: cs.CL

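TL;DR: The paper proposes a method for integrating all triple elements across heterogeneous knowledge graphs, with a focus on the under-explored problem of context matching.

Motivation: Existing knowledge graph integration methods concentrate on schema and identity matching, while context matching has received little attention. Real-world knowledge graphs differ widely in source, size, and information density, so existing methods can fall short when complex contexts must be integrated.

Contribution: 1. A new method that combines label matching with triple matching; 2. alignment of entity and predicate labels through string manipulation, fuzzy matching, and vector similarity; 3. a new dataset, derived from the benchmark dataset, for evaluating triple matching more comprehensively.

Method: Label matching (string manipulation, fuzzy matching, and vector similarity) aligns entity and predicate labels; mappings between triples that convey comparable information are then used to improve entity-matching accuracy.

Result: The approach is competitive with leading systems in the OAEI competition and with supervised methods, achieving high accuracy across diverse test cases.

Insight: Context matching is an important direction for knowledge graph integration, and combining label matching with triple matching can improve entity matching.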

Abstract: Knowledge graphs (KGs) are powerful tools for representing and reasoning over structured information. Their main components include schema, identity, and context. While schema and identity matching are well-established in ontology and entity matching research, context matching remains largely unexplored. This is particularly important because real-world KGs often vary significantly in source, size, and information density - factors not typically represented in the datasets on which current entity matching methods are evaluated. As a result, existing approaches may fall short in scenarios where diverse and complex contexts need to be integrated. To address this gap, we propose a novel KG integration method consisting of label matching and triple matching. We use string manipulation, fuzzy matching, and vector similarity techniques to align entity and predicate labels. Next, we identify mappings between triples that convey comparable information, using these mappings to improve entity-matching accuracy. Our approach demonstrates competitive performance compared to leading systems in the OAEI competition and against supervised methods, achieving high accuracy across diverse test cases. Additionally, we introduce a new dataset derived from the benchmark dataset to evaluate the triple-matching step more comprehensively.
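
The label-matching step (string manipulation followed by fuzzy matching) can be sketched as follows. This is only an illustration using Python's standard `difflib`; the normalization rules, the 0.85 threshold, and the function names are assumptions, and the vector-similarity component described in the abstract is omitted.

```python
import re
from difflib import SequenceMatcher


def normalize(label):
    """String manipulation: split camelCase, replace separators, lowercase."""
    label = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    label = re.sub(r"[_\-]+", " ", label)
    return " ".join(label.lower().split())


def match_labels(source, target, threshold=0.85):
    """Map each source label to its most similar target label, if similar enough."""
    mapping = {}
    for s in source:
        best, best_score = None, 0.0
        for t in target:
            score = SequenceMatcher(None, normalize(s), normalize(t)).ratio()
            if score > best_score:
                best, best_score = t, score
        if best is not None and best_score >= threshold:
            mapping[s] = best
    return mapping


# Hypothetical predicate labels from two knowledge graphs.
pairs = match_labels(
    ["birthPlace", "date_of_birth"],
    ["birth place", "date of birth", "spouse"],
)
```

After normalization both source labels match a target label exactly, so `pairs` maps `birthPlace` to `birth place` and `date_of_birth` to `date of birth`.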

[4] Theoretical Foundations and Mitigation of Hallucination in Large Language Models

Esmail Gumaan

Main category: cs.CL

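TL;DR: The paper gives a systematic theoretical analysis of hallucination in large language models (LLMs), defines a hallucination risk, and discusses detection and mitigation strategies, offering a theoretical and practical basis for reducing hallucinations.

Motivation: Hallucination, the generation of content that is unfaithful to the input or to facts, is a serious problem that limits the usefulness and reliability of LLMs. The author aims to address it systematically through theoretical analysis and methodology.

Contribution: 1. Formal definitions of hallucination, its categories (intrinsic and extrinsic), and hallucination risk; 2. theoretical bounds on hallucination risk derived with PAC-Bayes and Rademacher complexity; 3. a unified detection and mitigation workflow.

Method: 1. Theoretical analysis: quantify hallucination risk within learning-theoretic frameworks; 2. detection strategies: token-level uncertainty estimation, confidence calibration, and attention alignment checks; 3. mitigation approaches: retrieval-augmented generation, hallucination-aware fine-tuning, logit calibration, and fact-verification modules.

Result: The paper proposes a unified workflow that integrates these strategies and outlines evaluation protocols for hallucination, recommending datasets, metrics, and experimental setups.

Insight: The theoretical framework offers a new perspective on hallucination, while the practical guidelines support reliable LLM deployment. The work stresses combining multiple strategies to address hallucination comprehensively.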

Abstract: Hallucination in Large Language Models (LLMs) refers to the generation of content that is not faithful to the input or the real-world facts. This paper provides a rigorous treatment of hallucination in LLMs, including formal definitions and theoretical analyses. We distinguish between intrinsic and extrinsic hallucinations, and define a hallucination risk for models. We derive bounds on this risk using learning-theoretic frameworks (PAC-Bayes and Rademacher complexity). We then survey detection strategies for hallucinations, such as token-level uncertainty estimation, confidence calibration, and attention alignment checks. On the mitigation side, we discuss approaches including retrieval-augmented generation, hallucination-aware fine-tuning, logit calibration, and the incorporation of fact-verification modules. We propose a unified detection and mitigation workflow, illustrated with a diagram, to integrate these strategies. Finally, we outline evaluation protocols for hallucination, recommending datasets, metrics, and experimental setups to quantify and reduce hallucinations. Our work lays a theoretical foundation and practical guidelines for addressing the crucial challenge of hallucination in LLMs.
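
Token-level uncertainty estimation, one of the detection strategies surveyed above, can be sketched with predictive entropy. This is a generic illustration rather than the paper's procedure: the per-step distributions, the 1.0-nat threshold, and the function names are all assumptions.

```python
import math


def token_entropy(dist):
    """Entropy (in nats) of one next-token distribution given as {token: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def flag_uncertain_tokens(step_dists, chosen, threshold=1.0):
    """Return (token, entropy) pairs for steps whose entropy exceeds the threshold."""
    flagged = []
    for dist, tok in zip(step_dists, chosen):
        h = token_entropy(dist)
        if h > threshold:
            flagged.append((tok, round(h, 3)))
    return flagged


# Hypothetical distributions for two generation steps.
step_dists = [
    {"Paris": 0.97, "Lyon": 0.03},                             # confident step
    {"1889": 0.3, "1887": 0.3, "1890": 0.2, "1901": 0.2},      # uncertain step
]
flagged = flag_uncertain_tokens(step_dists, ["Paris", "1889"])
```

Only the second step is flagged, since its probability mass is spread over several candidate tokens; such spans would be candidates for retrieval or fact verification in a mitigation workflow.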

[5] Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey

Jindong Li,Yali Fu,Jiahong Liu,Linxiao Cao,Wei Ji,Menglin Yang,Irwin King,Ming-Hsuan Yang

Main category: cs.CL

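TL;DR: This is the first survey to systematically examine discrete tokenization for multimodal large language models (LLMs). It proposes a taxonomy, analyzes 8 representative vector quantization (VQ) techniques, and discusses their algorithmic principles, training dynamics, and the challenges of integrating them with LLM pipelines.

Motivation: As LLMs advance rapidly, there is a growing need to convert continuous multimodal data into discrete representations suitable for language-based processing, yet a systematic survey of discrete tokenization techniques has been missing.

Contribution: The survey fills this gap with the first structured taxonomy and analysis of discrete tokenization methods for LLMs, covering 8 VQ variants and discussing how quantization strategies affect model performance, along with key challenges.

Method: Categorizes classical and modern VQ techniques, analyzes their algorithms, and reviews applications without LLMs, in LLM-based single-modality systems, and in LLM-based multimodal systems.

Result: The survey shows how VQ techniques apply to LLMs and identifies key problems including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.

Insight: Future directions include dynamic and task-adaptive quantization, unified tokenization frameworks, and biologically inspired codebook learning, all of which support efficient and generalizable multimodal systems.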

Abstract: The rapid advancement of large language models (LLMs) has intensified the need for effective mechanisms to transform continuous multimodal data into discrete representations suitable for language-based processing. Discrete tokenization, with vector quantization (VQ) as a central approach, offers both computational efficiency and compatibility with LLM architectures. Despite its growing importance, there is a lack of a comprehensive survey that systematically examines VQ techniques in the context of LLM-based systems. This work fills this gap by presenting the first structured taxonomy and analysis of discrete tokenization methods designed for LLMs. We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. Beyond algorithm-level investigation, we discuss existing research in terms of classical applications without LLMs, LLM-based single-modality systems, and LLM-based multimodal systems, highlighting how quantization strategies influence alignment, reasoning, and generation performance. In addition, we identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints. Finally, we discuss emerging research directions such as dynamic and task-adaptive quantization, unified tokenization frameworks, and biologically inspired codebook learning. This survey bridges the gap between traditional vector quantization and modern LLM applications, serving as a foundational reference for the development of efficient and generalizable multimodal systems. A continuously updated version is available at: https://github.com/jindongli-Ai/LLM-Discrete-Tokenization-Survey.
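
The core operation shared by VQ tokenizers, mapping each continuous feature vector to the index of its nearest codebook entry, can be sketched in a few lines. The codebook, the feature vectors, and the usage metric below are illustrative assumptions and are not taken from any of the surveyed methods.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
        for v in vectors
    ]


def codebook_usage(tokens, codebook_size):
    """Fraction of codebook entries used at least once."""
    return len(set(tokens)) / codebook_size


# Hypothetical 2-D codebook with four entries and four encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [-3.0, 2.0]]
features = [[0.1, -0.2], [0.9, 1.2], [0.2, 0.1], [1.1, 0.8]]
tokens = quantize(features, codebook)           # [0, 1, 0, 1]
usage = codebook_usage(tokens, len(codebook))   # 0.5
```

A persistently low usage fraction during training is one simple symptom of the codebook collapse problem mentioned in the abstract, where only a few entries are ever selected.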

[6] Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents

Haoran Sun,Shaoning Zeng

Main category: cs.CL

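TL;DR: The paper proposes a Hierarchical Memory (H-MEM) architecture to improve the long-term reasoning of LLM agents. It organizes memory by multiple levels of semantic abstraction and adds an index-based routing mechanism that makes memory retrieval more efficient.

Motivation: Existing memory mechanisms for LLM agents fall short in structured organization and efficient retrieval, which limits long-term reasoning. The paper addresses these limitations with a hierarchical memory architecture.

Contribution: The main contribution is the H-MEM architecture, which organizes memory by level of semantic abstraction and uses an index-based routing mechanism for efficient retrieval.

Method: Memory is organized in layers by degree of semantic abstraction. Each memory vector carries a positional index encoding that points to its semantically related sub-memories in the next layer, and at reasoning time an index-based routing mechanism retrieves memory layer by layer.

Result: On five task settings from the LoCoMo dataset, H-MEM consistently outperforms five baseline methods, demonstrating its effectiveness in long-term dialogue scenarios.

Insight: Multi-level memory organization combined with index-based routing improves the memory retrieval efficiency and long-term reasoning of LLM agents.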

Abstract: Long-term memory is one of the key factors influencing the reasoning capabilities of Large Language Model Agents (LLM Agents). Incorporating a memory mechanism that effectively integrates past interactions can significantly enhance decision-making and contextual coherence of LLM Agents. While recent works have made progress in memory storage and retrieval, such as encoding memory into dense vectors for similarity-based search or organizing knowledge in the form of graph, these approaches often fall short in structured memory organization and efficient retrieval. To address these limitations, we propose a Hierarchical Memory (H-MEM) architecture for LLM Agents that organizes and updates memory in a multi-level fashion based on the degree of semantic abstraction. Each memory vector is embedded with a positional index encoding pointing to its semantically related sub-memories in the next layer. During the reasoning phase, an index-based routing mechanism enables efficient, layer-by-layer retrieval without performing exhaustive similarity computations. We evaluate our method on five task settings from the LoCoMo dataset. Experimental results show that our approach consistently outperforms five baseline methods, demonstrating its effectiveness in long-term dialogue scenarios.
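
The layer-by-layer, index-based retrieval described above can be sketched as follows. This is a simplified reading of the abstract, not the authors' implementation: the `MemoryNode` structure, the cosine scoring, the `top_k` parameter, and the example memories are all assumptions.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    text: str
    vector: list
    children: list = field(default_factory=list)  # indices into the next layer


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def route(layers, query, top_k=1):
    """Descend the hierarchy, scoring only the children of nodes kept so far."""
    candidates = list(range(len(layers[0])))
    for depth, layer in enumerate(layers):
        ranked = sorted(
            candidates, key=lambda i: cosine(layer[i].vector, query), reverse=True
        )[:top_k]
        if depth == len(layers) - 1:
            return [layer[i].text for i in ranked]
        candidates = [c for i in ranked for c in layer[i].children]
    return []


# Hypothetical two-layer memory: abstract topics on top, concrete episodes below.
layers = [
    [MemoryNode("travel", [1.0, 0.0], [0, 1]), MemoryNode("work", [0.0, 1.0], [2])],
    [
        MemoryNode("Booked a flight to Oslo", [0.9, 0.1]),
        MemoryNode("Prefers window seats", [0.8, 0.3]),
        MemoryNode("Deadline moved to Friday", [0.1, 0.9]),
    ],
]
result = route(layers, [1.0, 0.05])
```

Because the query is routed to the "travel" node first, the work-related episode in the bottom layer is never scored, which is the source of the efficiency gain over exhaustive similarity search.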

[7] Multi-Relation Extraction in Entity Pairs using Global Context

Nilesh,Atul Gupta,Avinash C Panday

Main category: cs.CL

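TL;DR: The paper proposes a novel input embedding approach that builds a global context by capturing where entities occur throughout a document, so that relations between entities can be predicted more accurately in document-level relation extraction.

Motivation: Existing methods look only at the sentences in which entities are mentioned and fail to capture the document-wide context, which makes relation extraction inaccurate.

Contribution: A new input embedding approach that uses global relationships and multi-sentence reasoning to improve document-level relation extraction.

Method: Entities are represented as standalone segments, independent of their positions within the document, so that mentions across the whole document are captured rather than only the span where they appear.

Result: The method is evaluated on three benchmark datasets: DocRED, Re-DocRED, and REBEL.

Insight: Global context modeling and multi-sentence reasoning matter for document-level relation extraction.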

Abstract: In document-level relation extraction, entities may appear multiple times in a document, and their relationships can shift from one context to another. Accurate prediction of the relationship between two entities across an entire document requires building a global context spanning all relevant sentences. Previous approaches have focused only on the sentences where entities are mentioned, which fails to capture the complete document context necessary for accurate relation extraction. Therefore, this paper introduces a novel input embedding approach to capture the positions of mentioned entities throughout the document rather than focusing solely on the span where they appear. The proposed input encoding approach leverages global relationships and multi-sentence reasoning by representing entities as standalone segments, independent of their positions within the document. The performance of the proposed method has been tested on three benchmark relation extraction datasets, namely DocRED, Re-DocRED, and REBEL. The experimental results demonstrated that the proposed method accurately predicts relationships between entities in a document-level setting. The proposed research also has theoretical and practical implications. Theoretically, it advances global context modeling and multi-sentence reasoning in document-level relation extraction. Practically, it enhances relationship detection, enabling improved performance in real-world NLP applications requiring comprehensive entity-level insights and interpretability.

[8] PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation

Zhehao Tan,Yihan Jiao,Dan Yang,Lei Liu,Jie Feng,Duolin Sun,Yue Shen,Jian Wang,Peng Wei,Jinjie Gu

Main category: cs.CL

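TL;DR: The paper introduces the PRGB benchmark for fine-grained evaluation of language model capabilities in retrieval-augmented generation (RAG), using multi-dimensional analysis and a placeholder-based method to decouple the model's parametric knowledge from external knowledge.

Motivation: Existing RAG benchmarks mostly measure overall system performance and lack fine-grained evaluation of language model capabilities, especially document utilization.

Contribution: Proposes PRGB, a multi-level fine-grained benchmark emphasizing filtering, combination, and reference reasoning abilities, together with a placeholder-based approach that separates the contributions of the LLM's parametric knowledge and the external knowledge.

Method: A placeholder-based, multi-dimensional evaluation framework that tests filtering, combination, and reference reasoning abilities.

Result: Experiments show that representative LLMs have limited generation capabilities within RAG systems, particularly in error resilience and context faithfulness.

Insight: PRGB provides a reproducible evaluation framework for developing more reliable and efficient RAG systems and highlights the value of fine-grained capability analysis.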

Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, where the LLM’s ability to generate responses based on the combination of a given query and retrieved documents is crucial. However, most benchmarks focus on overall RAG system performance, rarely assessing LLM-specific capabilities. Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization. To this end, we introduce Placeholder-RAG-Benchmark, a multi-level fine-grained benchmark, emphasizing the following progressive dimensions: (1) multi-level filtering abilities, (2) combination abilities, and (3) reference reasoning. To provide a more nuanced understanding of LLMs’ roles in RAG systems, we formulate an innovative placeholder-based approach to decouple the contributions of the LLM’s parametric knowledge and the external knowledge. Experiments demonstrate the limitations of representative LLMs in the RAG system’s generation capabilities, particularly in error resilience and context faithfulness. Our benchmark provides a reproducible framework for developing more reliable and efficient RAG systems. Our code is available at https://github.com/Alipay-Med/PRGB.
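
One way to picture a placeholder-based evaluation is to fill document and question templates with invented values, so that a correct answer can only come from the retrieved context and not from parametric knowledge. The sketch below is a hypothetical reading of that idea; the template format, the function names, and the substring-based check are assumptions, and the actual construction is defined in the authors' repository.

```python
def build_placeholder_sample(doc_templates, question_template, answer_slot, bindings):
    """Instantiate templates with invented values bound to named placeholders."""
    docs = [d.format(**bindings) for d in doc_templates]
    question = question_template.format(**bindings)
    return {"docs": docs, "question": question, "expected": bindings[answer_slot]}


def is_context_faithful(model_answer, sample):
    """Crude check: the answer must contain the value that appears only in context."""
    return sample["expected"].lower() in model_answer.lower()


# Invented entity values, so the model cannot rely on memorized facts.
sample = build_placeholder_sample(
    doc_templates=["{company} was founded in {year} by {founder}."],
    question_template="Who founded {company}?",
    answer_slot="founder",
    bindings={"company": "Zorvex Labs", "year": "2031", "founder": "Mira Talvik"},
)
faithful = is_context_faithful("It was founded by Mira Talvik.", sample)
```

Swapping the bindings while keeping the templates fixed yields many variants of the same test item, which makes it possible to see whether a model follows the context or falls back on what it memorized.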

[9] How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

Xi Chen,Aske Plaat,Niki van Stein

Main category: cs.CL

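TL;DR: Using sparse autoencoders and activation patching, the paper studies the internal mechanisms of chain-of-thought (CoT) prompting in language models and finds that CoT features in the higher-capacity model (Pythia-2.8B) are more modular and interpretable.

Motivation: CoT prompting improves language model accuracy on multi-step tasks, but it is unresolved whether the generated "thoughts" reflect the true internal reasoning process. The paper addresses this question with a causal study.

Contribution: Presents the first feature-level causal study of CoT faithfulness, shows that CoT features are more modular and interpretable in the higher-capacity model, and introduces patch-curves and random-feature patching baselines.

Method: Combines sparse autoencoders with activation patching to extract monosemantic features from Pythia-70M and Pythia-2.8B on GSM8K math problems, comparing feature behavior under CoT and plain (noCoT) prompting.

Result: In Pythia-2.8B, swapping a small set of CoT-reasoning features into a noCoT run significantly raises answer log-probabilities, with no reliable effect in Pythia-70M. The abstract cites the model's confidence in correct answers improving from 1.2 to 4.3. CoT also yields higher activation sparsity and feature interpretability scores in the larger model.

Insight: CoT prompting is more effective in high-capacity language models, where it induces more modular and interpretable internal structure, supporting its role as a structured prompting method.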

Abstract: Chain-of-thought (CoT) prompting boosts the accuracy of Large Language Models on multi-step tasks, yet whether the generated “thoughts” reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in 70M, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model’s confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not only present in the top-K patches but widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.

[10] EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow

Xiaoyu Pan,Yang Bai,Ke Zou,Yang Zhou,Jun Zhou,Huazhu Fu,Yih-Chung Tham,Yong Liu

Main category: cs.CL

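TL;DR: The paper introduces EH-Benchmark, which evaluates hallucinations in ophthalmic Medical Large Language Models (MLLMs), and a multi-stage agent-driven framework that significantly reduces hallucinations and improves diagnostic accuracy and reliability.

Motivation: Ophthalmic MLLMs hallucinate because of limited domain knowledge, weak visual localization and reasoning, and scarce data, which undermines precise diagnosis. Existing medical benchmarks neither evaluate these hallucinations effectively nor offer ways to mitigate them.

Contribution: Proposes EH-Benchmark, which divides MLLM hallucinations into two primary classes, Visual Understanding and Logical Composition, each with subclasses, and designs a three-phase agent-driven framework covering knowledge retrieval, task-level case studies, and result validation.

Method: A multi-stage agent-driven framework with Knowledge-Level Retrieval, Task-Level Case Studies, and Result-Level Validation stages, which progressively strengthens MLLM reasoning to reduce hallucinations.

Result: Experiments show that the framework significantly mitigates both types of hallucination and improves accuracy, interpretability, and reliability.

Insight: Classifying hallucinations by task and error type and combining this with a multi-stage agent framework offers a new way to address MLLM hallucinations in medicine.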

Abstract: Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs’ hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at https://github.com/ppxy1/EH-Benchmark.

[11] Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection

Shalini Jangra,Suparna De,Nishanth Sastry,Saeed Fadaei

Main category: cs.CL

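TL;DR: The paper proposes a method for generating synthetic datasets for detecting self-disclosed Personal Information Identifiers (PIIs) on social media, addressing the shortage of labeled data. Synthetic data is generated with three LLMs and assessed for comparability with the original data, unlinkability, and indistinguishability.

Motivation: Social media contains many self-disclosures of PIIs, which create privacy risks and the threat of online harms. Research is hampered by the lack of open-source labeled datasets, so a method for generating safely shareable synthetic data is needed.

Contribution: 1. A taxonomy of 19 PII-revealing categories; 2. a synthetic PII-labeled dataset generated with three LLMs; 3. an evaluation of the synthetic data's utility (comparability, unlinkability, indistinguishability).

Method: Uses three large language models, Llama2-7B, Llama3-8B, and zephyr-7b-beta, with sequential instruction prompting to generate synthetic data that resembles the original Reddit posts. Quality is evaluated with three metrics: comparability of results from models trained on synthetic versus original data, unlinkability to the original users, and indistinguishability from the original posts.

Result: The synthetic dataset is evaluated against these three utility criteria and released together with the code; the abstract reports no quantitative results.

Insight: Synthetic data generation is a workable approach for privacy-sensitive research, especially where labeled data is lacking, and it supports reproducible research.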

Abstract: Social platforms such as Reddit have a network of communities of shared interests, with a prevalence of posts and comments from which one can infer users’ Personal Information Identifiers (PIIs). While such self-disclosures can lead to rewarding social interactions, they pose privacy risks and the threat of online harms. Research into the identification and retrieval of such risky self-disclosures of PIIs is hampered by the lack of open-source labeled datasets. To foster reproducible research into PII-revealing text detection, we develop a novel methodology to create synthetic equivalents of PII-revealing data that can be safely shared. Our contributions include creating a taxonomy of 19 PII-revealing categories for vulnerable populations and the creation and release of a synthetic PII-labeled multi-text span dataset generated from 3 text generation Large Language Models (LLMs), Llama2-7B, Llama3-8B, and zephyr-7b-beta, with sequential instruction prompting to resemble the original Reddit posts. The utility of our methodology to generate this synthetic dataset is evaluated with three metrics: First, we require reproducibility equivalence, i.e., results from training a model on the synthetic data should be comparable to those obtained by training the same models on the original posts. Second, we require that the synthetic data be unlinkable to the original users, through common mechanisms such as Google Search. Third, we wish to ensure that the synthetic data be indistinguishable from the original, i.e., trained humans should not be able to tell them apart. We release our dataset and code at https://netsys.surrey.ac.uk/datasets/synthetic-self-disclosure/ to foster reproducible research into PII privacy risks in online social media.

[12] FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification

Baptiste Lefort,Eric Benhamou,Beatrice Guez,Jean-Jacques Ohana,Ethan Setrouk,Alban Etienne

Main category: cs.CL

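TL;DR: The paper proposes a novel hierarchical framework for portfolio optimization that combines lightweight large language models (LLMs) with deep reinforcement learning (DRL), integrating sentiment signals from financial news with traditional market indicators.

Motivation: To improve portfolio optimization by integrating financial news sentiment with traditional market data, while addressing the challenge of fusing multimodal data.

Contribution: 1) A scalable method for cross-modal data integration; 2) a hierarchical reinforcement learning architecture for improved stability; 3) an open-source implementation for reproducibility.

Method: A three-tier architecture: base RL agents process hybrid data, meta-agents aggregate their decisions, and a super-agent merges decisions based on market data and sentiment analysis.

Result: On test data from 2018 to 2024, after training on 2000 to 2017, the framework achieves a 26% annualized return and a Sharpe ratio of 1.2, outperforming equal-weighted and S&P 500 benchmarks.

Insight: Combining sentiment analysis with market data can improve investment performance, and a hierarchical RL architecture helps stability and scalability.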

Abstract: This paper presents a novel hierarchical framework for portfolio optimization, integrating lightweight Large Language Models (LLMs) with Deep Reinforcement Learning (DRL) to combine sentiment signals from financial news with traditional market indicators. Our three-tier architecture employs base RL agents to process hybrid data, meta-agents to aggregate their decisions, and a super-agent to merge decisions based on market data and sentiment analysis. Evaluated on data from 2018 to 2024, after training on 2000-2017, the framework achieves a 26% annualized return and a Sharpe ratio of 1.2, outperforming equal-weighted and S&P 500 benchmarks. Key contributions include scalable cross-modal integration, a hierarchical RL structure for enhanced stability, and open-source reproducibility.
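
For readers unfamiliar with the reported metric, the Sharpe ratio divides mean excess return by the volatility of returns, usually annualized. The sketch below assumes daily returns, 252 trading days per year, and a zero risk-free rate by default; the abstract does not state which conventions the authors used.

```python
import math
import statistics


def annualized_sharpe(period_returns, risk_free_annual=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from per-period simple returns."""
    rf_per_period = risk_free_annual / periods_per_year
    excess = [r - rf_per_period for r in period_returns]
    mean = statistics.fmean(excess)
    std = statistics.stdev(excess)  # sample standard deviation
    return math.sqrt(periods_per_year) * mean / std


# Hypothetical daily returns of a strategy over one week.
sharpe = annualized_sharpe([0.010, -0.005, 0.007, 0.002, -0.001])
```

A strategy with the reported Sharpe ratio of 1.2 earns 1.2 units of annualized excess return per unit of annualized volatility under whichever convention was applied.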

[13] Augmented Vision-Language Models: A Systematic Review

Anthony C Davis,Burhan Sadiq,Tianmin Shu,Chien-Ming Huang

Main category: cs.CL

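TL;DR: This is a systematic review of augmented vision-language models. It examines how interaction with external symbolic information systems can improve visual-language understanding and address the limits of conventional models in interpretability, adaptability, and logical reasoning.

Motivation: Conventional vision-language models perform well after training on large unstructured datasets, but they are hard to interpret, require retraining to integrate new information, are resource-intensive, and struggle with some forms of logical reasoning. Combining them with neural-symbolic systems can strengthen reasoning and memory.

Contribution: The main contribution is a systematic categorization of techniques that augment vision-language models with external symbolic information systems, along with a summary of the advantages neural-symbolic systems offer.

Method: A systematic literature review of approaches that use pre-trained Vision-Language Models (VLMs) as the core neural component and integrate external symbolic systems.

Result: The review categorizes these techniques and describes the surveyed neural-symbolic systems as offering more interpretable explanations and the capacity to assimilate new information without extensive retraining.

Insight: Augmenting pre-trained VLMs with external symbolic systems is a pragmatic way to address the limitations of conventional vision-language models, particularly for integrating new information and for complex reasoning.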

Abstract: Recent advances in visual-language machine learning models have demonstrated exceptional ability to use natural language and understand visual scenes by training on large, unstructured datasets. However, this training paradigm cannot produce interpretable explanations for its outputs, requires retraining to integrate new information, is highly resource-intensive, and struggles with certain forms of logical reasoning. One promising solution involves integrating neural networks with external symbolic information systems, forming neural symbolic systems that can enhance reasoning and memory abilities. These neural symbolic systems provide more interpretable explanations to their outputs and the capacity to assimilate new information without extensive retraining. Utilizing powerful pre-trained Vision-Language Models (VLMs) as the core neural component, augmented by external systems, offers a pragmatic approach to realizing the benefits of neural-symbolic integration. This systematic literature review aims to categorize techniques through which visual-language understanding can be improved by interacting with external symbolic information systems.

[14] Deep Learning Approaches for Multimodal Intent Recognition: A Survey

Jingwei Zhao,Yuhua Wen,Qifei Li,Minchi Hu,Yingying Zhou,Jingyao Xue,Junyang Wu,Yingming Gao,Zhengqi Wen,Jianhua Tao,Ya Li

Main category: cs.CL

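TL;DR: The paper surveys deep learning for multimodal intent recognition (MIR), covering the shift from unimodal to multimodal techniques, datasets, methodologies, applications, and current challenges.

Motivation: As demand for natural human-computer interaction grows, intent recognition has expanded from text to multimodal data such as audio, vision, and physiological signals, with deep learning, and Transformer-based models in particular, as a key driver.

Contribution: A comprehensive summary of recent progress in deep learning for multimodal intent recognition, giving researchers a systematic guide to technical developments and future directions.

Method: A literature review that traces the evolution of methods from unimodal to multimodal intent recognition, focusing on deep learning and Transformer models.

Result: Summarizes existing techniques, datasets, and applications in multimodal intent recognition and points out current limitations and open problems.

Insight: Multimodal data fusion and Transformer models are key drivers of progress in intent recognition, while cross-modal alignment and the scarcity of labeled data remain major challenges.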

Abstract: Intent recognition aims to identify users’ underlying intentions, traditionally focusing on text in natural language processing. With growing demands for natural human-computer interaction, the field has evolved through deep learning and multimodal approaches, incorporating data from audio, vision, and physiological signals. Recently, the introduction of Transformer-based models has led to notable breakthroughs in this domain. This article surveys deep learning methods for intent recognition, covering the shift from unimodal to multimodal techniques, relevant datasets, methodologies, applications, and current challenges. It provides researchers with insights into the latest developments in multimodal intent recognition (MIR) and directions for future research.

[15] Trusted Knowledge Extraction for Operations and Maintenance Intelligence

Kathleen Mealey,Jonathan A. Karr Jr.,Priscila Saboia Moreira,Paul R. Brenner,Charles F. Vardeman II

Main category: cs.CL

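TL;DR: The paper examines the challenges of deriving operations and maintenance intelligence from organizational data, discusses knowledge graph construction, and evaluates NLP tools against large language models, focusing on trusted applications in the aircraft industry.

Motivation: Organizational data is subject to a tension between confidentiality and integration, and NLP tools perform poorly on the knowledge structures of operations and maintenance domains, which motivates research on trusted knowledge extraction.

Contribution: 1. Breaks knowledge extraction into functional components (named entity recognition, coreference resolution, named entity linking, and relation extraction); 2. evaluates the zero-shot performance of sixteen NLP tools alongside LLMs; 3. provides an open-source curated dataset for baseline testing.

Method: Decomposes the knowledge extraction process into functional components such as entity recognition and coreference resolution, and evaluates NLP tools and LLMs on a baseline dataset derived from a public US Federal Aviation Administration (FAA) dataset.

Result: Existing tools show significant performance limitations; the paper discusses the challenges of trusted NLP and LLM tools and their Technical Readiness Level.

Insight: For mission-critical industries such as aviation, the technical readiness of trusted NLP and LLM tools still needs to improve before wider use.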

Abstract: Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality vs data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. A baseline dataset is derived from a rich public domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.

[16] A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents

Sumit Soman,H. G. Ranjani,Sujoy Roychowdhury,Venkata Dharma Surya Narayana Sastry,Akshat Jain,Pranav Gangrade,Ayaaz Khan

Main category: cs.CL

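TL;DR: The paper proposes a graph-based approach to multimodal question answering (QA) over flowcharts in telecom documents. Graph representations of flowcharts, generated by Visual Large Language Models (VLMs), are incorporated into a text-based RAG system, which enables image retrieval while lowering inference-time cost.

Motivation: Answering questions about technical documents often requires information that appears in flowcharts, which conventional text-based retrieval-augmented generation (RAG) systems struggle to handle. A multimodal approach that combines images and text is therefore needed.

Contribution: 1. An end-to-end pipeline that converts flowcharts into graph representations and incorporates them into the text embedding pipeline. 2. Evidence that VLM-generated graph representations are effective for QA, with emphasis on the telecom domain. 3. Removal of the need for a VLM at inference time, which reduces deployment cost.

Method: 1. Use a VLM to classify image types and generate graph representations of flowcharts. 2. Combine the graph representations with text embedding models to build a multimodal retrieval system. 3. Validate experimentally on a QA dataset built from proprietary telecom documents.

Result: Graph representations from a fine-tuned VLM have lower edit distance to the ground truth, indicating their robustness. For QA, text embedding models combined with the graph representations achieve good retrieval performance.

Insight: Graph-structured representations capture the information in flowcharts effectively and improve QA systems, and avoiding expensive VLM inference gives a cost advantage in deployment.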

Abstract: Question-Answering (QA) from technical documents often involves questions whose answers are present in figures, such as flowcharts or flow diagrams. Text-based Retrieval Augmented Generation (RAG) systems may fail to answer such questions. We leverage graph representations of flowcharts obtained from Visual Large Language Models (VLMs) and incorporate them in a text-based RAG system to show that this approach can enable image retrieval for QA in the telecom domain. We present the end-to-end approach from processing technical documents, classifying image types, building graph representations, and incorporating them with the text embedding pipeline for efficient retrieval. We benchmark this approach on a QA dataset created based on proprietary telecom product information documents. Results show that the graph representations obtained using a fine-tuned VLM model have lower edit distance with respect to the ground truth, which illustrates the robustness of these representations for flowchart images. Further, the approach for QA using these representations gives good retrieval performance using text-based embedding models, including a telecom-domain adapted one. Our approach also alleviates the need for a VLM in inference, which is an important cost benefit for deployed QA systems.
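
To show how a flowchart graph can feed a text-only retrieval pipeline, the sketch below serializes nodes and labeled edges into sentences that a text embedding model could index. The graph schema, the sentence template, and the example flowchart are assumptions for illustration; the abstract does not specify the authors' representation format.

```python
def flowchart_to_text(nodes, edges):
    """Serialize a flowchart graph into one sentence per edge.

    nodes: {node_id: label}; edges: list of (source_id, target_id, condition).
    """
    lines = []
    for src, dst, label in edges:
        cond = f" when '{label}'" if label else ""
        lines.append(f"From '{nodes[src]}' go to '{nodes[dst]}'{cond}.")
    return "\n".join(lines)


# Hypothetical graph, as it might be extracted from a flowchart image offline.
nodes = {"a": "Start", "b": "Link up?", "c": "Raise alarm"}
edges = [("a", "b", ""), ("b", "c", "no")]
text = flowchart_to_text(nodes, edges)
```

Because the serialization is produced once at indexing time, queries can later be answered from the stored text alone, in line with the abstract's point that no VLM is needed at inference.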

[17] PARROT: An Open Multilingual Radiology Reports Dataset

Bastien Le Guellec,Kokou Adambounou,Lisa C Adams,Thibault Agripnidis,Sung Soo Ahn,Radhia Ait Chalal,Tugba Akinci D Antonoli,Philippe Amouyel,Henrik Andersson,Raphael Bentegeac,Claudio Benzoni,Antonino Andrea Blandino,Felix Busch,Elif Can,Riccardo Cau,Armando Ugo Cavallo,Christelle Chavihot,Erwin Chiquete,Renato Cuocolo,Eugen Divjak,Gordana Ivanac,Barbara Dziadkowiec Macek,Armel Elogne,Salvatore Claudio Fanni,Carlos Ferrarotti,Claudia Fossataro,Federica Fossataro,Katarzyna Fulek,Michal Fulek,Pawel Gac,Martyna Gachowska,Ignacio Garcia Juarez,Marco Gatti,Natalia Gorelik,Alexia Maria Goulianou,Aghiles Hamroun,Nicolas Herinirina,Krzysztof Kraik,Dominik Krupka,Quentin Holay,Felipe Kitamura,Michail E Klontzas,Anna Kompanowska,Rafal Kompanowski,Alexandre Lefevre,Tristan Lemke,Maximilian Lindholz,Lukas Muller,Piotr Macek,Marcus Makowski,Luigi Mannacio,Aymen Meddeb,Antonio Natale,Beatrice Nguema Edzang,Adriana Ojeda,Yae Won Park,Federica Piccione,Andrea Ponsiglione,Malgorzata Poreba,Rafal Poreba,Philipp Prucker,Jean Pierre Pruvo,Rosa Alba Pugliesi,Feno Hasina Rabemanorintsoa,Vasileios Rafailidis,Katarzyna Resler,Jan Rotkegel,Luca Saba,Ezann Siebert,Arnaldo Stanzione,Ali Fuat Tekin,Liz Toapanta Yanchapaxi,Matthaios Triantafyllou,Ekaterini Tsaoulia,Evangelia Vassalou,Federica Vernuccio,Johan Wasselius,Weilang Wang,Szymon Urban,Adrian Wlodarczak,Szymon Wlodarczak,Andrzej Wysocki,Lina Xu,Tomasz Zatonski,Shuhang Zhang,Sebastian Ziegelmayer,Gregory Kuchcinski,Keno K Bressem

Main category: cs.CL

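TL;DR: PARROT is a multilingual, open-access dataset of radiology reports for testing natural language processing (NLP) applications. It contains 2,658 fictional reports covering 13 languages and multiple imaging modalities.

Details Motivation: Radiology NLP applications lack multilingual data and are constrained by privacy restrictions. PARROT provides an open testing resource to address both problems.

Contribution: Creates PARROT, the largest open multilingual radiology report dataset, supporting NLP research across languages and geographies.

Method: Radiologists were invited to contribute fictional reports with annotated metadata, and a human vs. AI report differentiation study was conducted.

Result: The dataset contains 2,658 reports spanning multiple modalities and languages. Participants distinguished human-authored from AI-generated reports with 53.9% accuracy, and radiologists performed better than other groups.

Insight: Fictional reports can be used for NLP testing without infringing on privacy, and multilingual data supports the development of NLP applications worldwide.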

Abstract: Rationale and Objectives: To develop and validate PARROT (Polyglottal Annotated Radiology Reports for Open Testing), a large, multicentric, open-access dataset of fictional radiology reports spanning multiple languages for testing natural language processing applications in radiology. Materials and Methods: From May to September 2024, radiologists were invited to contribute fictional radiology reports following their standard reporting practices. Contributors provided at least 20 reports with associated metadata including anatomical region, imaging modality, clinical context, and for non-English reports, English translations. All reports were assigned ICD-10 codes. A human vs. AI report differentiation study was conducted with 154 participants (radiologists, healthcare professionals, and non-healthcare professionals) assessing whether reports were human-authored or AI-generated. Results: The dataset comprises 2,658 radiology reports from 76 authors across 21 countries and 13 languages. Reports cover multiple imaging modalities (CT: 36.1%, MRI: 22.8%, radiography: 19.0%, ultrasound: 16.8%) and anatomical regions, with chest (19.9%), abdomen (18.6%), head (17.3%), and pelvis (14.1%) being most prevalent. In the differentiation study, participants achieved 53.9% accuracy (95% CI: 50.7%-57.1%) in distinguishing between human and AI-generated reports, with radiologists performing significantly better (56.9%, 95% CI: 53.3%-60.6%, p<0.05) than other groups. Conclusion: PARROT represents the largest open multilingual radiology report dataset, enabling development and validation of natural language processing applications across linguistic, geographic, and clinical boundaries without privacy constraints.

[18] Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes

Rui Jiao,Yue Zhang,Jinku Li

Main category: cs.CL

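TL;DR: The RELIANCE framework combines a dedicated fact-checking classifier, reinforcement learning with multi-dimensional rewards, and analysis of model activations to substantially improve the factual accuracy of intermediate reasoning steps in large language models while maintaining or improving task performance.

Details Motivation: Large language models make factual errors in intermediate reasoning steps even when the final answer is correct. In high-stakes domains such as healthcare and law this can lead to misleading decisions, so the factual accuracy of reasoning needs to improve.

Contribution: 1. A specialized classifier that targets factual inconsistencies within reasoning chains. 2. A GRPO reinforcement learning approach that balances factuality, coherence, and structural correctness. 3. An activation analysis that reveals the mechanisms behind the factuality improvements.

Method: 1. Train the fact-checking classifier on counterfactually augmented data. 2. Optimize the model with GRPO using multi-dimensional rewards. 3. Analyze model activations to understand how the factuality improvements manifest.

Result: RELIANCE improves factual accuracy by up to 49.90% while maintaining or improving performance on benchmarks such as Math-500.

Insight: The activation analysis shows how factuality improvements reshape the model's reasoning trajectories, laying groundwork for future training methods based on activation-guided optimization.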

Abstract: We present RELIANCE (Reasoning Evaluation with Logical Integrity and Accuracy for Confidence Enhancement), a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research, where erroneous yet confidently presented reasoning can mislead users into dangerous decisions. Our framework integrates three core components: (1) a specialized fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies within reasoning chains; (2) a Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability module examining how factuality improvements manifest in model activations during reasoning processes. Extensive evaluation across ten state-of-the-art models reveals concerning patterns: even leading models like Claude-3.7 and GPT-o1 demonstrate reasoning factual accuracy of only 81.93% and 82.57% respectively. RELIANCE significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. Furthermore, our activation-level analysis provides actionable insights into how factual enhancements reshape reasoning trajectories within model architectures, establishing foundations for future training methodologies that explicitly target factual robustness through activation-guided optimization.
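
To make the reward side of the second component concrete, the sketch below combines three per-dimension scores into one scalar reward and then computes group-relative advantages in the usual GRPO style, standardizing each reward within the group of responses sampled for the same prompt. The weights, the score values, and the use of the population standard deviation are assumptions for illustration, since the abstract does not state how RELIANCE combines its rewards.

```python
from statistics import mean, pstdev

# Hypothetical weights; the abstract does not state how the rewards are combined.
WEIGHTS = {"factuality": 0.5, "coherence": 0.3, "structure": 0.2}


def combined_reward(scores):
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled reasoning chains for one prompt (scores are made up).
group = [
    {"factuality": 1.0, "coherence": 0.8, "structure": 1.0},
    {"factuality": 0.4, "coherence": 0.9, "structure": 1.0},
    {"factuality": 0.0, "coherence": 0.5, "structure": 0.5},
]
rewards = [combined_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged, and no separate value network is involved.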

[19] SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology

Paul Minchella,Loïc Verlingue,Stéphane Chrétien,Rémi Vaucher,Guillaume Metzler

Main category: cs.CL

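TL;DR: SigBERT combines medical reports with rough path signature theory in a temporal survival analysis framework that improves risk estimation by extracting text embeddings and path features.

Details Motivation: Electronic medical reports contain rich information, but existing survival analysis methods struggle to handle their complex sequential nature. SigBERT aims to address this.

Contribution: Proposes the SigBERT framework, described as the first application of rough path signature theory to sequential medical text data, which markedly improves survival risk prediction.

Method: Extract word embeddings and average them into sentence embeddings, capture temporal dynamics with rough path signatures, and feed the resulting features to a LASSO-penalized Cox model to produce risk scores.

Result: Reaches a C-index of 0.75 on a real-world oncology dataset, supporting the effectiveness of the method.

Insight: Rough path signatures capture the temporal dynamics of medical text effectively and offer a new direction for survival analysis.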

Abstract: Electronic medical reports (EHR) contain a vast amount of information that can be leveraged for machine learning applications in healthcare. However, existing survival analysis methods often struggle to effectively handle the complexity of textual data, particularly in its sequential form. Here, we propose SigBERT, an innovative temporal survival analysis framework designed to efficiently process a large number of clinical reports per patient. SigBERT processes timestamped medical reports by extracting and averaging word embeddings into sentence embeddings. To capture temporal dynamics from the time series of sentence embedding coordinates, we apply signature extraction from rough path theory to derive geometric features for each patient, which significantly enhance survival model performance by capturing complex temporal dynamics. These features are then integrated into a LASSO-penalized Cox model to estimate patient-specific risk scores. The model was trained and evaluated on a real-world oncology dataset from the Léon Bérard Center corpus, with a C-index score of 0.75 (sd 0.014) on the independent test cohort. SigBERT integrates sequential medical data to enhance risk estimation, advancing narrative-based survival analysis.
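
For readers unfamiliar with path signatures, the sketch below computes the first two signature levels of a piecewise-linear path in pure Python, using Chen's identity to append one segment at a time. It is a generic illustration of the signature transform, not the authors' pipeline; the function name and the toy 2-D path are assumptions, and in SigBERT the path would be the time-ordered sequence of sentence embedding coordinates.

```python
def signature_level2(path):
    """Depth-2 signature of a piecewise-linear path.

    path: list of points, each a list of d floats (for example, embedding
    coordinates ordered by report timestamp). Returns (level1, level2), where
    level1 is the total increment and level2[i][j] is the iterated integral
    of dx_i dx_j along the path.
    """
    d = len(path[0])
    level1 = [0.0] * d
    level2 = [[0.0] * d for _ in range(d)]
    for prev, curr in zip(path, path[1:]):
        delta = [c - p for p, c in zip(prev, curr)]
        # Chen's identity for appending one linear segment.
        for i in range(d):
            for j in range(d):
                level2[i][j] += level1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            level1[i] += delta[i]
    return level1, level2


# Toy 2-D path: one unit to the right, then one unit up.
s1, s2 = signature_level2([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
```

The off-diagonal terms of the second level depend on the order in which the coordinates move, which is what lets signature features encode temporal structure that a simple average of embeddings would lose. The flattened levels could then serve as covariates in a penalized Cox model.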

[20] A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies

Shirley V Wang,Georg Hahn,Sushama Kattinakere Sreedhara,Mufaddal Mahesri,Haritha S. Pillai,Rajendra Aldis,Joyce Lii,Sarah K. Dutcher,Rhoda Eniafe,Jamal T. Jones,Keewan Kim,Jiwei He,Hana Lee,Sengwee Toh,Rishi J Desai,Jie Yang

Main category: cs.CL

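TL;DR: This paper proposes a process that uses natural language processing (NLP) and multi-wave adaptive sampling to speed up validation of code-based algorithms, reducing the manual annotation time required in large database studies.

Details Motivation: Manual annotation of electronic health records (EHR) takes substantial time and resources, which limits how efficiently code-based algorithms can be validated in large database studies.

Contribution: 1) Uses NLP to reduce manual annotation time. 2) Applies multi-wave adaptive sampling with a pre-defined stopping rule, which sharply reduces the number of cases that need to be annotated.

Method: 1) NLP-assisted annotation. 2) Multi-wave adaptive sampling with a pre-defined stopping rule, so that annotation stops once performance metrics are estimated with sufficient precision.

Result: NLP-assisted annotation reduced review time per chart by 40%, and the stopping rule would have avoided review of 77% of patient charts with limited loss of precision in the performance metrics.

Insight: The process makes validation of code-based algorithms considerably more efficient and offers a practical tool for assessing reliability in large database studies.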

Abstract: Background: One of the ways to enhance analyses conducted with large claims databases is by validating the measurement characteristics of code-based algorithms used to identify health outcomes or other key study parameters of interest. These metrics can be used in quantitative bias analyses to assess the robustness of results for an inferential study given potential bias from outcome misclassification. However, extensive time and resource allocation are typically required to create reference-standard labels through manual chart review of free-text notes from linked electronic health records. Methods: We describe an expedited process that introduces efficiency in a validation study using two distinct mechanisms: 1) use of natural language processing (NLP) to reduce time spent by human reviewers to review each chart, and 2) a multi-wave adaptive sampling approach with pre-defined criteria to stop the validation study once performance characteristics are identified with sufficient precision. We illustrate this process in a case study that validates the performance of a claims-based outcome algorithm for intentional self-harm in patients with obesity. Results: We empirically demonstrate that the NLP-assisted annotation process reduced the time spent on review per chart by 40% and use of the pre-defined stopping rule with multi-wave samples would have prevented review of 77% of patient charts with limited compromise to precision in derived measurement characteristics. Conclusion: This approach could facilitate more routine validation of code-based algorithms used to define key study parameters, ultimately enhancing understanding of the reliability of findings derived from database studies.
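
The second mechanism can be pictured as a loop that reviews charts in waves and stops as soon as the estimate is precise enough. The sketch below is one possible reading under stated assumptions: it tracks positive predictive value (PPV), measures precision by the width of a Wilson score interval, and stops at a hypothetical width threshold. The abstract does not specify the metric, the interval type, the wave size, or the threshold, so all of these are illustrative choices.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def review_in_waves(labels, wave_size, max_width):
    """Review charts wave by wave; stop once the PPV interval is narrow enough.

    labels: reviewer verdicts (True = algorithm-flagged case confirmed),
            in the order charts would be sampled.
    max_width: hypothetical stopping threshold on the interval width.
    """
    reviewed = 0
    confirmed = 0
    while reviewed < len(labels):
        wave = labels[reviewed:reviewed + wave_size]
        reviewed += len(wave)
        confirmed += sum(wave)
        low, high = wilson_interval(confirmed, reviewed)
        if high - low <= max_width:
            break
    return reviewed, confirmed / reviewed, (low, high)


# Made-up verdicts: 9 of every 10 flagged charts are confirmed.
labels = ([True] * 9 + [False]) * 40
reviewed, ppv, interval = review_in_waves(labels, wave_size=50, max_width=0.10)
```

With these toy inputs the loop ends well before all 400 charts are read, which is the kind of saving the stopping rule is meant to deliver.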

[21] Opacity as Authority: Arbitrariness and the Preclusion of Contestation

Naomi Omeonga wa Kayembe

Main category: cs.CL

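TL;DR: This article redefines arbitrariness not as a normative flaw or a symptom of domination, but as a foundational functional mechanism that structures human systems and interactions.

Details Motivation: Existing critical traditions conflate arbitrariness with injustice. This article instead treats it as a semiotic trait and shows its functional role in linguistic, legal, and social systems.

Contribution: Proposes the "Motivation -> Constatability -> Contestability" chain, formalizes arbitrariness as the conditional entropy A = H(L|M), and discusses how the framework applies to explainability in artificial intelligence systems.

Method: Builds on Saussure's theory of the arbitrariness of the sign, extends it to the analysis of law and social dynamics, and introduces an entropy model to quantify arbitrariness.

Result: Shows that arbitrariness, as structural opacity, follows a design logic that shields authority from accountability, and offers a new perspective for AI explainability research.

Insight: Arbitrariness is a neutral operator used both to maintain authority and to provide interpersonal care, a finding that opens a path for cross-domain analysis of systems.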

Abstract: This article redefines arbitrariness not as a normative flaw or a symptom of domination, but as a foundational functional mechanism structuring human systems and interactions. Diverging from critical traditions that conflate arbitrariness with injustice, it posits arbitrariness as a semiotic trait: a property enabling systems - linguistic, legal, or social - to operate effectively while withholding their internal rationale. Building on Ferdinand de Saussure’s concept of l’arbitraire du signe, the analysis extends this principle beyond language to demonstrate its cross-domain applicability, particularly in law and social dynamics. The paper introduces the “Motivation -> Constatability -> Contestability” chain, arguing that motivation functions as a crucial interface rendering an act’s logic vulnerable to intersubjective contestation. When this chain is broken through mechanisms like “immotivization” or “Conflict Lateralization” (exemplified by “the blur of the wolf drowned in the fish”), acts produce binding effects without exposing their rationale, thus precluding justiciability. This structural opacity, while appearing illogical, is a deliberate design protecting authority from accountability. Drawing on Shannon’s entropy model, the paper formalizes arbitrariness as A = H(L|M) (conditional entropy). It thereby proposes a modern theory of arbitrariness as a neutral operator central to control as well as care, an overlooked dimension of interpersonal relations. While primarily developed through human social systems, this framework also illuminates a new pathway for analyzing explainability in advanced artificial intelligence systems.
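
The formalization A = H(L|M) is an ordinary conditional entropy, so it can be computed from any joint distribution over two discrete variables. The sketch below does this in bits. The abstract does not spell out what L and M stand for, so they are treated here as generic variables, and both toy distributions are made up for illustration.

```python
import math


def conditional_entropy(joint):
    """H(L|M) in bits from a joint distribution {(l, m): probability}."""
    marginal_m = {}
    for (_, m), p in joint.items():
        marginal_m[m] = marginal_m.get(m, 0.0) + p
    h = 0.0
    for (_, m), p in joint.items():
        if p > 0:
            h -= p * math.log2(p / marginal_m[m])
    return h


# M fully determines L: nothing about L is left unexplained.
determined = {("a", "m1"): 0.5, ("b", "m2"): 0.5}

# L is independent of M and uniform over two outcomes.
independent = {
    ("a", "m1"): 0.25, ("b", "m1"): 0.25,
    ("a", "m2"): 0.25, ("b", "m2"): 0.25,
}

a_determined = conditional_entropy(determined)
a_independent = conditional_entropy(independent)
```

On this reading, A is zero when M fully accounts for L and grows as M explains less of L.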

[22] ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans

Ananya Sadana,Yash Kumar Lal,Jiawei Zhou

Main category: cs.CL

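TL;DR: ISO-Bench is a new benchmark for evaluating whether multimodal models can reason causally between vision and text. Current frontier vision-language models perform poorly on it: the best zero-shot F1 is only 0.57, far below human performance (0.98).

Details Motivation: Understanding causal relationships across modalities is a core challenge for multimodal models in real-world environments. This capability has not been adequately evaluated, so a dedicated benchmark is needed.

Contribution: Introduces ISO-Bench, a benchmark focused on visual-textual causal reasoning in multimodal models, together with an analysis that exposes current shortcomings and directions for improvement.

Method: Each ISO-Bench example presents an image of a task step and a text snippet from a plan, and the model must decide whether the visual step occurs before or after the text step. Ten frontier vision-language models are evaluated with zero-shot and chain-of-thought prompting.

Result: Current models perform poorly: the best zero-shot F1 is 0.57, and chain-of-thought reasoning raises it only to 0.62, far below the human score of 0.98.

Insight: Multimodal models still have substantial room to improve in cross-modal causal reasoning. Future work could focus on integrating visual and textual information more effectively.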

Abstract: Understanding causal relationships across modalities is a core challenge for multimodal models operating in real-world environments. We introduce ISO-Bench, a benchmark for evaluating whether models can infer causal dependencies between visual observations and procedural text. Each example presents an image of a task step and a text snippet from a plan, with the goal of deciding whether the visual step occurs before or after the referenced text step. Evaluation results on ten frontier vision-language models show underwhelming performance: the best zero-shot F1 is only 0.57, and chain-of-thought reasoning yields only modest gains (up to 0.62 F1), largely behind humans (0.98 F1). Our analysis further highlights concrete directions for improving causal understanding in multimodal models.

[23] Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

Jianghui Wang,Vinay Joshi,Saptarshi Majumder,Xu Chao,Bin Ding,Ziqiong Liu,Pratik Prabhanjan Brahma,Dong Li,Zicheng Liu,Emad Barsoum

Main category: cs.CL

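TL;DR: The paper introduces GEAK, a framework that uses frontier large language models (LLMs) to generate high-performance Triton code for AMD GPUs and achieves substantial gains through inference-time compute scaling.

Details Motivation: As deep learning workloads grow in complexity and diversity, low-level kernel development must be automated to meet performance and productivity demands. AI-driven GPU code generation has become a focus for both industry and academia.

Contribution: Proposes GEAK, which combines inference-time compute scaling with a Reflexion-style feedback mechanism to generate high-performance Triton code for AMD GPUs, and clearly outperforms the baselines.

Method: GEAK uses LLMs and a Reflexion-style feedback loop with inference-time compute scaling to generate Triton code, targeting the AMD MI300X and MI250 GPUs.

Result: GEAK reaches up to 63% correctness and an execution speedup of up to 2.59x, clearly outperforming baselines that prompt LLMs directly or use Reflexion-based pipelines.

Insight: GEAK shows the potential of agentic code generation to accelerate adoption of diverse hardware platforms and to improve kernel performance.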

Abstract: The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs, aiming to reduce manual optimization efforts while achieving near-expert performance on hardware like AMD MI300X. The Triton language, a Python-based DSL for GPU programming, has emerged as a popular target for such AI-generated kernels due to its balance of performance and ease-of-coding. In this work, we present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels), a framework that leverages cutting-edge LLMs to generate performant Triton code specifically for AMD GPUs, including the AMD MI300X and MI250. GEAK leverages inference-time compute scaling to produce Triton-based GPU kernels using a reasoning loop adapted from Reflexion-style feedback mechanisms. On two evaluation benchmarks, GEAK significantly outperformed the baselines of directly prompting frontier LLMs as well as Reflexion-based generation pipelines by achieving correctness up to 63% and execution speed up of up to 2.59x. These results highlight the promise of GEAK-like agentic code generation for accelerating the adoption of diverse hardware platforms and democratizing access to expert-level kernel performance.
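
The reasoning loop described above follows a generate, evaluate, and reflect pattern. The sketch below shows that control flow in isolation. The two callables stand in for an LLM call and a kernel test harness; their names, signatures, and the stub behaviour are hypothetical, and nothing here reflects GEAK's actual prompts or agents.

```python
def reflexion_loop(generate, evaluate, max_rounds=3):
    """Generate-evaluate-reflect loop.

    generate(feedback_history) -> candidate source code (str)
    evaluate(candidate) -> (passed: bool, feedback: str)
    Both callables are placeholders for an LLM call and a kernel test harness.
    """
    feedback_history = []
    candidate = None
    for round_index in range(1, max_rounds + 1):
        candidate = generate(feedback_history)
        passed, feedback = evaluate(candidate)
        if passed:
            return candidate, round_index
        # Keep the failure report so the next attempt can condition on it.
        feedback_history.append(feedback)
    return candidate, max_rounds


# Stand-ins so the sketch runs without an LLM or a GPU.
def fake_generate(feedback_history):
    return "v%d" % len(feedback_history)


def fake_evaluate(candidate):
    return candidate == "v2", "output mismatch in " + candidate


result, rounds_used = reflexion_loop(fake_generate, fake_evaluate, max_rounds=5)
```

Raising `max_rounds` is one simple way to spend more inference-time compute per kernel: each extra round gives the generator another attempt informed by earlier failures.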

[24] Enabling Few-Shot Alzheimer’s Disease Diagnosis on Tabular Biomarker Data with LLMs

Sophie Kearney,Shu Yang,Zixuan Wen,Bojian Hou,Duy Duong-Tran,Tianlong Chen,Jason Moore,Marylyn Ritchie,Li Shen

Main category: cs.CL

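TL;DR: This paper proposes TAP-GPT, a framework that uses the TableGPT2 model with few-shot learning to diagnose Alzheimer's disease (AD) from structured biomarker data.

Details Motivation: Early diagnosis of Alzheimer's disease depends on analysis of complex biomarkers. With multimodal integration and few-shot reasoning, LLMs offer a new route to this problem.

Contribution: Proposes the TAP-GPT framework, described as the first application of LLMs to prediction from tabular biomarker data, which outperforms other advanced models in small-sample settings.

Method: Adapts TableGPT2 to AD diagnosis by combining in-context learning with parameter-efficient qLoRA fine-tuning.

Result: TAP-GPT outperforms general-purpose LLMs and a dedicated tabular foundation model (TFM) on the AD diagnosis task.

Insight: Demonstrates the potential of LLMs for structured biomedical data analysis and paves the way for future multi-agent frameworks.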

Abstract: Early and accurate diagnosis of Alzheimer’s disease (AD), a complex neurodegenerative disorder, requires analysis of heterogeneous biomarkers (e.g., neuroimaging, genetic risk factors, cognitive tests, and cerebrospinal fluid proteins) typically represented in a tabular format. With flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. We propose a novel framework called TAP-GPT, Tabular Alzheimer’s Prediction GPT, that adapts TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, for AD diagnosis using structured biomarker data with small sample sizes. Our approach constructs few-shot tabular prompts using in-context learning examples from structured biomedical data and finetunes TableGPT2 using the parameter-efficient qLoRA adaption for a clinical binary classification task of AD or cognitively normal (CN). The TAP-GPT framework harnesses the powerful tabular understanding ability of TableGPT2 and the encoded prior knowledge of LLMs to outperform more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks. To our knowledge, this is the first application of LLMs to the prediction task using tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics.
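
A few-shot tabular prompt of the kind described can be assembled by rendering a handful of labeled rows and one unlabeled row as a text table. The sketch below is a minimal version of that step only. The column names, the biomarker values, the table layout, and the instruction wording are all invented for illustration and are not taken from TAP-GPT.

```python
def build_few_shot_prompt(examples, query, label_key="diagnosis"):
    """Render labeled rows plus one unlabeled row as a table-style prompt.

    The column names and the instruction wording are illustrative only.
    """
    columns = [c for c in examples[0] if c != label_key] + [label_key]
    header = "| " + " | ".join(columns) + " |"
    rule = "|" + "---|" * len(columns)
    rows = []
    for row in examples + [dict(query, **{label_key: "?"})]:
        rows.append("| " + " | ".join(str(row[c]) for c in columns) + " |")
    instruction = "Fill in the missing diagnosis (AD or CN) for the last row."
    return "\n".join([instruction, header, rule] + rows)


# Made-up biomarker values.
examples = [
    {"age": 74, "mmse": 21, "csf_abeta42": 480, "diagnosis": "AD"},
    {"age": 69, "mmse": 29, "csf_abeta42": 1100, "diagnosis": "CN"},
]
query = {"age": 77, "mmse": 23, "csf_abeta42": 510}
prompt = build_few_shot_prompt(examples, query)
```

The resulting string would be passed to the tabular LLM, either as is for in-context prediction or as a training input during parameter-efficient fine-tuning.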

[25] P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication

Sneha Oram,Pushpak Bhattacharyya

Main category: cs.CL

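TL;DR: This paper studies the pragmatic reasoning ability of large language models (LLMs) in the mental health domain. It introduces the P-ReMe dataset and proposes modified definitions of the pragmatic phenomena of implicature and presupposition. Experiments show that Mistral and Qwen perform well in this domain. The paper also studies how LLMs handle mental health stigma and finds that Claude-3.5-haiku behaves more responsibly.

Details Motivation: Personalized chatbots and explainability techniques for mental health are developing quickly, but pragmatic reasoning and reasoning over dialogue discourse have not been adequately studied. The paper aims to fill this gap.

Contribution: Introduces the P-ReMe dataset and definitions of pragmatic reasoning tasks for mental health, formulates two implicature tasks and one presupposition task, evaluates several LLMs, and studies how they handle mental health stigma.

Method: Evaluates Llama3.1, Mistral, MentaLLaMa, and Qwen on the defined tasks and dataset, and uses StiPRompts to study how LLMs handle stigma.

Result: Mistral and Qwen stand out on the pragmatic reasoning tasks. Claude-3.5-haiku handles mental health stigma more responsibly than the other models tested.

Insight: LLMs show some pragmatic reasoning ability in the mental health domain, but performance varies considerably across models. Handling stigma calls for more socially responsible model design.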

Abstract: There has been an increase in recent advancements in the explainability and development of personalized chatbots for mental health. However, the reasoning aspects for explainability and dialogue discourse have not been explored previously for mental health. Hence, we are investigating the pragmatic reasoning capability of large language models (LLMs) in this domain. We introduce P-ReMe dataset, and propose a modified definition for the pragmatic phenomena of implicature (implied meaning) and presupposition (implicit assumption) in mental health. Following the definition, we formulate two tasks in implicature and one task in presupposition. To benchmark the dataset and the presented tasks, we consider four models - Llama3.1, Mistral, MentaLLaMa, and Qwen. The results of the experiments suggest that Mistral and Qwen show substantial reasoning capabilities in the domain. In addition, we also propose StiPRompts to study the stigma around mental health with the state-of-the-art LLMs, GPT-4o mini, Deepseek-chat, and Claude-3.5-haiku. Our evaluated findings show that Claude-3.5-haiku deals with the stigma more responsibly compared to the other two LLMs.

[26] Unveiling Super Experts in Mixture-of-Experts Large Language Models

Zunhai Su,Qingyuan Li,Hao Zhang,YuLei Qian,Yuchen Xie,Kehong Yuan

Main category: cs.CL

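TL;DR: This paper identifies and studies a class of critical experts in MoE large language models, called Super Experts (SEs), and shows their important role in model inference and their large effect on performance.

Details Motivation: Existing expert-level compression techniques for MoE LLMs mostly rely on empirical criteria and lack a deeper understanding of the heterogeneous importance of experts. This study sets out to find and verify critical experts in model inference and to explain how they work.

Contribution: Presents the first discovery and study of Super Experts (SEs) in MoE LLMs and their key role in inference, characterizes their activation outliers and their large effect on task performance, and confirms their importance for the allocation of attention.

Method: Systematically explores the properties and influence of SEs by analyzing expert activation outliers, pruning SEs to measure the effect on performance, and studying their effect on the attention mechanism.

Result: Pruning SEs causes a significant drop in model performance (for example, mathematical reasoning is impaired) and disrupts the attention distribution. SEs are critical to task performance.

Insight: MoE LLMs depend on SEs for key mechanisms such as attention allocation, and the heterogeneity of experts offers a new perspective for model compression and optimization.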

Abstract: Sparsely activated Mixture-of-Experts (MoE) models have shown promise in enhancing the learning capacity of large language models (LLMs). Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to improve the efficiency of MoE LLMs. However, existing approaches often rely on empirical criteria to identify critical experts, lacking a deeper exploration and understanding of the heterogeneous importance of experts. In this study, we present the first discovery and investigation of a distinct subset of experts that play a crucial role in the underlying mechanisms during the model’s forward inference. These experts are prevalent in open-source MoE LLMs, and despite their limited number, pruning them leads to a significant decline in model performance (e.g., pruning three causes Qwen3-30B-A3B to produce repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs. (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs remains model-specific and is unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model’s overall performance, particularly in mathematical reasoning. (iii) We further enhance our understanding of the influence of SEs compression. Our findings confirm that MoE LLMs rely on SEs to induce attention sinks, which are crucial for the distribution of attention scores but are significantly disrupted by SE pruning. The code is available at https://github.com/ZunhaiSu/Super-Experts-Profilling.

[27] What’s Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content

Alfio Ferrara,Sergio Picascia,Laura Pinnavaia,Vojimir Ranitovic,Elisabetta Rocchetti,Alice Tuveri

Main category: cs.CL

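TL;DR: This paper empirically analyzes the implicit filtering behavior of GPT-4o-mini when paraphrasing sensitive content, finds that it systematically tones sensitive content down, and evaluates how well LLMs classify sentence sensitivity in a zero-shot setting.

Details Motivation: Prior research has focused on explicitly training models to filter sensitive content, but there has been little exploration of whether LLMs implicitly sanitize language without explicit instructions. This paper aims to fill that gap.

Contribution: Empirically evaluates GPT-4o-mini's implicit filtering of sensitive content and finds substantial reductions in derogatory and taboo language. It also tests the zero-shot ability of LLMs to classify sentence sensitivity.

Method: Analyzes GPT-4o-mini's behavior when paraphrasing sensitive content and quantifies the reduction in sensitivity, then compares LLMs with traditional methods on sentence sensitivity classification.

Result: GPT-4o-mini systematically tones down sensitive content, with substantial reductions in derogatory and taboo language. LLMs outperform traditional methods on the zero-shot classification task.

Insight: LLMs can implicitly filter sensitive content without explicit training, showing a form of "self-censorship" that suggests new directions for content moderation technology.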

Abstract: Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. Also, we evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performances against traditional methods.

[28] Text-to-SQL Task-oriented Dialogue Ontology Construction

Renato Vukovic,Carel van Niekerk,Michael Heck,Benjamin Ruppik,Hsien-Chin Lin,Shutong Feng,Nurul Lubis,Milica Gasic

Main category: cs.CL

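TL;DR: The paper proposes TeQoDO, a method that uses the SQL programming ability of large language models to build task-oriented dialogue ontologies without supervision, improving explainability and controllability.

Details Motivation: Existing methods rely on manual labels or supervised training to build ontologies, which limits scalability and efficiency. The parametric knowledge of large language models lacks explainability and trustworthiness and needs to be combined with an external database structure.

Contribution: Proposes TeQoDO, described as the first method to build task-oriented dialogue ontologies without supervision. It combines SQL ability with dialogue theory, outperforms transfer learning, and its ontology is shown to be useful on a downstream task.

Method: Uses the SQL programming ability of an LLM, with dialogue theory injected through the prompt, to build the ontology autonomously. No labels or supervision are required, and large-scale ontology generation is supported.

Result: Performs well on a dialogue state tracking task, and further experiments on Wikipedia and ArXiv datasets demonstrate its scalability.

Insight: Dialogue theory in the prompt design is crucial for ontology construction and offers a new route to improving LLM explainability.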

Abstract: Large language models (LLMs) are widely used as general-purpose knowledge sources, but they rely on parametric knowledge, limiting explainability and trustworthiness. In task-oriented dialogue (TOD) systems, this separation is explicit, using an external database structured by an explicit ontology to ensure explainability and controllability. However, building such ontologies requires manual labels or supervised training. We introduce TeQoDO: a Text-to-SQL task-oriented Dialogue Ontology construction method. Here, an LLM autonomously builds a TOD ontology from scratch without supervision using its inherent SQL programming capabilities combined with dialogue theory provided in the prompt. We show that TeQoDO outperforms transfer learning approaches, and its constructed ontology is competitive on a downstream dialogue state tracking task. Ablation studies demonstrate the key role of dialogue theory. TeQoDO also scales to allow construction of much larger ontologies, which we investigate on a Wikipedia and ArXiv dataset. We view this as a step towards broader application of ontologies to increase LLM explainability.
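
The general idea of using SQL as the medium for ontology construction can be sketched with Python's built-in `sqlite3` module. In the example below, a hand-written SQL script stands in for statements an LLM might emit after reading a dialogue turn, and the ontology is then read back from the database schema. The mapping of tables to domains, columns to slots, and cell contents to values is an assumption made for this illustration, as are the table name and the example utterance; it is not TeQoDO's actual prompt or schema.

```python
import sqlite3

# Hand-written SQL standing in for statements an LLM might emit after
# reading the user turn "I need a cheap hotel in the north".
llm_sql = """
CREATE TABLE IF NOT EXISTS hotel (price_range TEXT, area TEXT);
INSERT INTO hotel (price_range, area) VALUES ('cheap', 'north');
"""

conn = sqlite3.connect(":memory:")
conn.executescript(llm_sql)

# Read the ontology back: tables act as domains, columns as slots,
# and the stored cell contents as slot values (an assumed mapping).
ontology = {}
tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
for (table,) in tables:
    slots = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    ontology[table] = {
        slot: [r[0] for r in conn.execute(f"SELECT DISTINCT {slot} FROM {table}")]
        for slot in slots
    }
conn.close()
```

Keeping the ontology in a database makes every domain, slot, and value inspectable and editable, which is the explainability benefit the abstract attributes to explicit ontologies.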

[29] MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models

Yiyan Ji,Haoran Chen,Qiguang Chen,Chengyue Wu,Libo Qin,Wanxiang Che

Main category: cs.CL

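TL;DR: This paper introduces MPCC, the first benchmark to systematically evaluate how multimodal large language models plan under complex constraints. Experiments show that existing models perform poorly under a range of constraints, underscoring the importance of constraint-aware reasoning.

Details Motivation: Current benchmarks cannot directly assess multimodal planning ability and lack complex constraints across modalities. MPCC is designed to address both problems and advance research on multimodal planning.

Contribution: 1. MPCC, the first benchmark for evaluating multimodal planning ability. 2. Complex constraints (budget, temporal, spatial) with graded difficulty levels. 3. Evidence of the limitations of existing models under constraints.

Method: Designs three real-world tasks (flight, calendar, and meeting planning) and introduces complex constraints at graded levels (EASY/MEDIUM/HARD). Experiments cover 13 advanced MLLMs.

Result: Closed-source models produce only 21.3% feasible plans, and open-source models average below 11%. Models are sensitive to constraint complexity, and traditional multimodal prompting strategies fail in multi-constraint scenarios.

Insight: Real-world applications require better constraint-aware reasoning in MLLMs, and MPCC provides a standardized evaluation framework for multimodal planning research.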

Abstract: Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision-making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real-world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs’ ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real-world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To solve the second challenge, we introduce complex constraints (e.g. budget, temporal, and spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to separate constraint complexity from search space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. Additionally, we observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi-constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advancements in constraint-aware reasoning for real-world MLLM applications.

[30] Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models

Ailiang Lin,Zhuoyun Li,Kotaro Funakoshi

Main category: cs.CL

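TL;DR: Causal2Vec is an improved embedding method for decoder-only large language models (LLMs). It pre-encodes contextual information and refines hidden-state pooling, which markedly improves semantic embedding quality while lowering computational overhead.

Details Motivation: Existing methods either remove the causal attention mask or depend on extra input text, which can sacrifice semantic extraction ability or increase compute cost. A solution is needed that stays efficient while improving embedding performance.

Contribution: Proposes Causal2Vec, which pre-encodes contextual information with a lightweight BERT-style model and refines the pooling strategy. It improves embedding performance without changing the LLM architecture and greatly reduces sequence length and inference time.

Method: A BERT-style model pre-encodes the input text into a single Contextual token, which is prepended to the LLM's input sequence. The hidden states of the Contextual and EOS tokens are concatenated to form the final embedding, which mitigates recency bias.

Result: Achieves state-of-the-art performance on the MTEB benchmark among models trained only on publicly available retrieval datasets, while reducing sequence length by up to 85% and inference time by up to 82% relative to the best-performing methods.

Insight: Lightweight pre-encoding and a refined pooling strategy can substantially improve the embedding ability of decoder-only LLMs without sacrificing efficiency or adding computational burden.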

Abstract: Decoder-only large language models (LLMs) are increasingly used to build embedding models that effectively encode the semantic information of natural language texts into dense vector representations for various embedding tasks. However, many existing methods primarily focus on removing the causal attention mask in LLMs to enable bidirectional attention, potentially undermining the model’s ability to extract semantic information acquired during pretraining. Additionally, leading unidirectional approaches often rely on extra input text to overcome the inherent limitations of causal attention, inevitably increasing computational costs. In this work, we propose Causal2Vec, a general-purpose embedding model tailored to enhance the performance of decoder-only LLMs without altering their original architectures or introducing significant computational overhead. Specifically, we first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token, which is then prepended to the LLM’s input sequence, allowing each token to capture contextualized information even without attending to future tokens. Furthermore, to mitigate the recency bias introduced by last-token pooling and help LLMs better leverage the semantic information encoded in the Contextual token, we concatenate the last hidden states of Contextual and EOS tokens as the final text embedding. In practice, Causal2Vec achieves state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB) among models trained solely on publicly available retrieval datasets, while reducing the required sequence length by up to 85% and inference time by up to 82% compared to best-performing methods.
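
The pooling step described in the abstract amounts to concatenating two final-layer vectors. The sketch below shows that operation on plain Python lists. The function name, the assumption that the Contextual token sits at position 0, and the toy hidden states are illustrative only; in a real model any instruction or BOS tokens placed before the Contextual token would shift its index.

```python
def causal2vec_pool(hidden_states, contextual_index=0):
    """Concatenate the last-layer states of the Contextual and EOS tokens.

    hidden_states: list of per-token vectors from the final layer, with the
    Contextual token at position `contextual_index` and EOS assumed last.
    """
    return list(hidden_states[contextual_index]) + list(hidden_states[-1])


# Toy final-layer states for [Contextual, tok1, tok2, EOS], hidden size 2.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
embedding = causal2vec_pool(states)
```

The resulting embedding has twice the hidden size. Compared with last-token pooling alone, it also carries the state of a token that summarizes the whole input, which is how the method aims to reduce recency bias.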

[31] Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration

Ante Wang,Yujie Lin,Jingyao Liu,Suhang Wu,Hao Liu,Xinyan Xiao,Jinsong Su

Main category: cs.CL

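TL;DR: The paper proposes a new paradigm called "proactive critical thinking", in which AI models actively ask users for missing or clarifying information in order to solve problems better. The authors build two new benchmarks, GSM-MC and GSM-MCE, and show that reinforcement learning substantially improves model performance on this task.

Details Motivation: Prior work on critical thinking mainly concerns passively rejecting problematic queries and overlooks the model's ability to take constructive steps toward solving the problem. The authors propose proactive critical thinking to enable more effective human-AI collaboration.

Contribution: 1. The proactive critical thinking paradigm. 2. Two new benchmarks, GSM-MC and GSM-MCE. 3. Substantial gains on proactive critical thinking tasks through reinforcement learning.

Method: 1. Build the GSM-MC and GSM-MCE benchmarks on top of GSM8K. 2. Evaluate Qwen3 and Llama series models. 3. Use reinforcement learning to improve the models' ability to ask proactive questions.

Result: Reinforcement learning substantially improves accuracy on GSM-MC. For example, the accuracy of Qwen3-1.7B rises from 0.15% to 73.98%.

Insight: Proactive critical thinking is a key direction for improving collaboration between AI and humans, and reinforcement learning shows great potential on this task.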

Abstract: Critical thinking is essential for building robust AI systems, preventing them from blindly accepting flawed data or biased reasoning. However, prior work has primarily focused on passive critical thinking, where models simply reject problematic queries without taking constructive steps to address user requests. In this work, we introduce proactive critical thinking, a paradigm where models actively seek missing or clarifying information from users to resolve their queries better. To evaluate this capability, we present GSM-MC and GSM-MCE, two novel benchmarks based on GSM8K for assessing mathematical reasoning under incomplete or misleading conditions. GSM-MC contains 1,368 math problems with a key variable deliberately removed, requiring models to identify and request the missing information. GSM-MCE further increases the difficulty by introducing irrelevant details to test robustness against distractions. Experiments on Qwen3 and Llama series models show that, while these models excel in traditional reasoning tasks due to extensive post-training and inference-time scaling, they struggle with proactive critical thinking, especially smaller ones. However, we demonstrate that reinforcement learning (RL) can significantly improve this ability. Using our enhanced RL algorithm, we achieve substantial gains, boosting the Qwen3-1.7B’s accuracy from 0.15% to 73.98% on GSM-MC. We hope this work advances models that collaborate more effectively with users in problem-solving through proactive critical thinking.

[32] Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

Saeed Almheiri,Yerulan Kongrat,Adrian Santosh,Ruslan Tasmukhanov,Josemaria Vera,Muhammad Dehan Al Kautsar,Fajri Koto

Main category: cs.CL

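TL;DR: The paper studies fine-tuning large language models (LLMs) to enforce access control based on user roles in organizational settings. It proposes three modeling strategies and builds two datasets to evaluate model performance and robustness to security threats.

Details Motivation: Existing LLM safety methods typically assume uniform access privileges and do not account for role-specific access constraints. In organizations, role-based access control requires model behavior that is both secure and sensitive to context.

Contribution: Proposes fine-tuning LLMs for role-aware access control, explores three modeling strategies (a BERT-based classifier, an LLM-based classifier, and role-conditioned generation), and builds two datasets for evaluation.

Method: Three modeling strategies are used: 1) a BERT classifier that identifies the user role; 2) an LLM classifier that directly predicts access privileges; 3) a role-conditioned generation model that adapts its output to the user role. Experiments use two datasets: one adapted from instruction-tuning data and one synthesized to reflect enterprise scenarios.

Result: Evaluates the models across different organizational structures and security threats (prompt injection, role mismatch, and jailbreak attacks) and analyzes the strengths and weaknesses of each strategy.

Insight: Fine-tuned LLMs can support role-based access control, but robustness and generalization still need improvement. Role-conditioned generation is the more flexible approach, though it may need stronger defenses on the security side.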

Abstract: As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles. We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate these approaches, we construct two complementary datasets. The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios. We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts.

[33] A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

Shirui Wang,Zhihui Tang,Huaxia Yang,Qiuhong Gong,Tiantian Gu,Hongyang Ma,Yongxin Wang,Wubin Sun,Zeliang Lian,Kehang Mao,Yinan Jiang,Zhicheng Huang,Lingyun Ma,Wenjie Shen,Yajie Ji,Yunhui Tan,Chunbo Wang,Yunlu Gao,Qianling Ye,Rui Lin,Mingyu Chen,Lijuan Niu,Zhihao Wang,Peng Yu,Mengran Lang,Yue Liu,Huimin Zhang,Haitao Shen,Long Chen,Qiguang Zhao,Si-Xuan Liu,Lina Zhou,Hua Gao,Dongqiang Ye,Lingmin Meng,Youtao Yu,Naixin Liang,Jianxiong Wu

Main category: cs.CL

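TL;DR: The paper proposes the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB) for evaluating the safety and effectiveness of medical large language models (LLMs). It defines 30 criteria developed through clinical expert consensus, and tests show that domain-specific medical LLMs outperform general-purpose models on both safety and effectiveness.

Details Motivation: Although LLMs hold promise for clinical decision support, evaluating their safety and effectiveness remains a major challenge, and standardized benchmarks are lacking.

Contribution: Proposes the CSEDB benchmark, with 30 criteria covering critical clinical areas and 2,069 Q&A items developed by expert consensus, providing a standardized tool for evaluating medical LLMs.

Method: Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items that simulate real-world scenarios; six LLMs were then tested for safety and effectiveness.

Result: The tested LLMs reached an average total score of 57.2%, with 54.7% on safety and 62.3% on effectiveness. Performance dropped by 13.3% in high-risk scenarios, and domain-specific models performed better.

Insight: Domain-specific medical LLMs behave more consistently in clinical use, and CSEDB gives a basis for identifying risks and directions for improvement when deploying medical LLMs.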

Abstract: Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.

[34] Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning

Keer Lu,Zheng Liang,Youquan Li,Jiejun Tan,Da Pan,Shusen Zhang,Guosheng Dong,Huang Leng

Main category: cs.CL

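TL;DR: This paper proposes Med-R$^3$, a medical retrieval-augmented reasoning framework driven by progressive reinforcement learning. By jointly optimizing retrieval and reasoning, it substantially improves LLM performance in the medical domain.

Details Motivation: In medical scenarios, existing methods tend to optimize retrieval or reasoning in isolation, without jointly optimizing their coordination, and they rely heavily on supervised fine-tuning (SFT), which limits generalization. In addition, reinforcement learning methods designed for general domains do not adequately account for the specific demands of the medical domain.

Contribution: Proposes the Med-R$^3$ framework, which jointly optimizes retrieval and reasoning, designs a progressive reinforcement learning procedure, and tailors the reward function to the medical domain.

Method: 1. The model is first trained for logical reasoning over medical problems; 2. On that basis, retrieval is adaptively optimized to better match the knowledge corpus and the reasoning process; 3. Finally, the coordination between retrieval and reasoning is jointly optimized.

Result: Med-R$^3$ substantially improves performance: LLaMA3.1-8B-Instruct + Med-R$^3$ surpasses GPT-4o-mini by 3.93%, and Qwen2.5-14B + Med-R$^3$ shows a gain of 13.53%.

Insight: Retrieval-augmented reasoning in medicine benefits from joint optimization of retrieval and reasoning, and progressive reinforcement learning is better suited to the complexity and specific requirements of the domain.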

Abstract: In medical scenarios, effectively retrieving external knowledge and leveraging it for rigorous logical reasoning is of significant importance. Existing work has predominantly focused on enhancing either the retrieval or the reasoning capabilities of models in isolation, with little attention given to their joint optimization, which leads to limited coordination between the two processes. Additionally, current methods rely heavily on supervised fine-tuning (SFT), which can cause models to memorize existing problem-solving pathways, thereby restricting their generalization ability when confronted with novel problem contexts. Furthermore, while some studies have explored improving retrieval-augmented reasoning in general domains via reinforcement learning, their reward function designs do not adequately capture the specific demands of the medical domain. To address these challenges, we introduce Med-R$^3$, a Medical Retrieval-augmented Reasoning framework driven by progressive Reinforcement learning. In this framework, we first develop the model’s ability to perform logical reasoning over medical problems. Subsequently, on the basis of this foundation, we adaptively optimize the retrieval capability to better align with the characteristics of the knowledge corpus and the use of external information throughout the reasoning process. Finally, we conduct joint optimization of the model’s retrieval and reasoning coordination. Extensive experiments indicate that Med-R$^3$ could achieve state-of-the-art performance, with LLaMA3.1-8B-Instruct + Med-R$^3$ surpassing the closed-source GPT-4o-mini by 3.93% at a comparable parameter scale, while Qwen2.5-14B augmented with Med-R$^3$ shows a more substantial gain of 13.53%.

[35] T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text

Alva West,Luodan Zhang,Liuliu Zhang,Minjun Zhu,Yixuan Weng,Yue Zhang

Main category: cs.CL

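TL;DR: T-Detect is a new method for detecting adversarial machine-generated text. It replaces the conventional Gaussian normalization with one based on the heavy-tailed Student’s t-distribution, which improves robustness to statistical outliers and yields strong results on several benchmarks.

Details Motivation: Existing zero-shot detectors assume Gaussian distributions and struggle with the heavy-tailed statistics of adversarial or non-native English text, which degrades detection performance.

Contribution: 1. Proposes a new statistical normalization based on the Student’s t-distribution; 2. Validates the method's robustness under adversarial conditions; 3. Evaluates on the RAID and HART datasets, reaching state-of-the-art results on RAID.

Method: T-Detect replaces Gaussian normalization with a heavy-tailed discrepancy score derived from the t-distribution, computing the detection score by normalizing the log-likelihood against the expected moments of a t-distribution.

Result: On the RAID benchmark, AUROC improves by up to 3.9%, and the method reaches a state-of-the-art AUROC of 0.926 on the Books domain.

Insight: Adversarial texts are markedly leptokurtic (sharply peaked with heavy tails), so the usual Gaussian assumption does not hold and heavy-tailed statistical models suit this detection task better.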

Abstract: The proliferation of sophisticated text generation models necessitates the development of robust detection methods capable of identifying machine-generated content, particularly text designed to evade detection through adversarial perturbations. Existing zero-shot detectors often rely on statistical measures that implicitly assume Gaussian distributions, a premise that falters when confronted with the heavy-tailed statistical artifacts characteristic of adversarial or non-native English texts. This paper introduces T-Detect, a novel detection method that fundamentally redesigns the statistical core of curvature-based detectors. Our primary innovation is the replacement of standard Gaussian normalization with a heavy-tailed discrepancy score derived from the Student’s t-distribution. This approach is theoretically grounded in the empirical observation that adversarial texts exhibit significant leptokurtosis, rendering traditional statistical assumptions inadequate. T-Detect computes a detection score by normalizing the log-likelihood of a passage against the expected moments of a t-distribution, providing superior resilience to statistical outliers. We validate our approach on the challenging RAID benchmark for adversarial text and the comprehensive HART dataset. Experiments show that T-Detect provides a consistent performance uplift over strong baselines, improving AUROC by up to 3.9% in targeted domains. When integrated into a two-dimensional detection framework (CT), our method achieves state-of-the-art performance, with an AUROC of 0.926 on the Books domain of RAID. Our contributions are a new, theoretically justified statistical foundation for text detection, an ablation-validated method that demonstrates superior robustness, and a comprehensive analysis of its performance under adversarial conditions. Our code is released at https://github.com/ResearAI/t-detect.
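
As a rough illustration of the normalization change this abstract describes, the sketch below contrasts a Gaussian-style z-score with a variant that divides by the standard deviation of a Student's t-distribution. The function names, the degrees-of-freedom value `nu`, and the exact form of the score are assumptions made for illustration; the abstract does not give the formula T-Detect actually uses.

```python
import math

def gaussian_score(log_lik, mean, std):
    """Discrepancy under a Gaussian assumption: a plain z-score."""
    return (log_lik - mean) / std

def t_score(log_lik, mean, scale, nu=5.0):
    """Hypothetical heavy-tailed variant (not the paper's formula).

    Divides by the standard deviation of a Student's t-distribution with
    `nu` degrees of freedom and the given scale, which for nu > 2 equals
    scale * sqrt(nu / (nu - 2)).
    """
    if nu <= 2:
        raise ValueError("variance of a t-distribution is undefined for nu <= 2")
    return (log_lik - mean) / (scale * math.sqrt(nu / (nu - 2.0)))

# Illustrative numbers only; they are not taken from the paper.
z = gaussian_score(-42.0, -50.0, 4.0)
t = t_score(-42.0, -50.0, 4.0, nu=5.0)
```

With the same scale parameter, the t-based score is smaller in magnitude than the z-score, because the t-distribution attributes more variance to the tails. In practice `nu` and the scale would have to be estimated from data, which this sketch leaves out.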

[36] DiffLoRA: Differential Low-Rank Adapters for Large Language Models

Alexandre Misrahi,Nadezhda Chirkova,Maxime Louis,Vassilina Nikoulina

Main category: cs.CL

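TL;DR: DiffLoRA is a parameter-efficient adaptation method that adds low-rank adapters to the differential attention mechanism, aiming to keep the efficiency of LoRA while improving performance. Its results across a broad set of evaluations are mediocre, but it shows notable gains in some domains (such as HumanEval).

Details Motivation: The work aims to combine the performance benefits of differential attention with the parameter efficiency of LoRA, in search of a more efficient way to fine-tune models.

Contribution: Proposes DiffLoRA, a low-rank adapter method built on differential attention that seeks to combine parameter efficiency with improved performance.

Method: Low-rank adapters are applied to both the positive and negative terms of the differential attention mechanism, retaining LoRA's parameter efficiency. The method is evaluated across a broad range of tasks.

Result: DiffLoRA falls short of other parameter-efficient fine-tuning methods on most tasks, but it gains 11 points over LoRA on HumanEval.

Insight: The analysis suggests that DiffLoRA's gains in certain domains may stem from its distinctive attention patterns, which points to directions for future optimization.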

Abstract: Differential Transformer has recently been proposed to improve performance in Transformer models by canceling out noise through a denoiser attention mechanism. In this work, we introduce DiffLoRA, a parameter-efficient adaptation of the differential attention mechanism, with low-rank adapters on both positive and negative attention terms. This approach retains the efficiency of LoRA while aiming to benefit from the performance gains of differential attention. We evaluate DiffLoRA across a broad range of NLP tasks, including general benchmarks, many-shot in-context learning, RAG, and long-context tests. We observe that, although DiffLoRA falls short of other parameter-efficient fine-tuning methods in most evaluation tasks, it shows interesting results in certain domains (+11 pts on LoRA for HumanEval). We analyze the attention patterns post-finetuning to identify the reasons for this behavior.
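
To make the mechanism concrete, here is a minimal pure-Python sketch of differential attention, in which a second softmax attention map is scaled by a factor `lam` and subtracted from the first, with a LoRA-style low-rank update added to a frozen projection weight. Which projections receive adapters, the helper names, and the toy matrices are all assumptions for illustration and are not taken from the paper.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def lora_weight(w, lora_a, lora_b, scale=1.0):
    """Frozen weight w (d_in x d_out) plus a low-rank update lora_a @ lora_b."""
    delta = matmul(lora_a, lora_b)  # (d_in x r) @ (r x d_out)
    return [[wv + scale * dv for wv, dv in zip(rw, rd)] for rw, rd in zip(w, delta)]

def attention_map(x, wq, wk):
    """Row-wise softmax of scaled dot products between projected tokens."""
    q, k = matmul(x, wq), matmul(x, wk)
    d = len(q[0])
    scores = matmul(q, transpose(k))
    return softmax_rows([[s / math.sqrt(d) for s in row] for row in scores])

def differential_attention(x, pos, neg, wv, lam):
    """pos and neg are (wq, wk) pairs; returns (A_pos - lam * A_neg) @ V."""
    a_pos = attention_map(x, *pos)
    a_neg = attention_map(x, *neg)
    diff = [[p - lam * n for p, n in zip(rp, rn)] for rp, rn in zip(a_pos, a_neg)]
    return matmul(diff, matmul(x, wv))

# Toy example: two tokens, identity base weights, rank-1 adapter on the
# positive query projection only (an arbitrary choice for this sketch).
eye = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
wq_pos = lora_weight(eye, [[0.1], [0.2]], [[0.3, 0.4]])
out = differential_attention(x, (wq_pos, eye), (eye, eye), eye, lam=0.5)
```

Setting `lam` to 0 recovers ordinary softmax attention, and identical positive and negative maps with `lam` equal to 1 cancel exactly, which is a quick way to sanity-check an implementation of this kind.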

[37] Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs

Nasim Shirvani-Mahdavi,Devin Wingfield,Amin Ghasemi,Chengkai Li

Main category: cs.CL

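TL;DR: The paper explores using large language models to generate natural language explanations for logical rules in knowledge graphs. It tests several prompting strategies and validates the correctness and clarity of the explanations through human evaluation.

Details Motivation: Logical rules in knowledge graphs are complex and hard to understand; natural language explanations are intended to make them more readable and useful.

Contribution: Proposes Rule2Text, which uses large language models to generate explanations of logical rules, and validates the approach through evaluation.

Method: Logical rules are extracted with the AMIE algorithm from datasets including FB15k-237, and strategies such as zero-shot prompting, few-shot prompting, and chain-of-thought reasoning are tested.

Result: The generated explanations perform well on correctness and clarity, although some challenges remain.

Insight: Large language models can generate useful rule explanations, but issues such as hallucination still need to be addressed.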

Abstract: Knowledge graphs (KGs) often contain sufficient information to support the inference of new facts. Identifying logical rules not only improves the completeness of a knowledge graph but also enables the detection of potential errors, reveals subtle data patterns, and enhances the overall capacity for reasoning and interpretation. However, the complexity of such rules, combined with the unique labeling conventions of each KG, can make them difficult for humans to understand. In this paper, we explore the potential of large language models to generate natural language explanations for logical rules. Specifically, we extract logical rules using the AMIE 3.5.1 rule discovery algorithm from the benchmark dataset FB15k-237 and two large-scale datasets, FB-CVT-REV and FB+CVT-REV. We examine various prompting strategies, including zero- and few-shot prompting, the inclusion of variable entity types, and chain-of-thought reasoning. We conduct a comprehensive human evaluation of the generated explanations based on correctness, clarity, and hallucination, and also assess the use of large language models as automatic judges. Our results demonstrate promising performance in terms of explanation correctness and clarity, although several challenges remain for future research. All scripts and data used in this study are publicly available at https://github.com/idirlab/KGRule2NL.

[38] Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities

Yunxiang Yan,Tomohiro Sawada,Kartik Goyal

Main category: cs.CL

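TL;DR: The paper proposes a framework based on cascaded question disclosure for estimating the underlying problem-solving ability of large language models (LLMs) more accurately while keeping evaluation automatic and scalable. By revealing information about the question in stages, the method compares LLMs more fairly and elicits better intermediate reasoning traces than the standard QA paradigm. Empirically, it narrows the performance gap seen in standard evaluation, suggesting that conventional QA evaluation overestimates the differences between models.

Details Motivation: Evaluation on question-answering (QA) benchmarks is automatic and scalable, but it measures underlying problem-solving ability only indirectly. The paper therefore proposes a more direct and general framework that better reflects a model's actual reasoning and problem-solving ability.

Contribution: The main contributions are: 1) a cascaded question disclosure framework that evaluates LLMs more thoroughly by revealing question information step by step; 2) evidence that the method induces better intermediate reasoning traces and narrows the performance gap between models; 3) validation of the framework's effectiveness and generality on diverse datasets and through ablation studies.

Method: Question information is disclosed in stages, with each stage revealing part of the question, which prompts the model to show its underlying reasoning. Compared with conventional QA evaluation, collecting responses stage by stage allows a fairer comparison of reasoning ability across LLMs.

Result: Experiments show that the method both improves the comparison between models and produces clearer intermediate reasoning traces. On diverse reasoning and knowledge-heavy QA data it narrows the performance gap between models relative to conventional evaluation, indicating that standard evaluation may overestimate model differences.

Insight: The paper exposes a limitation of current QA evaluation: indirect measurement can obscure a model's real problem-solving ability. Staged disclosure offers a fairer and more transparent evaluation whose results lie closer to the model's underlying reasoning and use of knowledge.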

Abstract: While question-answering (QA) benchmark performance is an automatic and scalable method to compare LLMs, it is an indirect method of evaluating their underlying problem-solving capabilities. Therefore, we propose a holistic and generalizable framework based on cascaded question disclosure that provides a more accurate estimate of the models’ problem-solving capabilities while maintaining the scalability and automation. This approach collects model responses in a stagewise manner with each stage revealing partial information about the question designed to elicit generalized reasoning in LLMs. We find that our approach not only provides a better comparison between LLMs, but also induces better intermediate traces in models compared to the standard QA paradigm. We empirically verify this behavior on diverse reasoning and knowledge-heavy QA datasets by comparing LLMs of varying sizes and families. Our approach narrows the performance gap observed in the standard QA evaluation settings, indicating that the prevalent indirect QA paradigm of evaluation overestimates the differences in performance between models. We further validate our findings by extensive ablation studies.
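
The stagewise collection described in this abstract can be sketched as a simple loop. The two-stage split (question stem first, answer options second), the choice to feed earlier responses forward, and the `echo_model` stand-in are assumptions for illustration; the abstract does not specify how the stages are constructed.

```python
from typing import Callable, List

def cascaded_prompts(stages: List[str]) -> List[str]:
    """Return one prompt per stage; stage i shows all parts revealed so far."""
    return ["\n".join(stages[: i + 1]) for i in range(len(stages))]

def run_cascade(model: Callable[[str], str], stages: List[str]) -> List[str]:
    """Query the model stage by stage, feeding earlier responses forward."""
    responses: List[str] = []
    context = ""
    for part in stages:
        context = (context + "\n" + part).strip()
        reply = model(context)
        responses.append(reply)
        context = context + "\n[previous response] " + reply
    return responses

# Hypothetical staged question: the stem first, then the answer options.
stages = [
    "Question: Which planet has the shortest year?",
    "Options: (A) Mercury (B) Venus (C) Earth (D) Mars",
]

def echo_model(prompt: str) -> str:
    """Stand-in for an LLM call; reports how many lines it was shown."""
    return f"saw {len(prompt.splitlines())} line(s)"

traces = run_cascade(echo_model, stages)
```

In a real evaluation, `echo_model` would be replaced by a call to the model under test, and the per-stage responses would be scored separately from the final answer.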

cs.CV [Back]

[39] CHECK-MAT: Checking Hand-Written Mathematical Answers for the Russian Unified State Exam

Ruslan Khrulev

Main category: cs.CV

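TL;DR: The paper introduces a new benchmark, the EGE-Math Solutions Assessment Benchmark, for evaluating how well vision-language models (VLMs) grade hand-written mathematical solutions. It reveals the limitations of current models in mathematical reasoning and in alignment with human grading rubrics.

Details Motivation: Existing benchmarks focus on solving math problems and do not assess the understanding of student solutions. This paper fills that gap by focusing on grading hand-written solutions, identifying mistakes, and assigning scores according to fixed criteria.

Contribution: 1. Releases the EGE-Math Solutions Assessment Benchmark, containing 122 scanned solutions from the Russian Unified State Exam together with expert grades.
2. Evaluates seven modern vision-language models and finds shortcomings in mathematical reasoning and rubric alignment.

Method: 1. Collects and curates hand-written mathematical solutions and expert grades as benchmark data.
2. Tests seven VLMs from Google, OpenAI, Arcee AI, and Alibaba Cloud in three inference modes.

Result: The experiments show that current models have significant limitations in mathematical reasoning and in alignment with human grading rubrics, opening new research directions in AI-assisted assessment.

Insight: The study shows both the potential and the challenges of VLMs in math grading tasks and underlines the need to improve mathematical reasoning and rubric alignment.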

Abstract: This paper introduces a novel benchmark, EGE-Math Solutions Assessment Benchmark, for evaluating Vision-Language Models (VLMs) on their ability to assess hand-written mathematical solutions. Unlike existing benchmarks that focus on problem solving, our approach centres on understanding student solutions, identifying mistakes, and assigning grades according to fixed criteria. We compile 122 scanned solutions from the Russian Unified State Exam (EGE) together with official expert grades, and evaluate seven modern VLMs from Google, OpenAI, Arcee AI, and Alibaba Cloud in three inference modes. The results reveal current limitations in mathematical reasoning and human-rubric alignment, opening new research avenues in AI-assisted assessment. The code is available at https://github.com/Karifannaa/Auto-check-EGE-math

[40] Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction

Giuseppe Cartella,Vittorio Cuculo,Alessandro D’Amelio,Marcella Cornia,Giuseppe Boccignone,Rita Cucchiara

Main category: cs.CV

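TL;DR: ScanDiff is a new architecture that combines diffusion models with Vision Transformers to generate diverse and realistic human gaze scanpaths. By explicitly modeling scanpath variability, it outperforms existing methods.

Details Motivation: Existing deep learning models for scanpath prediction usually produce averaged behavior and fail to capture the variability of human visual exploration.

Contribution: Proposes the ScanDiff architecture, which combines diffusion models with Vision Transformers to generate diverse and accurate gaze scanpaths, and introduces textual conditioning to support task-driven scanpath generation.

Method: The stochastic nature of diffusion models is used to model scanpath variability, combined with Vision Transformers and textual conditioning for task-driven generation.

Result: On benchmark datasets, ScanDiff outperforms existing methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths.

Insight: The stochasticity of diffusion models is an effective way to model the variability of human visual behavior, and textual conditioning further improves adaptation to different tasks.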

Abstract: Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research. Source code and models are publicly available at https://aimagelab.github.io/ScanDiff.

[41] Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

Krishan Agyakari Raja Babu,Om Prabhu,Annu,Mohanasankar Sivaprakasam

Main category: cs.CV

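TL;DR: The paper studies how super-resolution (SR) can improve classification accuracy on low-quality echocardiograms in resource-constrained settings, in support of AI-assisted diagnosis.

Details Motivation: In under-resourced regions, echocardiographic image quality is poor, which hurts the performance of automated diagnostic models. Super-resolution has shown promise in other medical imaging modalities, but its application to echocardiography remains underexplored.

Contribution: The main contribution is showing that deep learning-based super-resolution (SRGAN and SRResNet) improves classification on low-quality echocardiograms, with significant gains on both the simpler and the more complex task.

Method: Using the CAMUS dataset stratified by image quality, the study tests two SR models (SRGAN and SRResNet) on two clinical tasks: 2CH vs. 4CH view classification and ED vs. ES phase classification.

Result: SRResNet improves classification performance while also being more computationally efficient, recovering much of the diagnostic value of low-quality echocardiograms.

Insight: Super-resolution can compensate for poor image quality in resource-constrained settings and offers a practical route to AI-assisted diagnosis.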

Abstract: Automated cardiac interpretation in resource-constrained settings (RCS) is often hindered by poor-quality echocardiographic imaging, limiting the effectiveness of downstream diagnostic models. While super-resolution (SR) techniques have shown promise in enhancing magnetic resonance imaging (MRI) and computed tomography (CT) scans, their application to echocardiography, a widely accessible but noise-prone modality, remains underexplored. In this work, we investigate the potential of deep learning-based SR to improve classification accuracy on low-quality 2D echocardiograms. Using the publicly available CAMUS dataset, we stratify samples by image quality and evaluate two clinically relevant tasks of varying complexity: a relatively simple Two-Chamber vs. Four-Chamber (2CH vs. 4CH) view classification and a more complex End-Diastole vs. End-Systole (ED vs. ES) phase classification. We apply two widely used SR models, Super-Resolution Generative Adversarial Network (SRGAN) and Super-Resolution Residual Network (SRResNet), to enhance poor-quality images and observe significant gains in performance metrics, particularly with SRResNet, which also offers computational efficiency. Our findings demonstrate that SR can effectively recover diagnostic value in degraded echo scans, making it a viable tool for AI-assisted care in RCS, achieving more with less.

[42] Adaptive Time-step Training for Enhancing Spike-Based Neural Radiance Fields

Ranxi Lin,Canming Yao,Jiayi Li,Weihang Liu,Xin Lou,Pingqiang Zhou

Main category: cs.CV

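TL;DR: The paper proposes PATA, which dynamically adjusts the number of time steps in an SNN-based NeRF framework to balance rendering quality against computational cost, substantially reducing inference time steps and power consumption.

Details Motivation: NeRF performs well on 3D reconstruction and rendering, but its reliance on dense point sampling causes high computational cost and limits its use in resource-constrained scenarios. Spiking Neural Networks (SNNs) are a potential solution because of their low energy consumption.

Contribution: Proposes PATA, a dynamic time-step training strategy for SNN-based NeRF that enables scene-adaptive inference, reducing inference time steps by 64% and running power by 61.55%.

Method: Uses the Pretrain-Adaptive Time-step Adjustment strategy to adjust the time-step length during training, built on the Instant-NGP architecture.

Result: Experiments show that PATA substantially reduces computational resource consumption while preserving rendering quality.

Insight: A dynamic time-step strategy can balance efficiency and quality for SNNs in neural rendering, offering a practical option for resource-constrained scenarios.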

Abstract: Neural Radiance Fields (NeRF)-based models have achieved remarkable success in 3D reconstruction and rendering tasks. However, during both training and inference, these models rely heavily on dense point sampling along rays from multiple viewpoints, resulting in a surge in floating-point operations and severely limiting their use in resource-constrained scenarios like edge computing. Spiking Neural Networks (SNNs), which communicate via binary spikes over discrete time steps, offer a promising alternative due to their energy-efficient nature. Given the inherent variability in scene scale and texture complexity in neural rendering and the prevailing practice of training separate models per scene, we propose a spike-based NeRF framework with a dynamic time step training strategy, termed Pretrain-Adaptive Time-step Adjustment (PATA). This approach automatically explores the trade-off between rendering quality and time step length during training. Consequently, it enables scene-adaptive inference with variable time steps and reduces the additional consumption of computational resources in the inference process. Anchoring to the established Instant-NGP architecture, we evaluate our method across diverse datasets. The experimental results show that PATA can preserve rendering fidelity while reducing inference time steps by 64% and running power by 61.55%.

[43] Early Goal-Guided Multi-Scale Fusion for Real-Time Vision-Language Driving

Santosh Patapati,Trisanth Srinivasan

Main category: cs.CV

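TL;DR: NovaDrive is a real-time vision-language driving architecture that uses multimodal fusion and a lightweight cross-attention block, improving success rate and path efficiency in autonomous driving.

Details Motivation: Autonomous driving requires fast reactions in complex environments, and current methods fall short in real-time performance and multimodal fusion. NovaDrive aims to address both.

Contribution: 1. Proposes NovaDrive, a single-branch architecture with multi-scale fusion; 2. Designs a lightweight cross-attention block and a smoothness loss; 3. Achieves real-time inference through partial fine-tuning.

Method: 1. A single branch processes images, HD maps, LiDAR, and textual waypoints; 2. Two-stage cross-attention aligns the inputs; 3. A smoothness loss regularizes driving behavior.

Result: On nuScenes/Waymo, success rate rises by 4%, path efficiency rises by 0.11, and collision rate falls by 1.4%.

Insight: Waypoint tokens and partial VLM fine-tuning matter most for the performance gains; the smoothness loss also reduces energy use.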

Abstract: Autonomous vehicles must react in milliseconds while reasoning about road geometry and traffic intent to navigate complex situations. We introduce NovaDrive, a single-branch vision-language architecture that processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints in a single branch. A lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Coupled with a novel smoothness loss that discourages abrupt steering and speed changes, this design eliminates the need for recurrent memory. We fine-tune the top 15 layers of an 11B LLaMA-3.2 vision-language backbone, enabling real-time inference. On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises success rate to 84% (+4%), boosts path-efficiency (SPL) to 0.66 (+0.11), and reduces collision frequency from 2.6% to 1.2% (-1.4%) relative to the previous state-of-the-art. Our ablations confirm that waypoint tokens, partial VLM fine-tuning, and the cross-attention fusion each contribute the most to these gains. Beyond safety, NovaDrive’s shorter routes (resulting from the novel smoothness loss) translate to lower fuel or battery usage, pointing toward leaner, more easily updated driving stacks. NovaDrive can be extended to other embodied-AI domains as well.
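
A loss that discourages abrupt steering and speed changes, as mentioned in this abstract, could take the form of a penalty on differences between consecutive commands. The squared-difference form, the weights, and the function name below are assumptions for illustration; the abstract does not state the exact formulation NovaDrive uses.

```python
def smoothness_loss(steering, speed, w_steer=1.0, w_speed=1.0):
    """Mean squared change between consecutive steering and speed commands."""
    if len(steering) != len(speed):
        raise ValueError("steering and speed must have the same length")
    if len(steering) < 2:
        return 0.0
    steps = len(steering) - 1
    d_steer = sum((b - a) ** 2 for a, b in zip(steering, steering[1:])) / steps
    d_speed = sum((b - a) ** 2 for a, b in zip(speed, speed[1:])) / steps
    return w_steer * d_steer + w_speed * d_speed

# Illustrative command sequences (steering in radians, speed in m/s).
smooth = smoothness_loss([0.0, 0.1, 0.2], [5.0, 5.0, 5.0])
jerky = smoothness_loss([0.0, 0.5, -0.5], [5.0, 8.0, 4.0])
```

A gradual turn at constant speed yields a small penalty, while the sequence with sharp reversals is penalized far more heavily, so adding such a term to the training objective would favor gentler trajectories.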

[44] Reference-Guided Diffusion Inpainting For Multimodal Counterfactual Generation

Alexandru Buburuzan

Main category: cs.CV

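TL;DR: The paper proposes two new methods, MObI and AnydoorMed, for multimodal synthetic data generation in autonomous driving and medical image analysis. Both are based on diffusion models and aim for high realism and controllability.

Details Motivation: Safety-critical applications such as autonomous driving and medical image analysis need large amounts of multimodal data for testing, but collecting real data is costly and complex, so synthetic data methods with high realism and controllability are needed.

Contribution: 1. MObI: a first-of-its-kind multimodal object inpainting framework that supports 3D object insertion into camera and lidar data; 2. AnydoorMed: a reference-guided inpainting method for medical images that preserves the structure of an anomaly while blending it semantically with the surrounding tissue.

Method: 1. MObI uses a diffusion model conditioned on a 3D bounding box to control object placement and scale; 2. AnydoorMed uses a diffusion-based model for reference-guided inpainting in medical images.

Result: The proposed methods generate realistic, controllable multimodal data in autonomous driving and medical imaging, showing that foundation models for reference-guided inpainting transfer across domains.

Insight: Diffusion models show promise for cross-modal synthetic data generation and offer a new route to building highly realistic counterfactual scenarios.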

Abstract: Safety-critical applications, such as autonomous driving and medical image analysis, require extensive multimodal data for rigorous testing. Synthetic data methods are gaining prominence due to the cost and complexity of gathering real-world data, but they demand a high degree of realism and controllability to be useful. This work introduces two novel methods for synthetic data generation in autonomous driving and medical image analysis, namely MObI and AnydoorMed, respectively. MObI is a first-of-its-kind framework for Multimodal Object Inpainting that leverages a diffusion model to produce realistic and controllable object inpaintings across perceptual modalities, demonstrated simultaneously for camera and lidar. Given a single reference RGB image, MObI enables seamless object insertion into existing multimodal scenes at a specified 3D location, guided by a bounding box, while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, this approach uses 3D bounding box conditioning to ensure accurate spatial positioning and realistic scaling. AnydoorMed extends this paradigm to the medical imaging domain, focusing on reference-guided inpainting for mammography scans. It leverages a diffusion-based model to inpaint anomalies with impressive detail preservation, maintaining the reference anomaly’s structural integrity while semantically blending it with the surrounding tissue. Together, these methods demonstrate that foundation models for reference-guided inpainting in natural images can be readily adapted to diverse perceptual modalities, paving the way for the next generation of systems capable of constructing highly realistic, controllable and multimodal counterfactual scenarios.

[45] Vision-Language Fusion for Real-Time Autonomous Driving: Goal-Centered Cross-Attention of Camera, HD-Map, & Waypoints

Santosh Patapati,Trisanth Srinivasan,Murari Ambati

Main category: cs.CV

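TL;DR: The paper proposes XYZ-Drive, a single vision-language model that fuses camera, HD-map, and waypoint inputs through a goal-centered cross-attention layer, improving the real-time performance and accuracy of autonomous driving.

Details Motivation: Autonomous driving needs both geometric accuracy and semantic understanding, yet existing approaches usually handle them separately. XYZ-Drive addresses this through multimodal fusion.

Contribution: 1. Proposes a lightweight goal-centered cross-attention layer that fuses waypoints, images, and maps; 2. Demonstrates the importance of multimodal fusion for driving tasks; 3. Uses ablations to show the effect of individual design choices, such as fine-tuning and map resolution.

Method: Uses a LLaMA-3.2 11B model with a goal-centered cross-attention layer to fuse camera frames, HD maps, and waypoints, and outputs steering and speed commands.

Result: On the MD-NEX Outdoor-Driving benchmark, XYZ-Drive reaches 95% success and 0.80 SPL, surpassing PhysNav-DG by 15% and halving the collision rate.

Insight: Multimodal fusion and goal-centered attention are essential for driving tasks; fine-tuning the pretrained model and keeping the map at high resolution are key to performance.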

Abstract: Autonomous cars need geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle them separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25m $\times$ 25m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action and textual explanations, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15% and halving collisions, all while significantly improving efficiency by using only a single branch. Sixteen ablations explain the gains. Removing any modality (vision, waypoint, map) drops success by up to 11%, confirming their complementary roles and rich connections. Replacing goal-centered attention with simple concatenation cuts 3% in performance, showing query-based fusion injects map knowledge more effectively. Keeping the transformer frozen loses 5%, showing the importance of fine-tuning when applying VLMs for specific tasks such as autonomous driving. Coarsening map resolution from 10 cm to 40 cm blurs lane edges and raises crash rate. Overall, these results demonstrate that early, token-level fusion of intent and map layout enables accurate, transparent, real-time driving.
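
The idea of letting a waypoint token highlight relevant image and map patches can be illustrated with single-query scaled dot-product attention, where the waypoint embedding is the query and the patch embeddings are the keys and values. The function name, the two-dimensional toy embeddings, and the single-head form are assumptions for illustration, not the layer used in XYZ-Drive.

```python
import math

def goal_centered_attention(waypoint_query, patch_keys, patch_values):
    """Single-query attention: the waypoint token scores every patch token."""
    d = len(waypoint_query)
    scores = [sum(q * k for q, k in zip(waypoint_query, key)) / math.sqrt(d)
              for key in patch_keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of patch values gives the fused representation.
    fused = [sum(w * v[i] for w, v in zip(weights, patch_values))
             for i in range(len(patch_values[0]))]
    return weights, fused

# Toy tokens (illustrative values): two image patches followed by one map patch.
query = [1.0, 0.0]
keys = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights, fused = goal_centered_attention(query, keys, values)
```

The patch whose key is most aligned with the waypoint query receives the largest weight, which is the sense in which the goal "highlights" relevant regions; simple concatenation, by contrast, would pass every patch through with equal emphasis.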

[46] Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model

Dmitry Demidov,Zaigham Zaheer,Omkar Thawakar,Salman Khan,Fahad Shahbaz Khan

Main category: cs.CV

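TL;DR: The paper proposes E-FineR, a vocabulary-free method for fine-grained visual recognition that combines a language model with the enriched context of a vision-language model to enable open-set recognition, and it also performs well in zero-shot and few-shot classification.

Details Motivation: Traditional fine-grained image classification relies on fixed vocabularies and closed-set classification, which copes poorly with the frequent appearance of new classes in the real world. Methods that combine LLMs with VLMs enable open-set recognition, but they make limited use of the LLM at the classification stage and rely on guessed class names without thorough analysis.

Contribution: Proposes E-FineR, a training-free method that uses a contextually enriched vision-language model for open-set fine-grained visual recognition, achieving state-of-the-art performance with better interpretability and adaptability.

Method: E-FineR uses a language model to generate rich contextual descriptions and combines them with a vision-language model for fine-grained recognition, without predefined class labels or training.

Result: The method performs strongly on fine-grained recognition and is on par with the existing state of the art in zero-shot and few-shot classification, without human intervention.

Insight: Through flexible, language-driven understanding, E-FineR supports the shift in image classification from fixed label prediction to scalable, generalizable systems, which suits real-world settings where annotation is difficult.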

Abstract: Fine-grained image classification, the task of distinguishing between visually similar subcategories within a broader category (e.g., bird species, car models, flower types), is a challenging computer vision problem. Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms, limiting their scalability and adaptability in real-world settings where novel classes frequently emerge. Recent research has demonstrated that combining large language models (LLMs) with vision-language models (VLMs) makes open-set recognition possible without the need for predefined class labels. However, the existing methods are often limited in harnessing the power of LLMs at the classification phase, and also rely heavily on the guessed class names provided by an LLM without thorough analysis and refinement. To address these bottlenecks, we propose our training-free method, Enriched-FineR (or E-FineR for short), which demonstrates state-of-the-art results in fine-grained visual recognition while also offering greater interpretability, highlighting its strong potential in real-world scenarios and new domains where expert annotations are difficult to obtain. Additionally, we demonstrate the application of our proposed approach to zero-shot and few-shot classification, where it demonstrated performance on par with the existing SOTA while being training-free and not requiring human interventions. Overall, our vocabulary-free framework supports the shift in image classification from rigid label prediction to flexible, language-driven understanding, enabling scalable and generalizable systems for real-world applications. Well-documented code is available on https://github.com/demidovd98/e-finer.

[47] Details Matter for Indoor Open-vocabulary 3D Instance Segmentation

Sanghun Jung,Jingjing Zheng,Ke Zhang,Nan Qiao,Albert Y. C. Chen,Lu Xia,Chi Liu,Yuyin Sun,Xiao Zeng,Hsiang-Wei Huang,Byron Boots,Min Sun,Cheng-Hao Kuo

Main category: cs.CV

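TL;DR: This paper proposes a new framework for open-vocabulary 3D instance segmentation. It combines a two-stage scheme of 3D proposal generation and instance classification with Alpha-CLIP and a standardized maximum similarity (SMS) score, achieving state-of-the-art performance on ScanNet200 and S3DIS.

Details Motivation: Existing open-vocabulary 3D instance segmentation methods have introduced a variety of concepts, but these concepts are complementary rather than mutually exclusive. The authors combine and refine them to address the challenges of existing methods and improve performance.

Contribution: 1. Proposes a two-stage method that combines 3D proposal generation with instance classification; 2. Replaces the standard CLIP model with Alpha-CLIP to reduce background noise; 3. Introduces the SMS score to improve classification precision.

Method: Proposals are generated through 3D tracking, and overlapped or partial proposals are removed by iterative merging and removal. Alpha-CLIP performs instance classification, and the SMS score normalizes similarity.

Result: On ScanNet200 and S3DIS, the method leads on all AP and AR metrics and even surpasses an end-to-end closed-vocabulary method.

Insight: 1. Careful attention to detail is key to improving open-vocabulary 3D instance segmentation; 2. Combining complementary concepts is more effective than using any one of them alone; 3. Alpha-CLIP and the SMS score notably improve classification precision.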

Abstract: Unlike closed-vocabulary 3D instance segmentation that is often trained end-to-end, open-vocabulary 3D instance segmentation (OV-3DIS) often leverages vision-language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed from existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this paper, we propose a new state-of-the-art solution for OV-3DIS by carefully designing a recipe to combine the concepts together and refining them to address key challenges. Our solution follows the two-stage scheme: 3D proposal generation and instance classification. We employ robust 3D tracking-based proposal aggregation to generate 3D proposals and remove overlapped or partial proposals by iterative merging/removal. For the classification stage, we replace the standard CLIP model with Alpha-CLIP, which incorporates object masks as an alpha channel to reduce background noise and obtain object-centric representation. Additionally, we introduce the standardized maximum similarity (SMS) score to normalize text-to-proposal similarity, effectively filtering out false positives and boosting precision. Our framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.
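The abstract does not give the SMS formula, so the following is only a minimal sketch of one plausible reading: standardize each class's text-to-proposal similarities across all proposals (a z-score), then take each proposal's maximum over classes. The function name `sms_scores`, the z-score choice, and the toy numbers are assumptions for illustration, not the authors' definition.

```python
from statistics import mean, pstdev

def sms_scores(sim):
    """sim[p][c] is the similarity between proposal p and class text c.

    Returns one (best_class_index, standardized_score) pair per proposal.
    ASSUMPTION: "standardized" is read as a per-class z-score across proposals.
    """
    n_classes = len(sim[0])
    stats = []
    for c in range(n_classes):
        column = [row[c] for row in sim]
        # Fall back to 1.0 when a class has identical similarity everywhere.
        stats.append((mean(column), pstdev(column) or 1.0))
    results = []
    for row in sim:
        z = [(row[c] - stats[c][0]) / stats[c][1] for c in range(n_classes)]
        best = max(range(n_classes), key=lambda c: z[c])
        results.append((best, z[best]))
    return results

# Toy data: class 0 scores uniformly high for every proposal (a "generic" text),
# so raw argmax would always pick it; class 1 stands out only for proposal 0.
toy_sim = [[0.30, 0.20], [0.31, 0.10], [0.29, 0.12]]
toy_result = sms_scores(toy_sim)
```

Under this reading, a proposal whose best standardized score is still negative (proposal 2 in the toy data) would be a natural candidate for filtering as a false positive.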

[48] X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention

Xiaochen Zhao,Hongyi Xu,Guoxian Song,You Xie,Chenxu Zhang,Xiu Li,Linjie Luo,Jinli Suo,Yebin Liu

Main category: cs.CV

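TL;DR: X-NeMo is a zero-shot, diffusion-based portrait animation method that uses disentangled latent attention to address identity leakage and the difficulty of capturing expressions, producing high-quality animations.

Details Motivation: Prior portrait animation approaches suffer from identity leakage and struggle to capture subtle and extreme expressions. X-NeMo aims for more precise expression control by disentangling a latent motion descriptor.

Contribution: 1. An end-to-end training framework that distills a 1D identity-agnostic motion descriptor from the driving image. 2. A dual GAN decoder with spatial and color augmentations that further disentangles motion from identity cues.

Method: 1. Motion is controlled through cross-attention in a diffusion model. 2. The driving motion is embedded into a 1D latent vector, which avoids passing spatially aligned structural cues to the backbone. 3. A dual GAN decoder and data augmentation supervise the learning of the motion descriptor.

Result: X-NeMo surpasses state-of-the-art baselines in experiments, producing more expressive animations with better identity resemblance.

Insight: Latent attention that disentangles motion from identity is the key innovation and offers a new direction for portrait animation.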

Abstract: We propose X-NeMo, a novel zero-shot diffusion-based portrait animation pipeline that animates a static portrait using facial movements from a driving video of a different individual. Our work first identifies the root causes of the key issues in prior approaches, such as identity leakage and difficulty in capturing subtle and extreme expressions. To address these challenges, we introduce a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from the driving image, effectively controlling motion through cross-attention during image generation. Our implicit motion descriptor captures expressive facial motion in fine detail, learned end-to-end from a diverse video dataset without reliance on pretrained motion detectors. We further enhance expressiveness and disentangle motion latents from identity cues by supervising their learning with a dual GAN decoder, alongside spatial and color augmentations. By embedding the driving motion into a 1D latent vector and controlling motion via cross-attention rather than additive spatial guidance, our design eliminates the transmission of spatially aligned structural cues from the driving condition to the diffusion backbone, substantially mitigating identity leakage. Extensive experiments demonstrate that X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with superior identity resemblance. Our code and models are available for research.

[49] Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space

Shiyao Yu,Zi-An Wang,Kangning Yin,Zheng Tian,Mingyuan Zhang,Weixin Si,Shihao Zou

Main category: cs.CV

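TL;DR: The paper proposes a multi-modal motion retrieval framework that aligns four modalities (text, audio, video, and motion) in a joint embedding space, introducing audio for the first time to enhance user immersion and convenience.

Details Motivation: Existing motion retrieval methods typically build a unified embedding space with contrastive learning, but they lack an intuitive interaction mode and overlook the sequential representation of the modalities.

Contribution: The paper is the first to introduce the audio modality into motion retrieval, and it proposes a sequence-level contrastive learning approach that builds a fine-grained joint embedding space.

Method: Sequence-level contrastive learning aligns the four modalities (text, audio, video, and motion); the framework is evaluated on existing text-motion datasets augmented with synthetic audio.

Result: On the HumanML3D dataset, R@10 for text-to-motion retrieval improves by 10.16% and R@1 for video-to-motion retrieval improves by 25.43%.

Insight: The 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.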

Abstract: Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modality. However, these methods lack a more intuitive and user-friendly interaction mode and often overlook the sequential representation of most modalities for improved retrieval performance. To address these limitations, we propose a framework that aligns four modalities – text, audio, video, and motion – within a fine-grained joint embedding space, incorporating audio for the first time in motion retrieval to enhance user immersion and convenience. This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better alignment. To evaluate our framework, we augment existing text-motion datasets with synthetic but diverse audio recordings, creating two multi-modal motion retrieval datasets. Experimental results demonstrate superior performance over state-of-the-art methods across multiple sub-tasks, including a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our results show that our 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.
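The reported numbers are R@K (recall at K) values. As a reminder of what that metric measures, here is a minimal sketch assuming exactly one ground-truth motion per query; the function name and the toy identifiers (`m1`, `m2`, `m3`) are hypothetical and not taken from the paper.

```python
def recall_at_k(ranked_ids, gt_ids, k):
    """Fraction of queries whose ground-truth item appears in the top-k results.

    ranked_ids[i] is the ranked candidate list for query i;
    gt_ids[i] is that query's single ground-truth item (an assumption here).
    """
    hits = sum(1 for ranked, gt in zip(ranked_ids, gt_ids) if gt in ranked[:k])
    return hits / len(gt_ids)

# Three toy queries, each with a ranked list of motion identifiers.
toy_rankings = [["m3", "m1", "m2"], ["m2", "m3", "m1"], ["m1", "m2", "m3"]]
toy_truth = ["m1", "m2", "m3"]

r_at_1 = recall_at_k(toy_rankings, toy_truth, 1)  # only the second query hits
r_at_2 = recall_at_k(toy_rankings, toy_truth, 2)  # first and second queries hit
```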

[50] A Novel Dataset for Flood Detection Robust to Seasonal Changes in Satellite Imagery

Youngsun Jang,Dongyoun Kim,Chulwoo Pack,Kwanghee Won

Main category: cs.CV

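TL;DR: The paper introduces a new satellite image dataset for semantic segmentation of flooded areas, filling a gap in existing datasets for this task, and benchmarks existing models on it.

Details Motivation: Existing satellite imagery datasets are poorly suited to flooded-area segmentation, and the effect of seasonal changes on image features has not been studied enough, so a new dataset is needed to fill the gap.

Contribution: The main contribution is a new flood detection dataset (satellite imagery of the 2019 Midwestern USA floods) with uniform-resolution images, which also provides an experimental baseline for future multimodal and temporal learning strategies.

Method: Satellite imagery is collected from Planet Labs for ten locations in each of five states, with 10 images per location; the images are processed to a uniform resolution and size, and several semantic segmentation models are tested. An ablation study examines the effect of window size on performance.

Result: Existing models achieve only modest results on the dataset, suggesting that multimodal and temporal learning methods are needed to improve performance.

Insight: Seasonal changes may significantly affect flood detection in satellite imagery; future work should incorporate more modalities and temporal information to improve model robustness.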

Abstract: This study introduces a novel dataset for segmenting flooded areas in satellite images. After reviewing 77 existing benchmarks utilizing satellite imagery, we identified a shortage of suitable datasets for this specific task. To fill this gap, we collected satellite imagery of the 2019 Midwestern USA floods from Planet Explorer by Planet Labs (Image © 2024 Planet Labs PBC). The dataset consists of 10 satellite images per location, each containing both flooded and non-flooded areas. We selected ten locations from each of the five states: Iowa, Kansas, Montana, Nebraska, and South Dakota. The dataset ensures uniform resolution and resizing during data processing. For evaluating semantic segmentation performance, we tested state-of-the-art models in computer vision and remote sensing on our dataset. Additionally, we conducted an ablation study varying window sizes to capture temporal characteristics. Overall, the models demonstrated modest results, suggesting a requirement for future multimodal and temporal learning strategies. The dataset will be publicly available at https://github.com/youngsunjang/SDSU_MidWest_Flood_2019.

[51] Adversarial-Guided Diffusion for Multimodal LLM Attacks

Chengwei Xia,Fan Ma,Ruijie Quan,Kun Zhan,Yi Yang

Main category: cs.CV

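TL;DR: The paper proposes AGD, a diffusion-based adversarial attack that uses adversarial-guided noise to deceive multimodal large language models (MLLMs) while avoiding significant image distortion. AGD injects target semantics into the noise of the reverse diffusion, which gives the adversarial signal a full-spectrum property and makes it robust to a variety of defenses. Experiments show that AGD outperforms existing methods in both attack performance and robustness to defenses.

Details Motivation: The safety of multimodal large language models (MLLMs) is an increasingly prominent concern. Traditional adversarial attacks typically embed high-frequency perturbations in the clean image, which simple low-pass filtering can defend against. The paper aims for a more robust attack that exploits the properties of diffusion models.

Contribution: The paper proposes adversarial-guided diffusion (AGD), which injects target semantics into the diffusion noise rather than directly into the image, so that the adversarial signal spans the full frequency spectrum; it also demonstrates AGD's robustness to several defenses.

Method: The adversarial signal is injected into the noise component of the reverse diffusion process; the adversarial image is formed as a linear combination of the clean image and the noise, which avoids concentrating the perturbation in the high-frequency band.

Result: Experiments show that AGD outperforms existing methods when attacking MLLMs and is more robust to defenses such as low-pass filtering.

Insight: The noise component of a diffusion model can carry an adversarial signal, and its full-spectrum property makes the attack harder to defend against, suggesting a new direction for adversarial attack design.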

Abstract: This paper addresses the challenge of generating adversarial images with a diffusion model to deceive multimodal large language models (MLLMs) into generating targeted responses, while avoiding significant distortion of the clean image. To address this challenge, we propose an adversarial-guided diffusion (AGD) approach for adversarial attacks on MLLMs. We introduce adversarial-guided noise to ensure attack efficacy. A key observation in our design is that, unlike most traditional adversarial attacks, which embed high-frequency perturbations directly into the clean image, AGD injects target semantics into the noise component of the reverse diffusion. Since the added noise in a diffusion model spans the entire frequency spectrum, the adversarial signal embedded within it also inherits this full-spectrum property. Importantly, during reverse diffusion, the adversarial image is formed as a linear combination of the clean image and the noise. Thus, when applying defenses such as simple low-pass filtering, which act independently on each component, the adversarial signal within the noise component is less likely to be suppressed, as it is not confined to the high-frequency band. This makes AGD inherently robust to a variety of defenses. Extensive experiments demonstrate that AGD outperforms state-of-the-art methods in attack performance as well as in robustness to some defenses.
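The frequency argument can be illustrated with a toy 1D experiment. This is a sketch of the general signal-processing intuition only, not of AGD itself: a circular moving average stands in for the low-pass defense, an alternating ±1 sequence stands in for a purely high-frequency perturbation, and seeded Gaussian white noise stands in for a full-spectrum signal. All names and parameters are assumptions.

```python
import random

def moving_average(signal, window):
    """Circular moving-average filter, used here as a simple low-pass filter."""
    n = len(signal)
    return [sum(signal[(i + j) % n] for j in range(window)) / window
            for i in range(n)]

def energy(signal):
    """Sum of squared samples."""
    return sum(x * x for x in signal)

def retained_energy(perturbation, window=4):
    """Fraction of the perturbation's energy that survives the filter."""
    return energy(moving_average(perturbation, window)) / energy(perturbation)

n = 256
# Alternating +1/-1: all energy sits at the highest representable frequency.
high_freq = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]
# White noise: energy is spread over the whole spectrum.
rng = random.Random(0)
full_spectrum = [rng.gauss(0.0, 1.0) for _ in range(n)]

hf_kept = retained_energy(high_freq)      # the filter removes it entirely
fs_kept = retained_energy(full_spectrum)  # a sizable share survives
```

Because the filter is linear, filtering a sum of a clean signal and a perturbation equals the sum of the filtered parts, which is the sense in which the defense acts on each component independently.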

[52] Toward Safe, Trustworthy and Realistic Augmented Reality User Experience

Yanming Xiu

Main category: cs.CV

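TL;DR: The work aims to make augmented reality (AR) safer and more trustworthy; it develops the ViDDAR and VIM-Sense systems to detect harmful virtual content and proposes future research directions.

Details Motivation: As AR becomes increasingly integrated into everyday life, ensuring the safety and trustworthiness of virtual content is critical, especially preventing content that obstructs critical information or manipulates user perception.

Contribution: 1. The ViDDAR and VIM-Sense systems, which use vision-language models (VLMs) and multimodal reasoning modules to detect harmful AR content. 2. Three future directions: automated, perceptually aligned quality assessment of virtual content; detection of multimodal attacks; and adaptation of VLMs for efficient deployment on AR devices.

Method: 1. ViDDAR and VIM-Sense combine VLMs with multimodal reasoning to detect harmful AR content. 2. Adapting lightweight models for on-device AR deployment is identified as a direction for future work.

Result: Through these systems and the proposed framework, the work offers an initial solution for safeguarding AR experiences and outlines directions for further improvement.

Insight: Safe AR experiences require both multimodal detection and lightweight model deployment; future research should emphasize perceptual alignment and user-centered design.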

Abstract: As augmented reality (AR) becomes increasingly integrated into everyday life, ensuring the safety and trustworthiness of its virtual content is critical. Our research addresses the risks of task-detrimental AR content, particularly that which obstructs critical information or subtly manipulates user perception. We developed two systems, ViDDAR and VIM-Sense, to detect such attacks using vision-language models (VLMs) and multimodal reasoning modules. Building on this foundation, we propose three future directions: automated, perceptually aligned quality assessment of virtual content; detection of multimodal attacks; and adaptation of VLMs for efficient and user-centered deployment on AR devices. Overall, our work aims to establish a scalable, human-aligned framework for safeguarding AR experiences and seeks feedback on perceptual modeling, multimodal AR content implementation, and lightweight model adaptation.

[53] Ambiguity-Guided Learnable Distribution Calibration for Semi-Supervised Few-Shot Class-Incremental Learning

Fan Lyu,Linglan Zhao,Chengyan Liu,Yinying Mei,Zhang Zhang,Jian Zhang,Fuyuan Hu,Liang Wang

Main category: cs.CV

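TL;DR: The paper defines the Generalized Semi-supervised Few-Shot Class-Incremental Learning (GSemi-FSCIL) problem and proposes an Ambiguity-guided Learnable Distribution Calibration (ALDC) strategy to address existing methods' difficulty in distinguishing unlabeled samples from base and novel classes.

Details Motivation: In real scenarios, unlabeled data may come from base classes or from any previously seen novel class, whereas existing methods assume it comes only from the novel classes of the current session. The authors therefore redefine the problem as generalized Semi-FSCIL and propose ALDC to dynamically calibrate feature distributions.

Contribution: 1. Redefines the problem as generalized Semi-FSCIL; 2. Proposes the ALDC strategy, which dynamically uses base-class samples to calibrate the feature distributions of novel classes; 3. Achieves state-of-the-art results on three benchmark datasets.

Method: ALDC dynamically uses abundant base samples to correct the biased feature distributions of few-shot novel classes, addressing the confusion among unlabeled samples in generalized Semi-FSCIL.

Result: Experiments show that ALDC outperforms existing methods on three benchmark datasets, setting new state-of-the-art results.

Insight: Generalized Semi-FSCIL better matches real-world scenarios, and ALDC's dynamic distribution calibration improves performance when few-shot and unlabeled data are mixed.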

Abstract: Few-Shot Class-Incremental Learning (FSCIL) focuses on models learning new concepts from limited data while retaining knowledge of previous classes. Recently, many studies have started to leverage unlabeled samples to assist models in learning from few-shot samples, giving rise to the field of Semi-supervised Few-shot Class-Incremental Learning (Semi-FSCIL). However, these studies often assume that the source of unlabeled data is only confined to novel classes of the current session, which presents a narrow perspective and cannot align well with practical scenarios. To better reflect real-world scenarios, we redefine Semi-FSCIL as Generalized Semi-FSCIL (GSemi-FSCIL) by incorporating both base and all the ever-seen novel classes in the unlabeled set. This change in the composition of unlabeled samples poses a new challenge for existing methods, as they struggle to distinguish between unlabeled samples from base and novel classes. To address this issue, we propose an Ambiguity-guided Learnable Distribution Calibration (ALDC) strategy. ALDC dynamically uses abundant base samples to correct biased feature distributions for few-shot novel classes. Experiments on three benchmark datasets show that our method outperforms existing works, setting new state-of-the-art results.

[54] Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World Documents

Sungguk Cha,DongWook Kim,Taeseung Hahn,Mintae Kim,Youngsub Han,Byoung-Ki Jeon

Main category: cs.CV

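TL;DR: The paper proposes RL-QR, a reinforcement learning framework for query rewriting that optimizes queries for a specific retriever without human-annotated data and applies to both text-only and multi-modal databases. Experiments show significant gains for multi-modal and lexical retrievers, but no improvement for semantic and hybrid retrievers.

Details Motivation: Query optimization in existing retrieval-augmented generation (RAG) systems relies on human-annotated data and adapts poorly to diverse, unstructured real-world documents, so a scalable solution that needs no human intervention is required.

Contribution: 1. The RL-QR framework, which trains retriever-specific query rewriters with reinforcement learning; 2. Use of Generalized Reward Policy Optimization (GRPO); 3. A demonstration of RL-QR's effectiveness on text and multi-modal retrieval tasks.

Method: 1. Synthesize scenario-question pairs as training data; 2. Optimize the query rewriting policy with GRPO; 3. Train a rewriter tailored to each specific retriever.

Result: RL-QR achieves an 11% relative gain in NDCG@3 for multi-modal retrieval and a 9% gain for lexical retrievers, but no improvement is observed for semantic and hybrid retrievers.

Insight: RL-QR offers a scalable query optimization approach for RAG systems, but its limitations on semantic retrieval point to the need for further work on training misalignment.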

Abstract: Retrieval-Augmented Generation (RAG) systems rely heavily on effective query formulation to unlock external knowledge, yet optimizing queries for diverse, unstructured real-world documents remains a challenge. We introduce RL-QR, a reinforcement learning framework for retriever-specific query rewriting that eliminates the need for human-annotated datasets and extends applicability to both text-only and multi-modal databases. By synthesizing scenario-question pairs and leveraging Generalized Reward Policy Optimization (GRPO), RL-QR trains query rewriters tailored to specific retrievers, enhancing retrieval performance across varied domains. Experiments on industrial in-house data demonstrate significant improvements, with $\text{RL-QR}_{\text{multi-modal}}$ achieving an 11% relative gain in NDCG@3 for multi-modal RAG and $\text{RL-QR}_{\text{lexical}}$ yielding a 9% gain for lexical retrievers. However, challenges persist with semantic and hybrid retrievers, where rewriters failed to improve performance, likely due to training misalignments. Our findings highlight RL-QR’s potential to revolutionize query optimization for RAG systems, offering a scalable, annotation-free solution for real-world retrieval tasks, while identifying avenues for further refinement in semantic retrieval contexts.
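For readers unfamiliar with the metric, the sketch below computes NDCG@k with the common linear-gain, log2-discount formulation and shows what a relative gain means. It assumes the ideal ranking is a reordering of the retrieved list's own relevance labels; the paper's exact evaluation setup and in-house data are not described in the abstract, and the example numbers are made up.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the best possible ordering.

    ASSUMPTION: every relevant document is already in `relevances`.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def relative_gain(new_score, base_score):
    """Relative improvement, e.g. 0.11 for an 11% relative gain."""
    return (new_score - base_score) / base_score

# A ranking that places the two relevant documents at positions 2 and 3.
example_ndcg = ndcg_at_k([0, 1, 1], 3)
# Hypothetical scores: moving from 0.500 to 0.555 is an 11% relative gain.
example_gain = relative_gain(0.555, 0.5)
```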

[55] A Deep Dive into Generic Object Tracking: A Survey

Fereshteh Aghaee Meibodi,Shadi Alijani,Homayoun Najjaran

Main category: cs.CV

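TL;DR: This paper is a comprehensive survey of generic object tracking that analyzes three categories of methods: Siamese-based trackers, discriminative trackers, and the more recent transformer-based trackers, with particular emphasis on the rapid development of transformer-based methods.

Details Motivation: Generic object tracking is challenging because of complex spatio-temporal dynamics, occlusions, and similar distractors. Although some surveys exist, this paper aims to cover all major tracking paradigms, especially the rapidly evolving transformer-based methods.

Contribution: The paper introduces a novel categorization of the three method families and analyzes their design principles, innovations, and limitations through qualitative and quantitative comparisons; it also provides a unified visual and tabular comparison and summarizes evaluation benchmarks and progress.

Method: Through a systematic analysis of Siamese-based, discriminative, and transformer-based methods, the paper proposes a new categorization and compares the core designs of representative methods.

Result: The survey highlights that transformer-based methods, with their robust spatio-temporal modeling capabilities, are driving fast-paced advances in object tracking.

Insight: The paper points to the potential of transformer-based methods in tracking and stresses that future research should address their robustness and efficiency in complex scenes.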

Abstract: Generic object tracking remains an important yet challenging task in computer vision due to complex spatio-temporal dynamics, especially in the presence of occlusions, similar distractors, and appearance variations. Over the past two decades, a wide range of tracking paradigms, including Siamese-based trackers, discriminative trackers, and, more recently, prominent transformer-based approaches, have been introduced to address these challenges. While a few existing survey papers in this field have either concentrated on a single category or widely covered multiple ones to capture progress, our paper presents a comprehensive review of all three categories, with particular emphasis on the rapidly evolving transformer-based methods. We analyze the core design principles, innovations, and limitations of each approach through both qualitative and quantitative comparisons. Our study introduces a novel categorization and offers a unified visual and tabular comparison of representative methods. Additionally, we organize existing trackers from multiple perspectives and summarize the major evaluation benchmarks, highlighting the fast-paced advancements in transformer-based tracking driven by their robust spatio-temporal modeling capabilities.

[56] Towards Affordable Tumor Segmentation and Visualization for 3D Breast MRI Using SAM2

Solha Kang,Eugene Kim,Joris Vankerschaver,Utku Ozbulak

Main category: cs.CV

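TL;DR: The paper investigates whether SAM2 (Segment Anything Model 2) can perform low-cost, minimal-input tumor segmentation in 3D breast MRI. Starting from a bounding-box annotation on a single slice, segmentation predictions are propagated with three slice-wise tracking strategies (top-to-bottom, bottom-to-top, and center-outward). The center-outward strategy performs best, and SAM2 performs well under minimal supervision even though it was not trained on volumetric medical data.

Details Motivation: High-resolution volumetric breast MRI is critical for tumor assessment, but manual analysis of 3D scans is time-consuming and subjective. Commercial AI products are hard to adopt in low- and middle-income countries because of high costs and infrastructure demands, so a low-cost, easy-to-use alternative is needed.

Contribution: The main contribution is showing that SAM2 can perform 3D tumor segmentation from minimal input (a bounding box on a single slice) and comparing the performance of three propagation strategies. The paper also analyzes how tumor size, location, and shape affect segmentation performance.

Method: A bounding box is annotated on a single slice, and SAM2 propagates segmentation predictions across the full 3D volume using three slice-wise tracking strategies (top-to-bottom, bottom-to-top, and center-outward). The strategies are evaluated across a large cohort of patients.

Result: Center-outward propagation yields the most consistent and accurate segmentations. Although SAM2 was not trained on volumetric medical data, it achieves strong segmentation performance under minimal supervision, and the analysis identifies key failure modes.

Insight: General-purpose foundation models such as SAM2 can support 3D medical image analysis with minimal supervision, offering an affordable alternative for resource-constrained settings.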

Abstract: Breast MRI provides high-resolution volumetric imaging critical for tumor assessment and treatment planning, yet manual interpretation of 3D scans remains labor-intensive and subjective. While AI-powered tools hold promise for accelerating medical image analysis, adoption of commercial medical AI products remains limited in low- and middle-income countries due to high license costs, proprietary software, and infrastructure demands. In this work, we investigate whether the Segment Anything Model 2 (SAM2) can be adapted for low-cost, minimal-input 3D tumor segmentation in breast MRI. Using a single bounding box annotation on one slice, we propagate segmentation predictions across the 3D volume using three different slice-wise tracking strategies: top-to-bottom, bottom-to-top, and center-outward. We evaluate these strategies across a large cohort of patients and find that center-outward propagation yields the most consistent and accurate segmentations. Despite being a zero-shot model not trained for volumetric medical data, SAM2 achieves strong segmentation performance under minimal supervision. We further analyze how segmentation performance relates to tumor size, location, and shape, identifying key failure modes. Our results suggest that general-purpose foundation models such as SAM2 can support 3D medical image analysis with minimal supervision, offering an accessible and affordable alternative for resource-constrained settings.
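The three propagation strategies differ only in the order in which slices are visited. A minimal sketch of those orders follows; it assumes that index 0 is the top slice, that center-outward starts at the annotated slice (defaulting to the middle of the volume), and that it sweeps one direction before the other. The function name and these conventions are illustrative assumptions, and no SAM2 API is used.

```python
def propagation_order(num_slices, strategy, start=None):
    """Return the order in which slice indices are visited.

    For "center_outward", `start` is the annotated slice
    (ASSUMPTION: defaults to the middle slice of the volume).
    """
    if strategy == "top_to_bottom":
        return list(range(num_slices))
    if strategy == "bottom_to_top":
        return list(range(num_slices - 1, -1, -1))
    if strategy == "center_outward":
        if start is None:
            start = num_slices // 2
        # Sweep from the annotated slice to the last slice...
        forward = list(range(start, num_slices))
        # ...then from just before the annotated slice back to the first.
        backward = list(range(start - 1, -1, -1))
        return forward + backward
    raise ValueError(f"unknown strategy: {strategy}")

order = propagation_order(5, "center_outward")
```

One intuition for why a central start might help is that no slice is more than about half the volume away from the annotated one, which limits how far tracking errors can accumulate.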

[57] iLRM: An Iterative Large 3D Reconstruction Model

Gyeongjin Kang,Seungtae Nam,Xiangyu Sun,Sameh Khamis,Abdelrahman Mohamed,Eunbyung Park

Main category: cs.CV

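TL;DR: iLRM is an iterative large 3D reconstruction model that improves reconstruction quality and speed by decoupling the scene representation from input views, decomposing multi-view attention, and injecting high-resolution information.

Details Motivation: Current transformer-based 3D reconstruction methods rely on full attention, so their computational cost becomes prohibitive with many views or high-resolution inputs and they scale poorly. iLRM aims to solve this problem.

Contribution: The paper proposes iLRM, which achieves efficient, high-quality 3D reconstruction through three core principles: a decoupled scene representation, a two-stage attention scheme, and high-resolution information injection.

Method: iLRM generates 3D Gaussian representations through an iterative refinement mechanism, reduces computational cost through decoupling and attention decomposition, and injects high-resolution information at every layer.

Result: On datasets such as RE10K and DL3DV, iLRM outperforms existing methods in both reconstruction quality and speed, and it scales better.

Insight: Decoupling and a staged attention scheme are effective ways to improve the efficiency and scalability of 3D reconstruction.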

Abstract: Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering, as well as numerous applications. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input-view images to enable compact 3D representations; (2) decomposing fully-attentional multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed. Notably, iLRM exhibits superior scalability, delivering significantly higher reconstruction quality under comparable computational cost by efficiently leveraging a larger number of input views.

[58] UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing

Hao Tang,Chenwei Xie,Xiaoyi Bao,Tingyu Weng,Pandeng Li,Yun Zheng,Liwei Wang

Main category: cs.CV

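TL;DR: UniLIP extends CLIP beyond understanding to reconstruction, generation, and image editing. With a two-stage training scheme and a self-distillation strategy, UniLIP maintains its original comprehension performance while achieving effective image reconstruction. On generation and editing tasks, it outperforms comparable unified models.

Details Motivation: Existing CLIP-based unified methods usually need additional diffusion decoders or quantization to support reconstruction and generation, which can degrade performance. UniLIP aims to solve this problem with a unified architecture that handles multiple tasks efficiently.

Contribution: 1. A two-stage training scheme and a self-distillation strategy that integrate reconstruction capabilities into CLIP. 2. A dual-condition architecture that connects the MLLM and the diffusion transformer to make full use of the MLLM's reasoning capabilities. 3. Better performance than existing models on generation and editing tasks.

Method: 1. Two-stage training: the first stage preserves CLIP's comprehension ability, and the second progressively introduces the reconstruction task. 2. Self-distillation: used to retain the original performance. 3. Dual-condition architecture: learnable queries and multimodal hidden states serve as joint conditions.

Result: In text-to-image generation, UniLIP scores 0.87 on GenEval and 0.53 on WISE; in image editing, it scores 3.62 on the ImgEdit Benchmark. Both results surpass existing models.

Insight: UniLIP shows how a unified architecture and training strategy can extend CLIP's range of applications while keeping its strengths in understanding tasks, offering a new approach to handling multimodal tasks together.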

Abstract: In this paper, we propose UniLIP, which extends CLIP to reconstruction, generation and editing, thereby building a unified tokenizer upon its exceptional comprehension capabilities. Previous CLIP-based unified methods often require additional diffusion decoders or quantization to support reconstruction and generation tasks, leading to inconsistent reconstruction or degradation of original comprehension performance. In contrast, we introduce a two-stage training scheme and a self-distillation strategy that progressively integrates reconstruction capabilities into CLIP, allowing it to maintain original comprehension performance while achieving effective image reconstruction. Furthermore, we propose a dual-condition architecture to connect the MLLM and diffusion transformer, using both learnable queries and the last layer multimodal hidden states as joint conditions. This method not only enables the utilization of the MLLM’s strong reasoning capabilities in generation tasks, but also maximizes the exploitation of the rich information in UniLIP features during editing tasks. In text-to-image generation tasks, UniLIP obtains scores of 0.87 and 0.53 on the GenEval and WISE benchmarks respectively, surpassing all previous unified models of similar scale. In image editing, UniLIP also achieves a score of 3.62 on the ImgEdit Benchmark, surpassing recent state-of-the-art models such as BAGEL and UniWorld-V1. UniLIP effectively expands the application scope of CLIP, enabling continuous CLIP features to not only serve as the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing tasks.

[59] Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval

Dohwan Ko,Ji Soo Lee,Minhyuk Choi,Zihang Meng,Hyunwoo J. Kim

Main category: cs.CV

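TL;DR: The paper proposes the BLiM framework, which uses bidirectional likelihood estimation and Candidate Prior Normalization (CPN) to mitigate candidate prior bias in text-video retrieval, substantially improving retrieval performance.

Details Motivation: In text-video retrieval, existing MLLM-based methods suffer from candidate prior bias: they favor candidates with high priors over candidates that are more relevant to the query.

Contribution: 1. The BLiM framework, which combines query and candidate likelihood estimation; 2. The CPN module, which calibrates scores without any training.

Method: BLiM estimates likelihoods in both directions by training the model to generate video features from text and text from video; CPN calibrates the scores to mitigate prior bias.

Result: On four benchmarks, BLiM with CPN improves R@1 by 6.4 on average over previous state-of-the-art models, effectively alleviating candidate prior bias.

Insight: The CPN module is broadly applicable to multi-modal tasks: it reduces reliance on textual priors and enhances visual understanding.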

Abstract: Text-Video Retrieval aims to find the most relevant text (or video) candidate given a video (or text) query from large-scale online databases. Recent work leverages multi-modal large language models (MLLMs) to improve retrieval, especially for long or complex query-candidate pairs. However, we observe that the naive application of MLLMs, i.e., retrieval based on candidate likelihood, introduces candidate prior bias, favoring candidates with inherently higher priors over those more relevant to the query. To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM), which leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet effective training-free score calibration module designed to mitigate candidate prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks, our BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average, effectively alleviating candidate prior bias and emphasizing query-candidate relevance. Our in-depth analysis across various multi-modal tasks beyond retrieval highlights the broad applicability of CPN which enhances visual understanding by reducing reliance on textual priors. Code is available at https://github.com/mlvlab/BLiM.
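The abstract does not state how CPN calibrates scores. One common way to remove a prior from a likelihood score is to subtract a scaled log-prior, and the sketch below assumes that form purely for illustration; the function names, the `alpha` weight, and the toy log-probabilities are hypothetical rather than the authors' definition.

```python
def cpn_scores(log_likelihoods, log_priors, alpha=1.0):
    """Calibrated score per candidate: log P(c | query) - alpha * log P(c).

    ASSUMPTION: subtracting a scaled log-prior is one plausible form of
    prior normalization; the paper may define CPN differently.
    """
    return [ll - alpha * lp for ll, lp in zip(log_likelihoods, log_priors)]

def ranking(scores):
    """Candidate indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Candidate 0 is generic: high likelihood under almost any query (high prior).
# Candidate 1 is specific: slightly lower likelihood but a much lower prior.
toy_log_lik = [-2.0, -2.5]
toy_log_prior = [-2.1, -4.0]

raw_ranking = ranking(toy_log_lik)
calibrated_ranking = ranking(cpn_scores(toy_log_lik, toy_log_prior))
```

In the toy data the generic candidate wins on raw likelihood but loses after calibration, which is the kind of reordering the abstract attributes to removing candidate prior bias.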

[60] LED Benchmark: Diagnosing Structural Layout Errors for Document Layout Analysis

Inbum Heo,Taewook Hwang,Jeesu Jung,Sangkeun Jung

Main category: cs.CV

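TL;DR: The paper proposes the Layout Error Detection (LED) benchmark for evaluating the structural robustness of document layout analysis; it defines eight standardized error types and builds a synthetic dataset, LED-Dataset. Experiments show that LED effectively differentiates models' structural understanding capabilities.

Details Motivation: Existing evaluation metrics for document layout analysis, such as IoU and mAP, focus mainly on spatial overlap and struggle to detect critical structural errors such as region merging, splitting, and missing content. A new evaluation method is therefore needed to diagnose these errors.

Contribution: 1. The LED benchmark, which defines eight standardized error types; 2. The synthetic LED-Dataset; 3. Experiments showing that LED reveals modality biases and performance trade-offs that traditional metrics cannot capture.

Method: 1. Standardize eight structural error types; 2. Design three tasks: error existence detection, error type classification, and element-wise error type classification; 3. Generate a synthetic dataset based on empirical error distributions from DLA models.

Result: Experimental results show that LED effectively differentiates the structural understanding capabilities of different models, exposing modality biases and performance trade-offs.

Insight: Traditional evaluation metrics do not adequately reflect how models handle structural errors; LED provides a more comprehensive evaluation framework.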

Abstract: Recent advancements in Document Layout Analysis through Large Language Models and Multimodal Models have significantly improved layout detection. However, despite these improvements, challenges remain in addressing critical structural errors, such as region merging, splitting, and missing content. Conventional evaluation metrics like IoU and mAP, which focus primarily on spatial overlap, are insufficient for detecting these errors. To address this limitation, we propose Layout Error Detection (LED), a novel benchmark designed to evaluate the structural robustness of document layout predictions. LED defines eight standardized error types, and formulates three complementary tasks: error existence detection, error type classification, and element-wise error type classification. Furthermore, we construct LED-Dataset, a synthetic dataset generated by injecting realistic structural errors based on empirical distributions from DLA models. Experimental results across a range of LMMs reveal that LED effectively differentiates structural understanding capabilities, exposing modality biases and performance trade-offs not visible through traditional metrics.
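A small worked example shows why overlap alone cannot name a structural error. The sketch below uses the standard box IoU on made-up coordinates: one predicted box that merges two stacked paragraphs scores 0.5 against each ground-truth paragraph, yet nothing in that number identifies the error as a merge. The box values are illustrative and not drawn from LED-Dataset.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two ground-truth paragraphs stacked vertically, and a prediction merging them.
gt_paragraph_1 = (0, 0, 100, 50)
gt_paragraph_2 = (0, 50, 100, 100)
merged_prediction = (0, 0, 100, 100)

iou_with_first = iou(merged_prediction, gt_paragraph_1)
iou_with_second = iou(merged_prediction, gt_paragraph_2)
```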

[61] Training-free Geometric Image Editing on Diffusion Models

Hanshen Zhu,Zhen Zhu,Kaile Zhang,Yiming Gong,Yuliang Liu,Xiang Bai

Main category: cs.CV

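TL;DR: The paper proposes a decoupled geometric image editing pipeline built around FreeFine, which improves image fidelity and edit precision by separating the task into three steps: object transformation, source region inpainting, and target region refinement.

Details Motivation: Existing diffusion-based image editing methods usually try to handle all relevant subtasks in a single step, which works poorly for large or structurally complex transformations. The paper aims to solve this problem.

Contribution: 1) A decoupled geometric image editing pipeline; 2) FreeFine, a training-free diffusion approach; 3) The new GeoBench benchmark, covering both 2D and 3D editing scenarios.

Method: The method has three steps: object transformation, source region inpainting (implemented with FreeFine), and target region refinement (also implemented with FreeFine). FreeFine performs the editing with a training-free diffusion approach.

Result: On GeoBench, FreeFine outperforms existing methods in image fidelity and edit precision, especially under complex transformations.

Insight: Decoupling the editing steps can markedly improve results on complex transformations, and a training-free diffusion approach offers advantages in efficiency and quality.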

Abstract: We tackle the task of geometric image editing, where an object within an image is repositioned, reoriented, or reshaped while preserving overall scene coherence. Previous diffusion-based editing methods often attempt to handle all relevant subtasks in a single step, which proves difficult when transformations become large or structurally complex. We address this by proposing a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement. Both inpainting and refinement are implemented using a training-free diffusion approach, FreeFine. In experiments on our new GeoBench benchmark, which contains both 2D and 3D editing scenarios, FreeFine outperforms state-of-the-art alternatives in image fidelity and edit precision, especially under demanding transformations. Code and benchmark are available at: https://github.com/CIawevy/FreeFine
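The decoupling can be made concrete with binary masks. The sketch below covers only the bookkeeping for a pure translation: it moves an object mask, then derives the vacated source region (which the pipeline would inpaint) and the target region (which it would refine). The diffusion steps themselves are not shown, and the function names and toy mask are assumptions rather than FreeFine's interface.

```python
def shift_mask(mask, dx, dy):
    """Translate a binary mask (list of rows) by dx columns and dy rows."""
    height, width = len(mask), len(mask[0])
    shifted = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_y, new_x = y + dy, x + dx
            # Pixels moved outside the image are dropped.
            if mask[y][x] and 0 <= new_y < height and 0 <= new_x < width:
                shifted[new_y][new_x] = 1
    return shifted

def edit_regions(mask, dx, dy):
    """Return (source_hole, target_region) for a translated object.

    source_hole: pixels the object vacated, to be filled by inpainting.
    target_region: pixels the object now covers, to be refined.
    """
    target = shift_mask(mask, dx, dy)
    source_hole = [[int(m and not t) for m, t in zip(mask_row, target_row)]
                   for mask_row, target_row in zip(mask, target)]
    return source_hole, target

# A 1x4 image whose object occupies the two left pixels, moved right by one.
hole, target = edit_regions([[1, 1, 0, 0]], dx=1, dy=0)
```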

[62] ST-SAM: SAM-Driven Self-Training Framework for Semi-Supervised Camouflaged Object Detection

Xihang Hu,Fuming Sun,Jiazhe Liu,Feilong Xu,Xiaoli Zhang

Main category: cs.CV

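TL;DR: ST-SAM is a concise self-training framework that dynamically filters high-confidence pseudo-labels and harnesses the potential of the SAM model, greatly reducing the reliance of semi-supervised camouflaged object detection on annotated data.

Details Motivation: Existing semi-supervised camouflaged object detection methods depend on complex multi-network structures and suffer from prediction bias and high computational overhead. ST-SAM aims to solve these problems by combining self-training with SAM.

Contribution: 1. A single-model self-training framework that avoids the prediction bias of conventional teacher-student models; 2. Hybrid prompts built from pseudo-labels for SAM, which reduce error accumulation in self-training.

Method: 1. Dynamically filter high-confidence pseudo-labels; 2. Transform the pseudo-labels into hybrid prompts and use them with SAM for the specialized task.

Result: With only 1% labeled data, ST-SAM outperforms existing semi-supervised methods and even matches fully supervised methods.

Insight: SAM's capabilities can effectively mitigate error accumulation in semi-supervised learning, and the single-model architecture improves computational efficiency and scalability.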

Abstract: Semi-supervised Camouflaged Object Detection (SSCOD) aims to reduce reliance on costly pixel-level annotations by leveraging limited annotated data and abundant unlabeled data. However, existing SSCOD methods based on Teacher-Student frameworks suffer from severe prediction bias and error propagation under scarce supervision, while their multi-network architectures incur high computational overhead and limited scalability. To overcome these limitations, we propose ST-SAM, a highly annotation-efficient yet concise framework that breaks away from conventional SSCOD constraints. Specifically, ST-SAM employs a Self-Training strategy that dynamically filters and expands high-confidence pseudo-labels to enhance a single-model architecture, thereby fundamentally circumventing inter-model prediction bias. Furthermore, by transforming pseudo-labels into hybrid prompts containing domain-specific knowledge, ST-SAM effectively harnesses the Segment Anything Model’s potential for specialized tasks to mitigate error accumulation in self-training. Experiments on COD benchmark datasets demonstrate that ST-SAM achieves state-of-the-art performance with only 1% labeled data, outperforming existing SSCOD methods and even matching fully supervised methods. Remarkably, ST-SAM requires training only a single network, without relying on specific models or loss functions. This work establishes a new paradigm for annotation-efficient SSCOD. Code will be available at https://github.com/hu-xh/ST-SAM.
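The pseudo-label filtering step of self-training can be sketched as follows. The abstract does not say how confidence is measured or thresholded, so this example assumes a simple criterion: the mean distance of per-pixel foreground probabilities from 0.5, compared against a fixed threshold. The function names, the 0.8 threshold, and the toy predictions are all hypothetical.

```python
def mask_confidence(prob_map):
    """Mean per-pixel confidence in [0, 1]: 0 at p = 0.5, 1 at p = 0 or 1.

    ASSUMPTION: this is an illustrative criterion, not the paper's.
    """
    flat = [p for row in prob_map for p in row]
    return sum(abs(p - 0.5) * 2 for p in flat) / len(flat)

def select_pseudo_labels(prob_maps, threshold=0.8):
    """Keep only confident predictions and binarize them into pseudo-labels."""
    selected = {}
    for name, prob_map in prob_maps.items():
        if mask_confidence(prob_map) >= threshold:
            selected[name] = [[int(p >= 0.5) for p in row] for row in prob_map]
    return selected

# Two unlabeled images: one with a confident prediction, one ambiguous.
toy_predictions = {
    "image_a": [[0.95, 0.05], [0.90, 0.10]],
    "image_b": [[0.60, 0.40], [0.55, 0.50]],
}
pseudo_labels = select_pseudo_labels(toy_predictions)
```

In a self-training loop, the selected pseudo-labels would be added to the training set and the selection repeated as the model improves, which is one way to read the abstract's "filters and expands".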

[63] PriorFusion: Unified Integration of Priors for Robust Road Perception in Autonomous Driving

Xuewei Tang,Mengmeng Yang,Tuopu Wen,Peijin Jia,Le Cui,Mingshang Luo,Kehua Sheng,Bo Zhang,Diange Yang,Kun Jiang

Main category: cs.CV

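TL;DR: PriorFusion is a unified framework that integrates semantic, geometric, and generative priors to improve road element perception in autonomous driving. Its key contributions include an attention mechanism guided by shape priors and a diffusion model that generates accurate predictions.

Details Motivation: In complex environments without high-definition map support, existing methods fail to fully exploit the structured priors of road elements, which leads to irregular and inaccurate predictions.

Contribution: 1) The unified PriorFusion framework; 2) An attention mechanism guided by shape priors; 3) A data-driven shape template space; 4) A diffusion model that generates accurate predictions.

Method: 1) An instance-aware attention mechanism that incorporates shape priors; 2) A low-dimensional shape template space that is clustered to produce anchor priors; 3) A diffusion model that uses these priors to generate complete predictions.

Result: On large-scale datasets, PriorFusion significantly improves the accuracy of road element perception, particularly in challenging conditions.

Insight: Integrating multiple priors (semantic, geometric, and generative) effectively addresses inaccuracy and fragmentation in road element perception, giving autonomous driving more reliable perception.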

Abstract: With the growing interest in autonomous driving, there is an increasing demand for accurate and reliable road perception technologies. In complex environments without high-definition map support, autonomous vehicles must independently interpret their surroundings to ensure safe and robust decision-making. However, these scenarios pose significant challenges due to the large number, complex geometries, and frequent occlusions of road elements. A key limitation of existing approaches lies in their insufficient exploitation of the structured priors inherently present in road elements, resulting in irregular, inaccurate predictions. To address this, we propose PriorFusion, a unified framework that effectively integrates semantic, geometric, and generative priors to enhance road element perception. We introduce an instance-aware attention mechanism guided by shape-prior features, then construct a data-driven shape template space that encodes low-dimensional representations of road elements, enabling clustering to generate anchor points as reference priors. We design a diffusion-based framework that leverages these prior anchors to generate accurate and complete predictions. Experiments on large-scale autonomous driving datasets demonstrate that our method significantly improves perception accuracy, particularly under challenging conditions. Visualization results further confirm that our approach produces more accurate, regular, and coherent predictions of road elements.

[64] Forgetting of task-specific knowledge in model merging-based continual learning

Timm Hess,Gido M van de Ven,Tinne Tuytelaars

Main category: cs.CV

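TL;DR: The paper studies linear model merging in continual learning and finds that merging mainly preserves or enhances shared knowledge while task-specific knowledge degrades rapidly; merging incrementally trained models works better than merging models trained in parallel.

Details Motivation: To examine knowledge retention and degradation under model merging in continual learning, in particular how shared and task-specific knowledge behave, in order to improve merging strategies.

Contribution: Shows that model merging retains shared knowledge while degrading task-specific knowledge, and confirms the advantage of merging incrementally trained models.

Method: Computer vision experiments that use linear merging to compare merged incrementally trained models against merged parallel-trained models.

Result: Merging enhances shared knowledge, but task-specific knowledge degrades rapidly; merging incrementally trained models performs better.

Insight: Merging strategies should favor incremental training to better retain knowledge and reduce the loss of task-specific knowledge.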

Abstract: This paper investigates the linear merging of models in the context of continual learning (CL). Using controlled visual cues in computer vision experiments, we demonstrate that merging largely preserves or enhances shared knowledge, while unshared task-specific knowledge rapidly degrades. We further find that merging models from an incremental training process consistently outperforms merging models trained in parallel.
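
As a minimal sketch of what linear merging means here, the snippet below averages the parameters of two models stored as plain dictionaries. The function name `merge_linear`, the toy parameter names, and the uniform default weights are illustrative assumptions, not the authors' code.

```python
def merge_linear(state_dicts, weights=None):
    """Weighted average of parameter dicts that share keys and shapes.

    Assumption: each parameter is a flat list of floats; real models
    would hold tensors, but the arithmetic is the same.
    """
    n = len(state_dicts)
    if weights is None:
        weights = [1.0 / n] * n  # uniform merging coefficients (assumed)
    merged = {}
    for name in state_dicts[0]:
        length = len(state_dicts[0][name])
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(length)
        ]
    return merged


# Two toy "models" with hypothetical parameter names.
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = merge_linear([model_a, model_b])
```

With uniform weights this reduces to a plain parameter average; non-uniform weights would bias the merged model toward one of the inputs.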

[65] The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models

Alfio Ferrara,Sergio Picascia,Elisabetta Rocchetti

Main category: cs.CV

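TL;DR: The paper studies how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. It finds that the models can separate content from style to some extent, but the degree of separation depends on the specific artistic prompt and the style requested.

Details Motivation: Text-to-image diffusion models excel at generating artistic content, yet how they internally represent concepts such as content and style remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training.

Contribution: Uses cross-attention heatmaps to analyze which regions of generated images are influenced by content-describing and style-describing tokens, revealing how well diffusion models separate content from style under different artistic prompts and style requests.

Method: The authors use cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, isolating the image regions influenced by content-describing tokens from those influenced by style-describing tokens.

Result: Diffusion models show varying degrees of content-style separation when generating artworks. In many cases content tokens mainly influence object-related regions while style tokens affect background and texture regions, suggesting an emergent understanding of the content-style distinction.

Insight: The study offers a new perspective on how large-scale generative models represent complex artistic concepts without explicit supervision.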

Abstract: Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.

[66] Impact of Hyperparameter Optimization on the Accuracy of Lightweight Deep Learning Models for Real-Time Image Classification

Vineet Kumar Rakesh,Soumya Mazumdar,Tapas Samanta,Sarbajit Pal,Amitabha Das

Main category: cs.CV

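TL;DR: The paper studies how hyperparameter optimization affects the accuracy of lightweight deep learning models for real-time image classification, analyzes the performance of several models experimentally, and offers tuning recommendations.

Details Motivation: Lightweight models are essential for resource-constrained real-time applications, but the effect of hyperparameter tuning on their performance has not been studied systematically. The paper aims to fill this gap.

Contribution: 1. Analyzes the hyperparameter sensitivity of seven efficient models on ImageNet-1K; 2. offers tuning recommendations (such as cosine learning rate decay and an adjustable batch size); 3. releases code and training logs.

Method: Under consistent training settings, hyperparameter ablations are run on seven models, covering learning rate, batch size, input resolution, and other factors, and real-time performance is tested in a GPU-accelerated edge deployment simulation.

Result: Cosine learning rate decay and an adjustable batch size can substantially improve accuracy and convergence speed. RepVGG-A2 exceeds 80% Top-1 accuracy while keeping inference efficient.

Insight: Tuning hyperparameters can markedly improve lightweight models for real-time use; cosine learning rate decay and batch size adjustment in particular are effective for balancing accuracy against resource cost.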

Abstract: Lightweight convolutional and transformer-based models have become vital for real-time image classification in resource-constrained applications, such as embedded systems and edge devices. This work analyzes the influence of hyperparameter adjustment on the accuracy and convergence behavior of seven efficient deep learning architectures: EfficientNetV2-S, ConvNeXt-T, MobileViT v2 (XXS/XS/S), MobileNetV3-L, TinyViT-21M, and RepVGG-A2. All models are trained on the ImageNet-1K dataset under consistent training settings, with an emphasis on real-time practicality. A comprehensive ablation study is undertaken to separate the effect of critical hyperparameters, including learning rate schedules, batch sizes, input resolution, data augmentation, regularization approaches, and optimizer choice. To assess suitability for real-time applications, each model is evaluated not only in terms of Top-1 and Top-5 classification accuracy, but also in terms of inference time, parameter count, model size, and frames-per-second (FPS) on a GPU-accelerated edge deployment simulation. Results demonstrate that cosine learning rate decay and adjustable batch size may greatly boost both accuracy and convergence speed, while keeping latency and memory cost low. Notably, RepVGG-A2 achieves over 80% Top-1 accuracy with efficient inference performance, offering a compelling balance between accuracy and deployment cost for VGG-style models. The results give practical guidance for constructing resource-efficient deep learning models appropriate for real-time image processing pipelines. All code and training logs are publicly accessible at https://github.com/VineetKumarRakesh/lcnn-opt.
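
To make the cosine learning rate decay mentioned above concrete, here is a minimal sketch of the standard schedule. The function name `cosine_lr`, the `min_lr` floor, and the example values are assumptions for illustration; the paper's actual configuration is in its repository.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Standard cosine decay from base_lr at step 0 to min_lr at total_steps."""
    # Clamp progress to [0, 1] so steps past the end stay at min_lr.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 100 steps, starting learning rate 0.1.
schedule = [cosine_lr(s, 100, 0.1) for s in range(101)]
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.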

[67] FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning

Jiajun Cao,Qizhe Zhang,Peidong Jia,Xuhui Zhao,Bo Lan,Xiaoan Zhang,Xiaobao Wei,Sixiang Chen,Zhuo Li,Yang Wang,Liyun Li,Xianming Liu,Ming Lu,Shanghang Zhang

Main category: cs.CV

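TL;DR: FastDriveVLA proposes a reconstruction-based visual token pruning framework for efficient end-to-end autonomous driving. It uses MAE-style pixel reconstruction and an adversarial reconstruction strategy to cut computational cost substantially.

Details Motivation: Existing visual token pruning methods perform poorly in autonomous driving scenarios because human drivers focus on foreground regions, which these methods do not adequately account for.

Contribution: 1. Proposes the FastDriveVLA framework and the ReconPruner token pruner; 2. designs an adversarial foreground-background reconstruction strategy; 3. introduces the nuScenes-FG dataset.

Method: ReconPruner is trained with MAE-style pixel reconstruction and an adversarial strategy so that it retains foreground information while pruning tokens.

Result: On the nuScenes closed-loop planning benchmark, the method achieves the best performance across different pruning ratios.

Insight: Foreground information is essential for autonomous driving decisions, and reconstruction-based pruning retains this key information effectively.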

Abstract: Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual tokens of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLM) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. Our approach achieves state-of-the-art results on the nuScenes closed-loop planning benchmark across different pruning ratios.
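
The pruning step itself can be sketched as keeping the highest-scoring tokens at a given ratio. In the paper the scores come from the trained ReconPruner; in this hedged sketch they are hand-written placeholder values, and `prune_tokens` is a hypothetical helper, not the released implementation.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the top-scoring fraction of tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = max(1, int(len(tokens) * keep_ratio))  # always keep at least one token
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore spatial/sequence order
    return [tokens[i] for i in keep]


# Placeholder tokens and foreground scores (assumed, for illustration only).
tokens = ["sky", "car", "road", "pedestrian"]
scores = [0.1, 0.9, 0.4, 0.8]
kept = prune_tokens(tokens, scores, keep_ratio=0.5)
```

Sorting the kept indices before gathering keeps the surviving tokens in their original sequence positions, which matters when positional information is attached to them.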

[68] FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models

Yiming Yang,Hongbin Lin,Yueru Luo,Suzhong Fu,Chao Zheng,Xinrui Yan,Shuqi Mei,Kun Tang,Shuguang Cui,Zhen Li

Main category: cs.CV

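TL;DR: FASTopoWM is a fast-slow lane segment topology reasoning framework augmented with latent world models; it substantially improves lane segment detection and centerline perception.

Details Motivation: Existing lane topology reasoning methods do not exploit temporal information effectively and are vulnerable to pose estimation failures. FASTopoWM addresses both problems with latent world models and parallel supervision.

Contribution: 1. Proposes a fast-slow parallel supervision framework that reduces the impact of pose estimation failures; 2. introduces latent query and BEV world models conditioned on the action latent to improve temporal perception.

Method: 1. Parallel supervision of the fast and slow systems; 2. latent world models that propagate state representations from past observations; 3. validation on the OpenLane-V2 benchmark.

Result: Outperforms existing methods on lane segment detection (37.4% mAP) and centerline perception (46.3% OLS).

Insight: Latent world models combined with parallel supervision markedly improve temporal perception, which is valuable for autonomous driving systems.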

Abstract: Lane segment topology reasoning provides comprehensive bird’s-eye view (BEV) road scene understanding, which can serve as a key perception module in planning-oriented end-to-end autonomous driving systems. Existing lane topology reasoning methods often fall short in effectively leveraging temporal information to enhance detection and reasoning performance. Recently, stream-based temporal propagation methods have demonstrated promising results by incorporating temporal cues at both the query and BEV levels. However, they remain limited by over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. To overcome these limitations, we propose FASTopoWM, a novel fast-slow lane segment topology reasoning framework augmented with latent world models. To reduce the impact of pose estimation failures, this unified framework enables parallel supervision of both historical and newly initialized queries, facilitating mutual reinforcement between the fast and slow systems. Furthermore, we introduce latent query and BEV world models conditioned on the action latent to propagate the state representations from past observations to the current timestep. This design substantially improves the performance of temporal perception within the slow pipeline. Extensive experiments on the OpenLane-V2 benchmark demonstrate that FASTopoWM outperforms state-of-the-art methods in both lane segment detection (37.4% vs. 33.6% on mAP) and centerline perception (46.3% vs. 41.5% on OLS).

[69] Learning Semantic Directions for Feature Augmentation in Domain-Generalized Medical Segmentation

Yingkai Wang,Yaoyao Zhu,Xiuding Cai,Yuhao Xiao,Haotian Wu,Yu Yao

Main category: cs.CV

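TL;DR: The paper proposes a domain generalization framework for medical image segmentation that uses implicit feature perturbations and an adaptive consistency constraint to improve segmentation on unseen clinical domains.

Details Motivation: Medical image segmentation is essential to clinical workflows, but variations in imaging conditions, scanner types, and acquisition protocols degrade performance on unseen domains. The paper exploits the anatomical consistency of medical images to address domain shift directly.

Contribution: 1) Proposes a domain generalization framework that improves robustness through implicit feature perturbations; 2) designs a learnable semantic direction selector and a covariance-based semantic intensity sampler; 3) introduces an adaptive consistency constraint to stabilize feature selection.

Method: A semantic direction selector and a covariance-based semantic intensity sampler modulate domain-variant features, combined with an adaptive consistency constraint that keeps segmentation performance stable.

Result: On two public multi-center benchmarks, the method consistently outperforms existing domain generalization approaches and achieves robust segmentation across clinical domains.

Insight: The anatomical consistency of medical images is a distinctive advantage for domain generalization; feature perturbation and adaptive constraints can handle domain variation while preserving task-relevant consistency.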

Abstract: Medical image segmentation plays a crucial role in clinical workflows, but domain shift often leads to performance degradation when models are applied to unseen clinical domains. This challenge arises due to variations in imaging conditions, scanner types, and acquisition protocols, limiting the practical deployment of segmentation models. Unlike natural images, medical images typically exhibit consistent anatomical structures across patients, with domain-specific variations mainly caused by imaging conditions. This unique characteristic makes medical image segmentation particularly challenging. To address this challenge, we propose a domain generalization framework tailored for medical image segmentation. Our approach improves robustness to domain-specific variations by introducing implicit feature perturbations guided by domain statistics. Specifically, we employ a learnable semantic direction selector and a covariance-based semantic intensity sampler to modulate domain-variant features while preserving task-relevant anatomical consistency. Furthermore, we design an adaptive consistency constraint that is selectively applied only when feature adjustment leads to degraded segmentation performance. This constraint encourages the adjusted features to align with the original predictions, thereby stabilizing feature selection and improving the reliability of the segmentation. Extensive experiments on two public multi-center benchmarks show that our framework consistently outperforms existing domain generalization approaches, achieving robust and generalizable segmentation performance across diverse clinical domains.

[70] Contrastive Learning-Driven Traffic Sign Perception: Multi-Modal Fusion of Text and Vision

Qiang Lu,Waikit Xiu,Xiying Li,Shenyu Hu,Shengbo Sun

Main category: cs.CV

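TL;DR: The paper proposes a two-stage framework combining open-vocabulary detection and cross-modal learning to address long-tail distributions and multi-scale feature extraction for small targets in traffic sign recognition, achieving state-of-the-art results on the TT100K dataset.

Details Motivation: Traffic sign recognition currently faces two major challenges: long-tail dataset distributions and multi-scale feature extraction for small targets. Both degrade the performance of traditional convolutional networks on low-frequency and out-of-distribution classes.

Contribution: Proposes a two-stage framework consisting of the NanoVerse YOLO model for detection and the TSR-MCL model for classification, with contrastive learning used to make features more robust.

Method: 1. NanoVerse YOLO integrates the RepVL-PAN and SPD-Conv modules to improve small-target detection; 2. TSR-MCL improves classification through contrastive learning between visual features (Vision Transformer) and semantic features (rule-based BERT).

Result: On the TT100K dataset, the model reaches 78.4% mAP on the long-tail detection task, with 91.8% accuracy and 88.9% recall, significantly outperforming mainstream algorithms.

Insight: Cross-modal contrastive learning effectively mitigates class confusion caused by data imbalance, and combining visual with semantic features improves generalization.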

Abstract: Traffic sign recognition, as a core component of autonomous driving perception systems, directly influences vehicle environmental awareness and driving safety. Current technologies face two significant challenges: first, the traffic sign dataset exhibits a pronounced long-tail distribution, resulting in a substantial decline in recognition performance of traditional convolutional networks when processing low-frequency and out-of-distribution classes; second, traffic signs in real-world scenarios are predominantly small targets with significant scale variations, making it difficult to extract multi-scale features. To overcome these issues, we propose a novel two-stage framework combining open-vocabulary detection and cross-modal learning. For traffic sign detection, our NanoVerse YOLO model integrates a reparameterizable vision-language path aggregation network (RepVL-PAN) and an SPD-Conv module to specifically enhance feature extraction for small, multi-scale targets. For traffic sign classification, we designed a Traffic Sign Recognition Multimodal Contrastive Learning model (TSR-MCL). By contrasting visual features from a Vision Transformer with semantic features from a rule-based BERT, TSR-MCL learns robust, frequency-independent representations, effectively mitigating class confusion caused by data imbalance. On the TT100K dataset, our method achieves a state-of-the-art 78.4% mAP in the long-tail detection task for all-class recognition. The model also obtains 91.8% accuracy and 88.9% recall, significantly outperforming mainstream algorithms and demonstrating superior accuracy and generalization in complex, open-world scenarios.

[71] MagicRoad: Semantic-Aware 3D Road Surface Reconstruction via Obstacle Inpainting

Xingyue Peng,Yuandong Lyu,Lang Zhang,Jian Zhu,Songtao Wang,Jiaxin Deng,Songxin Lu,Weiliang Ma,Dangen She,Peng Jia,XianPeng Lang

Main category: cs.CV

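TL;DR: MagicRoad proposes a semantic-aware 3D road surface reconstruction framework that uses obstacle inpainting and semantic-guided color enhancement to make road reconstruction more robust and consistent in complex urban environments.

Details Motivation: Existing methods work well under clean, static conditions but struggle in real-world scenes with dynamic occlusions, static obstacles, and lighting changes, so a more robust reconstruction framework is needed.

Contribution: 1. Proposes a planar-adapted 2D Gaussian surfel representation; 2. incorporates semantic-guided video inpainting to remove dynamic and static obstacles; 3. performs semantic-aware color enhancement in HSV space.

Method: 1. A planar-adapted Gaussian surface representation for efficient large-scale modeling; 2. segmentation-guided video inpainting to remove occlusions; 3. semantic-aware color correction in HSV space.

Result: On urban-scale datasets, the method significantly outperforms prior methods in visual coherence and geometric fidelity.

Insight: Semantic information (such as segmentation and color correction) is key to improving road reconstruction quality, especially in complex environments.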

Abstract: Road surface reconstruction is essential for autonomous driving, supporting centimeter-accurate lane perception and high-definition mapping in complex urban environments. While recent methods based on mesh rendering or 3D Gaussian splatting (3DGS) achieve promising results under clean and static conditions, they remain vulnerable to occlusions from dynamic agents, visual clutter from static obstacles, and appearance degradation caused by lighting and weather changes. We present a robust reconstruction framework that integrates occlusion-aware 2D Gaussian surfels with semantic-guided color enhancement to recover clean, consistent road surfaces. Our method leverages a planar-adapted Gaussian representation for efficient large-scale modeling, employs segmentation-guided video inpainting to remove both dynamic and static foreground objects, and enhances color coherence via semantic-aware correction in HSV space. Extensive experiments on urban-scale datasets demonstrate that our framework produces visually coherent and geometrically faithful reconstructions, significantly outperforming prior methods under real-world conditions.

[72] The Impact of Image Resolution on Face Detection: A Comparative Analysis of MTCNN, YOLOv XI and YOLOv XII models

Ahmet Can Ömercikoğlu,Mustafa Mansur Yönügül,Pakize Erdoğmuş

Main category: cs.CV

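TL;DR: The paper compares the face detection performance of MTCNN, YOLOv11, and YOLOv12 at different resolutions. YOLOv11 performs best at higher resolutions, YOLOv12 has slightly higher recall, and MTCNN lags in real-time speed.

Details Motivation: Low-resolution images in real-world conditions degrade face detection performance, so the effect of resolution on model performance needs to be studied.

Contribution: Systematically evaluates three prominent face detection models at different resolutions, providing a basis for model selection in practice.

Method: Using the WIDER FACE dataset, the models are tested at several resolutions (160x160 to 640x640) on precision, recall, mAP50, and other metrics.

Result: YOLOv11 performs best at higher resolutions, YOLOv12 has higher recall, and MTCNN falls short in real-time speed but is competitive in landmark localization.

Insight: Resolution has a significant effect on model performance, so the model should be chosen to fit the operational requirements.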

Abstract: Face detection is a crucial component in many AI-driven applications such as surveillance, biometric authentication, and human-computer interaction. However, real-world conditions like low-resolution imagery present significant challenges that degrade detection performance. In this study, we systematically investigate the impact of input resolution on the accuracy and robustness of three prominent deep learning-based face detectors: YOLOv11, YOLOv12, and MTCNN. Using the WIDER FACE dataset, we conduct extensive evaluations across multiple image resolutions (160x160, 320x320, and 640x640) and assess each model’s performance using metrics such as precision, recall, mAP50, mAP50-95, and inference time. Results indicate that YOLOv11 outperforms YOLOv12 and MTCNN in terms of detection accuracy, especially at higher resolutions, while YOLOv12 exhibits slightly better recall. MTCNN, although competitive in landmark localization, lags in real-time inference speed. Our findings provide actionable insights for selecting resolution-aware face detection models suitable for varying operational constraints.

[73] IN45023 Neural Network Design Patterns in Computer Vision Seminar Report, Summer 2025

Radu-Andrei Bourceanu,Neil De La Fuente,Jan Grimm,Andrei Jardan,Andriy Manucharyan,Cornelius Weiss,Roman Pflugfelder

Main category: cs.CV

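TL;DR: The report analyzes the evolution of key design patterns across six influential computer vision papers, covering residual connections (ResNet), the Vision Transformer (ViT), Generative Adversarial Networks (GANs), Latent Diffusion Models (LDMs), and the self-supervised learning techniques DINO and MAE.

Details Motivation: To trace how design patterns in computer vision have evolved, from convolutional networks to attention-based models and on to generative models and self-supervised learning, in order to understand the core drivers of progress.

Contribution: 1. Systematically reviews the design patterns of six key computer vision papers; 2. summarizes the architectural innovations and technical advances from ResNet to MAE.

Method: Analyzes the core method of each of six representative papers: ResNet (residual connections), ViT (Vision Transformer), GANs (adversarial training), LDMs (denoising in latent space), DINO (self-distillation), and MAE (masked autoencoding).

Result: ResNet and ViT advanced visual representation learning, GANs and LDMs raised the quality of generative models, and DINO and MAE are effective at reducing dependence on labels.

Insight: 1. Residual connections and attention mechanisms are key to training deep networks; 2. latent diffusion models are more efficient for generative tasks; 3. self-supervised learning opens new directions for pre-training large-scale models.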

Abstract: This report analyzes the evolution of key design patterns in computer vision by examining six influential papers. The analysis begins with foundational architectures for image recognition. We review ResNet, which introduced residual connections to overcome the vanishing gradient problem and enable effective training of significantly deeper convolutional networks. Subsequently, we examine the Vision Transformer (ViT), which established a new paradigm by applying the Transformer architecture to sequences of image patches, demonstrating the efficacy of attention-based models for large-scale image recognition. Building on these visual representation backbones, we investigate generative models. Generative Adversarial Networks (GANs) are analyzed for their novel adversarial training process, which challenges a generator against a discriminator to learn complex data distributions. Then, Latent Diffusion Models (LDMs) are covered, which improve upon prior generative methods by performing a sequential denoising process in a perceptually compressed latent space. LDMs achieve high-fidelity synthesis with greater computational efficiency, representing the current state-of-the-art for image generation. Finally, we explore self-supervised learning techniques that reduce dependency on labeled data. DINO is a self-distillation framework in which a student network learns to match the output of a momentum-updated teacher, yielding features with strong k-NN classification performance. We conclude with Masked Autoencoders (MAE), which utilize an asymmetric encoder-decoder design to reconstruct heavily masked inputs, providing a highly scalable and effective method for pre-training large-scale vision models.

[74] Short-LVLM: Compressing and Accelerating Large Vision-Language Models by Pruning Redundant Layers

Ji Ma,Wei Suo,Peng Wang,Yanning Zhang

Main category: cs.CV

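TL;DR: The paper proposes the Short-LVLM (SVL) framework, which compresses and accelerates large vision-language models (LVLMs) by pruning redundant layers. It addresses the ineffectiveness of directly applying NLP layer pruning techniques and achieves a favorable trade-off between performance and efficiency.

Details Motivation: Large vision-language models (LVLMs) perform impressively, but their parameter counts and computational costs limit practical use. The paper explores an efficient, training-free compression approach.

Contribution: 1. Proposes the Short-LVLM framework, which uses important vision-language (VL) tokens and mitigates inter-layer feature gaps; 2. identifies why directly applying NLP layer pruning techniques is ineffective; 3. achieves training-free, model-agnostic, and efficient model compression.

Method: Based on an analysis of non-essential VL tokens and inter-layer feature gaps, the Short-LVLM framework prunes layers by combining token importance assessment with mitigation of inter-layer feature gaps.

Result: Short-LVLM achieves a superior trade-off between performance and efficiency and is highly compatible without any additional training.

Insight: The modality divergence in vision-language models makes direct transfer of NLP pruning techniques ineffective, while better use of tokens and mitigation of inter-layer feature gaps substantially improve pruning.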

Abstract: Although large vision-language models (LVLMs) have demonstrated impressive capabilities in multi-modal understanding and reasoning, their practical applications are still limited by massive model parameters and high computational costs. Recent efforts from natural language processing (NLP) have shown the effectiveness of layer pruning, offering a plausible training-free compression solution. However, due to the modality divergence between vision and language, it is unclear whether these NLP techniques are still effective in LVLMs. In this paper, we empirically prove that directly applying these layer pruning methods to LVLMs is ineffective. Through extensive experiments, we find that non-essential vision-language (VL) tokens and inter-layer feature gaps pose critical challenges to pruning layers in LVLMs. Based on these insights, we propose a novel framework Short-LVLM (SVL) that can utilize important VL tokens and mitigate the layer-wise feature gaps. Notably, Short-LVLM not only achieves a superior trade-off between performance and efficiency but also exhibits several potential advantages, i.e., training-free, model-agnostic, and highly compatible. The code for this work is publicly available at https://github.com/ASGO-MM/Short-LVLM.

[75] Multi-Prompt Progressive Alignment for Multi-Source Unsupervised Domain Adaptation

Haoran Chen,Zexiao Wang,Haidong Cao,Zuxuan Wu,Yu-Gang Jiang

Main category: cs.CV

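TL;DR: Proposes MP²A, a CLIP-based method for multi-source unsupervised domain adaptation that uses a progressive alignment strategy to reduce the influence of noisy samples and improve the learning of domain-invariant features.

Details Motivation: Existing methods align all pseudo-labeled data at once and are therefore vulnerable to noisy and hard-to-classify samples, which causes error propagation and suboptimal learning; the problem is worse in the multi-source setting.

Contribution: Proposes MP²A, which uses a progressive alignment strategy that gradually introduces harder samples, effectively mitigating confirmation bias and making domain-invariant feature learning more robust.

Method: The model is first trained on a high-confidence subset of target samples, and more challenging samples are then introduced gradually so that alignment is refined progressively.

Result: On the ImageCLEF, Office-Home, and DomainNet benchmarks, MP²A outperforms existing CLIP-based multi-source unsupervised domain adaptation methods.

Insight: Progressive alignment substantially reduces the negative effect of noisy samples and improves the stability and performance of domain adaptation.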

Abstract: Large Vision-Language Models like CLIP have become a powerful foundation for Unsupervised Domain Adaptation due to their strong zero-shot generalization. State-of-the-art methods typically leverage CLIP to generate pseudo-labels for the target domain, then fine-tune the model to learn domain-invariant features. However, these methods attempt to align source and target domains using all pseudo-labeled data simultaneously. This one-shot alignment struggles with noisy, hard-to-classify samples, leading to error propagation and suboptimal feature learning. The problem is further amplified in the multi-source scenario, where diverse domain gaps and varying noise levels across multiple source domains further destabilize the alignment process. To address this issue, in this work, we propose a progressive alignment strategy for adapting CLIP to unlabeled downstream tasks. Our method begins by training the model on a high-confidence subset of target samples, allowing it to first learn a well-aligned representation from the most reliable data. As training progresses, it gradually incorporates more challenging samples, guiding the model to refine its understanding without being overwhelmed by initial label noise. This progressive approach effectively mitigates confirmation bias and promotes a more robust convergence, allowing for the learning of genuinely domain-invariant features. We name our approach MP^2A and test it on three popular UDA benchmarks, namely ImageCLEF, Office-Home, and the most challenging DomainNet. Experiments showcase that MP^2A achieves state-of-the-art performance when compared with the most recent CLIP-based MS-UDA approaches, demonstrating the effectiveness of our approach.
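
The progressive idea can be sketched as a confidence threshold that relaxes over training, so that only the most reliable pseudo-labeled samples are used at first. The linear schedule, the 0.9 and 0.5 endpoints, and the helper names below are assumptions chosen for illustration; the abstract does not specify how MP^2A schedules sample inclusion.

```python
def confidence_threshold(epoch, total_epochs, start=0.9, end=0.5):
    """Linearly relax the pseudo-label confidence threshold (assumed schedule)."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return start + (end - start) * frac


def select_samples(confidences, threshold):
    """Indices of target samples whose pseudo-label confidence passes the threshold."""
    return [i for i, c in enumerate(confidences) if c >= threshold]


# Hypothetical pseudo-label confidences for four target samples.
confidences = [0.95, 0.6, 0.8, 0.4]
early = select_samples(confidences, confidence_threshold(0, 5))
late = select_samples(confidences, confidence_threshold(4, 5))
```

Early epochs train on the single most confident sample here, while later epochs admit the harder ones; the lowest-confidence sample is never admitted under these example endpoints.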

[76] NeRF Is a Valuable Assistant for 3D Gaussian Splatting

Shuangkang Fang,I-Chao Shen,Takeo Igarashi,Yufeng Wang,ZeSheng Wang,Yi Yang,Wenrui Ding,Shuchang Zhou

Main category: cs.CV

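TL;DR: The paper proposes NeRF-GS, a framework that combines the strengths of NeRF and 3DGS and improves 3D scene representation through joint optimization.

Details Motivation: To address limitations of 3D Gaussian Splatting (3DGS), including sensitivity to Gaussian initialization, limited spatial awareness, and weak inter-Gaussian correlations.

Contribution: Proposes the NeRF-GS framework, which uses NeRF's continuous spatial representation to improve 3DGS and optimizes the two representations in a complementary way.

Method: Jointly optimizes NeRF and 3DGS, aligns the two through shared 3D spatial information, and optimizes residual vectors for both implicit features and Gaussian positions.

Result: Outperforms existing methods on benchmark datasets and achieves state-of-the-art performance.

Insight: NeRF and 3DGS are complementary rather than competing, which points to hybrid approaches that combine the two.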

Abstract: We introduce NeRF-GS, a novel framework that jointly optimizes Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). This framework leverages the inherent continuous spatial representation of NeRF to mitigate several limitations of 3DGS, including sensitivity to Gaussian initialization, limited spatial awareness, and weak inter-Gaussian correlations, thereby enhancing its performance. In NeRF-GS, we revisit the design of 3DGS and progressively align its spatial features with NeRF, enabling both representations to be optimized within the same scene through shared 3D spatial information. We further address the formal distinctions between the two approaches by optimizing residual vectors for both implicit features and Gaussian positions to enhance the personalized capabilities of 3DGS. Experimental results on benchmark datasets show that NeRF-GS surpasses existing methods and achieves state-of-the-art performance. This outcome confirms that NeRF and 3DGS are complementary rather than competing, offering new insights into hybrid approaches that combine 3DGS and NeRF for efficient 3D scene representation.

[77] AGA: An adaptive group alignment framework for structured medical cross-modal representation learning

Wei Li,Xun Gong,Jiao Li,Xiaobin Sun

Main category: cs.CV

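TL;DR: AGA proposes an adaptive group alignment framework that uses a bidirectional grouping mechanism and threshold gating modules to capture structured semantics and to make contrastive learning workable on small datasets in medical cross-modal representation learning.

Details Motivation: Current medical vision-language pretraining methods often reduce clinical reports to single entities or fragmented tokens, ignoring their inherent structure, and contrastive learning depends on large numbers of negative samples, which is impractical for small-scale medical data.

Contribution: Proposes the Adaptive Grouped Alignment (AGA) framework, which uses a bidirectional grouping mechanism and dynamically learned threshold gating modules to capture structured semantics and to align within groups without external negative samples.

Method: AGA computes fine-grained similarities for each image-text pair, forms visual and language groups dynamically through the threshold gating modules, and aligns within groups using an Instance Aware Group Alignment (IAGA) loss.

Result: On public and private datasets, AGA performs strongly on image-text retrieval and classification under both fine-tuning and zero-shot settings.

Insight: Fine-grained alignment of structured semantics is essential in medical cross-modal learning, and a dynamic threshold mechanism adapts well to small-scale data.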

Abstract: Learning medical visual representations from paired images and reports is a promising direction in representation learning. However, current vision-language pretraining methods in the medical domain often simplify clinical reports into single entities or fragmented tokens, ignoring their inherent structure. In addition, contrastive learning frameworks typically depend on large quantities of hard negative samples, which is impractical for small-scale medical datasets. To tackle these challenges, we propose Adaptive Grouped Alignment (AGA), a new framework that captures structured semantics from paired medical images and reports. AGA introduces a bidirectional grouping mechanism based on a sparse similarity matrix. For each image-report pair, we compute fine-grained similarities between text tokens and image patches. Each token selects its top-matching patches to form a visual group, and each patch selects its most related tokens to form a language group. To enable adaptive grouping, we design two threshold gating modules, called Language Grouped Threshold Gate and Vision Grouped Threshold Gate, which learn grouping thresholds dynamically. Group representations are computed as weighted averages based on similarity scores. To align each token with its group representation, we introduce an Instance Aware Group Alignment loss that operates within each image-text pair, removing the need for external negatives. Finally, a Bidirectional Cross-modal Grouped Alignment module is applied to enhance fine-grained alignment between visual and linguistic group representations. Extensive experiments on public and private datasets show that our method achieves strong performance on image-text retrieval and classification tasks under both fine-tuning and zero-shot settings.
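
The grouping step described above (a token selects its best-matching patches, and the group representation is a similarity-weighted average) can be sketched as follows. The fixed `threshold` stands in for the learned threshold gates, the similarities are assumed positive, and all names and values are hypothetical.

```python
def group_representation(similarities, patch_features, threshold):
    """Similarity-weighted average of the patches one token selects.

    Assumptions: `similarities` holds one positive score per patch for a
    single text token, and a fixed threshold replaces the learned gate.
    """
    selected = [(s, f) for s, f in zip(similarities, patch_features) if s > threshold]
    if not selected:
        return None  # no patch passed the gate for this token
    total = sum(s for s, _ in selected)
    dim = len(selected[0][1])
    return [sum(s * f[d] for s, f in selected) / total for d in range(dim)]


# One token compared against three image patches with 2-D toy features.
sims = [0.75, 0.25, 0.1]
feats = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
visual_group = group_representation(sims, feats, threshold=0.2)
```

The same routine applied with the roles of tokens and patches swapped would give the language group for a patch.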

[78] Out-of-Distribution Detection in Medical Imaging via Diffusion Trajectories

Lemar Abdi,Francisco Caetano,Amaan Valiuddin,Christiaan Viviers,Hamdi Joudeh,Fons van der Sommen

Main category: cs.CV

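TL;DR: The paper proposes an unsupervised anomaly detection method based on a Stein score-based denoising diffusion model (SBDDM). It scores anomalies efficiently and accurately from forward diffusion trajectories using only five diffusion steps, substantially reduces computational cost, and achieves state-of-the-art results on multiple medical imaging OOD detection benchmarks.

Details Motivation: In medical imaging, pathological cases can have extremely low incidence, and existing generative approaches usually rely on likelihood estimation or reconstruction error, which is computationally expensive, unreliable, and requires retraining when the data distribution changes. An efficient and robust unsupervised anomaly detection method is therefore needed.

Contribution: 1. Proposes a reconstruction-free OOD detection method based on SBDDM; 2. scores anomalies efficiently from trajectory curvature using only five diffusion steps; 3. outperforms existing methods on multiple Near-OOD and Far-OOD benchmarks, with relative improvements of up to 10.43% and 18.10% respectively.

Method: A pre-trained SBDDM captures the curvature of forward diffusion trajectories via the estimated Stein score, giving accurate anomaly scores with only five diffusion steps and without reconstructing the input.

Result: On medical imaging datasets, SBDDM achieves state-of-the-art Near-OOD and Far-OOD detection at substantially lower computational cost, making it suitable for real-time computer-aided diagnosis.

Insight: The trajectory curvature of a diffusion model can serve as an efficient anomaly score, which offers a new approach to unsupervised anomaly detection in medical imaging and shows that a single pre-trained model generalizes across multiple benchmarks.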

Abstract: In medical imaging, unsupervised out-of-distribution (OOD) detection offers an attractive approach for identifying pathological cases with extremely low incidence rates. In contrast to supervised methods, OOD-based approaches function without labels and are inherently robust to data imbalances. Current generative approaches often rely on likelihood estimation or reconstruction error, but these methods can be computationally expensive, unreliable, and require retraining if the inlier data changes. These limitations hinder their ability to distinguish nominal from anomalous inputs efficiently, consistently, and robustly. We propose a reconstruction-free OOD detection method that leverages the forward diffusion trajectories of a Stein score-based denoising diffusion model (SBDDM). By capturing trajectory curvature via the estimated Stein score, our approach enables accurate anomaly scoring with only five diffusion steps. A single SBDDM pre-trained on a large, semantically aligned medical dataset generalizes effectively across multiple Near-OOD and Far-OOD benchmarks, achieving state-of-the-art performance while drastically reducing computational cost during inference. Compared to existing methods, SBDDM achieves a relative improvement of up to 10.43% and 18.10% for Near-OOD and Far-OOD detection, making it a practical building block for real-time, reliable computer-aided diagnosis.

[79] CST Anti-UAV: A Thermal Infrared Benchmark for Tiny UAV Tracking in Complex Scenes

Bin Xie,Congxuan Zhang,Fagan Wang,Peng Liu,Feng Lu,Zhen Chen,Weiming Hu

Main category: cs.CV

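TL;DR: The paper presents CST Anti-UAV, a new thermal infrared dataset for single object tracking (SOT) of tiny unmanned aerial vehicles (UAVs) in complex scenes, and evaluates 20 existing SOT methods on it; the results highlight the limitations of current techniques.

Details Motivation: The widespread use of UAVs raises safety and privacy concerns, but existing UAV tracking datasets lack scene complexity and object diversity and so fall short of real-world needs.

Contribution: Presents CST Anti-UAV, a thermal infrared tracking dataset for tiny UAVs in complex scenes, with 220 video sequences and over 240k high-quality bounding box annotations; to the authors' knowledge it is the first such dataset with complete manual frame-level attribute annotations.

Method: Thermal infrared video is collected and annotated to build a dataset with diverse, complex scenes and tiny UAV targets, and 20 existing SOT methods are evaluated on it.

Result: The best existing method reaches only 35.92% state accuracy, far below the 67.69% observed on the Anti-UAV410 dataset, showing that tracking tiny UAVs in complex scenes remains challenging.

Insight: The dataset exposes the limitations of existing tracking techniques and encourages the development of more robust SOT methods to improve anti-UAV systems.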

Abstract: The widespread application of Unmanned Aerial Vehicles (UAVs) has raised serious public safety and privacy concerns, making UAV perception crucial for anti-UAV tasks. However, existing UAV tracking datasets predominantly feature conspicuous objects and lack diversity in scene complexity and attribute representation, limiting their applicability to real-world scenarios. To overcome these limitations, we present the CST Anti-UAV, a new thermal infrared dataset specifically designed for Single Object Tracking (SOT) in Complex Scenes with Tiny UAVs (CST). It contains 220 video sequences with over 240k high-quality bounding box annotations, highlighting two key properties: a significant number of tiny-sized UAV targets and the diverse and complex scenes. To the best of our knowledge, CST Anti-UAV is the first dataset to incorporate complete manual frame-level attribute annotations, enabling precise evaluations under varied challenges. To conduct an in-depth performance analysis for CST Anti-UAV, we evaluate 20 existing SOT methods on the proposed dataset. Experimental results demonstrate that tracking tiny UAVs in complex environments remains a challenge, as the state-of-the-art method achieves only 35.92% state accuracy, much lower than the 67.69% observed on the Anti-UAV410 dataset. These findings underscore the limitations of existing benchmarks and the need for further advancements in UAV tracking research. The CST Anti-UAV benchmark is about to be publicly released, which not only fosters the development of more robust SOT methods but also drives innovation in anti-UAV systems.

[80] 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding

Ting Huang,Zeyu Zhang,Hao Tang

Main category: cs.CV

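TL;DR: 3D-R1 improves the reasoning and generalization of 3D vision-language models (VLMs) for scene understanding through high-quality synthetic data, a reinforcement learning strategy, and dynamic view selection.

Details Motivation: Current 3D VLMs struggle with robust reasoning and generalization, mainly because high-quality spatial data is scarce and viewpoints are assumed to be static.

Contribution: Proposes the 3D-R1 framework, comprising the Scene-30K synthetic dataset, GRPO-based reinforcement learning, and a dynamic view selection strategy.

Method: Builds the Scene-30K dataset as cold-start data; applies GRPO with three reward functions (perception, semantic similarity, and format) to strengthen reasoning; and adaptively selects the most informative views for scene understanding.

Result: An average improvement of 10% across various 3D scene benchmarks.

Insight: High-quality data and dynamic view selection matter for 3D reasoning tasks, and reinforcement learning with suitable rewards helps maintain detection accuracy and semantic precision.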

Abstract: Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization due to limitations in high-quality spatial data and the static nature of viewpoint assumptions. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with CoT, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start initialization data for 3D-R1. Moreover, we leverage RLHF policy such as GRPO in the reinforcement learning training process to enhance reasoning capabilities and introduce three reward functions: a perception reward, a semantic similarity reward and a format reward to maintain detection accuracy and answer semantic precision. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: https://github.com/AIGeeksGroup/3D-R1. Website: https://aigeeksgroup.github.io/3D-R1.
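The reward design described above can be illustrated with a minimal sketch of group-relative advantage computation, the core idea behind GRPO. The equal reward weights, the example scores, and the function names are assumptions made for illustration; the abstract does not specify how 3D-R1 combines its three rewards.

```python
from statistics import mean, pstdev


def total_reward(perception, semantic, fmt, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward terms (weights are hypothetical)."""
    w_p, w_s, w_f = weights
    return w_p * perception + w_s * semantic + w_f * fmt


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, each scored by the three rewards
# (made-up numbers, for illustration only).
group = [
    total_reward(0.8, 0.9, 1.0),
    total_reward(0.5, 0.4, 1.0),
    total_reward(0.2, 0.3, 0.0),
    total_reward(0.7, 0.6, 1.0),
]
advantages = group_relative_advantages(group)
```

Responses that score above the group mean receive a positive advantage and are reinforced; no separate value network is needed, which is the main design appeal of the group-relative formulation.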

[81] Seeing More with Less: Video Capsule Endoscopy with Multi-Task Learning

Julia Werner,Oliver Bause,Julius Oexle,Maxime Le Floch,Franz Brinkmann,Jochen Hampe,Oliver Bringmann

Main category: cs.CV

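TL;DR: The paper proposes a multi-task learning model that performs self-localization and anomaly detection for video capsule endoscopy, targeting the short battery life of the device and data sparsity; with only 1 million parameters it surpasses current baselines.

Details Motivation: Capsule endoscopes have short battery life, limited on-device resources, and sparse data, which makes intelligent real-time decision-making on the device challenging.

Contribution: A multi-task neural network that combines self-localization within the gastrointestinal tract and anomaly detection in the small intestine in a single small model (1 million parameters) that outperforms current single-task baselines.

Method: Combines established multi-task learning methods with Viterbi decoding for time-series analysis on the Galar dataset, while restricting the parameter count so the model can be deployed in a small capsule.

Result: Accuracy of 93.63% on the localization task and 87.48% on the anomaly detection task, with only 1 million parameters.

Insight: Multi-task learning makes effective use of limited resources and supports intelligent decision-making in capsule endoscopy, suggesting a direction for medical edge devices.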

Abstract: Video capsule endoscopy has become increasingly important for investigating the small intestine within the gastrointestinal tract. However, a persistent challenge remains the short battery lifetime of such compact sensor edge devices. Integrating artificial intelligence can help overcome this limitation by enabling intelligent real-time decision-making, thereby reducing the energy consumption and prolonging the battery life. However, this remains challenging due to data sparsity and the limited resources of the device restricting the overall model size. In this work, we introduce a multi-task neural network that combines the functionalities of precise self-localization within the gastrointestinal tract with the ability to detect anomalies in the small intestine within a single model. Throughout the development process, we consistently restricted the total number of parameters to ensure the feasibility of deploying such a model in a small capsule. We report the first multi-task results using the recently published Galar dataset, integrating established multi-task methods and Viterbi decoding for subsequent time-series analysis. This outperforms current single-task models and represents a significant advance in AI-based approaches in this field. Our model achieves an accuracy of 93.63% on the localization task and an accuracy of 87.48% on the anomaly detection task. The approach requires only 1 million parameters while surpassing the current baselines.
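To make the role of Viterbi decoding concrete, the following sketch smooths noisy per-frame location predictions into a temporally plausible sequence. The three states, the transition matrix, and the per-frame probabilities are hypothetical values chosen for illustration; the abstract does not describe the paper's actual state space or transition model.

```python
import math


def viterbi(log_emissions, log_transitions, log_prior):
    """Most likely state sequence given per-frame log-scores.

    log_emissions[t][s]: log-score of state s at frame t
    log_transitions[i][j]: log-probability of moving from state i to j
    log_prior[s]: log-probability of starting in state s
    """
    n_states = len(log_prior)
    score = [log_prior[s] + log_emissions[0][s] for s in range(n_states)]
    backpointers = []
    for frame in log_emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log_transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + log_transitions[best_i][j] + frame[j])
        score = new_score
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def logs(row):
    return [math.log(p) for p in row]


# Hypothetical states: 0 = stomach, 1 = small intestine, 2 = colon.
# The capsule mostly stays in place or moves forward (assumed values).
transitions = [logs(r) for r in [[0.90, 0.09, 0.01],
                                 [0.01, 0.90, 0.09],
                                 [0.01, 0.01, 0.98]]]
prior = logs([0.98, 0.01, 0.01])
# Made-up per-frame classifier probabilities; frame 3 is a noisy outlier.
frame_probs = [[0.80, 0.10, 0.10], [0.70, 0.20, 0.10], [0.20, 0.70, 0.10],
               [0.60, 0.30, 0.10], [0.10, 0.80, 0.10], [0.10, 0.20, 0.70],
               [0.05, 0.05, 0.90]]
emissions = [logs(row) for row in frame_probs]

raw = [max(range(3), key=lambda s: row[s]) for row in frame_probs]
smoothed = viterbi(emissions, transitions, prior)
```

Here the per-frame argmax jumps back to the stomach at frame 3, while the decoded path stays in the small intestine because a backward transition is assigned a very low probability.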

[82] Online Estimation of Table-Top Grown Strawberry Mass in Field Conditions with Occlusions

Jinshan Zhen,Yuanyue Ge,Tianxiao Zhu,Hui Zhao,Ya Xiong

Main category: cs.CV

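TL;DR: The paper proposes a vision-based pipeline that combines RGB-D sensing and deep learning for real-time, online mass estimation of table-top grown strawberries under occlusion in field conditions, addressing limitations of traditional methods.

Details Motivation: In field conditions, frequent occlusions and pose variations make strawberry mass estimation challenging, so a non-destructive, real-time, and robust solution is needed.

Contribution: A pipeline integrating YOLOv8-Seg instance segmentation, CycleGAN-based completion of occluded regions, and tilt-angle correction, enabling online mass estimation of strawberries under occlusion.

Method: YOLOv8-Seg performs instance segmentation, CycleGAN completes occluded regions, tilt-angle correction refines the frontal projection area, and a polynomial regression model maps the geometric features to mass.

Result: Mean mass estimation errors are 8.11% for isolated strawberries and 10.47% for occluded ones; CycleGAN outperforms the LaMa model in occlusion recovery.

Insight: CycleGAN performs well at occlusion recovery, which makes the approach a candidate for automated harvesting and yield monitoring under complex occlusion patterns.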

Abstract: Accurate mass estimation of table-top grown strawberries under field conditions remains challenging due to frequent occlusions and pose variations. This study proposes a vision-based pipeline integrating RGB-D sensing and deep learning to enable non-destructive, real-time and online mass estimation. The method employed YOLOv8-Seg for instance segmentation, a cycle-consistent generative adversarial network (CycleGAN) for occluded region completion, and tilt-angle correction to refine frontal projection area calculations. A polynomial regression model then mapped the geometric features to mass. Experiments demonstrated mean mass estimation errors of 8.11% for isolated strawberries and 10.47% for occluded cases. CycleGAN outperformed the large mask inpainting (LaMa) model in occlusion recovery, achieving superior pixel area ratios (PAR) (mean: 0.978 vs. 1.112) and higher intersection over union (IoU) scores (92.3% vs. 47.7% in the [0.9-1] range). This approach addresses critical limitations of traditional methods, offering a robust solution for automated harvesting and yield monitoring with complex occlusion patterns.
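The two occlusion-recovery metrics named in the abstract can be sketched for binary masks as follows. IoU is the standard definition; the PAR formula (recovered area divided by ground-truth area) is an assumption inferred from the reported values clustering around 1, and the tiny masks are made up for illustration.

```python
def mask_area(mask):
    """Number of foreground pixels in a binary mask (list of 0/1 rows)."""
    return sum(sum(row) for row in mask)


def iou(pred, truth):
    """Intersection over union of two binary masks of equal shape."""
    inter = sum(p & t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    union = sum(p | t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    return inter / union if union else 1.0


def pixel_area_ratio(pred, truth):
    """Assumed PAR definition: recovered area over ground-truth area."""
    return mask_area(pred) / mask_area(truth)


# Made-up 4x4 masks: same area, slightly different shape.
truth = [[0, 0, 0, 0],
         [0, 1, 1, 1],
         [0, 1, 1, 1],
         [0, 0, 0, 0]]
pred = [[0, 0, 1, 0],
        [0, 1, 1, 1],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]

par = pixel_area_ratio(pred, truth)
overlap = iou(pred, truth)
```

In this example the area ratio is exactly 1 while the IoU is 5/7, which shows why the two metrics are reported together: PAR only checks that the recovered area has the right size, whereas IoU also checks that it is in the right place.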

[83] Hyperbolic Cycle Alignment for Infrared-Visible Image Fusion

Timing Li,Bing Cao,Jiahe Feng,Haifang Cao,Qinghau Hu,Pengfei Zhu

Main category: cs.CV

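TL;DR: The paper proposes Hy-CycleAlign, a cross-modal image alignment method based on hyperbolic space. A dual-path cyclic registration framework and a hyperbolic hierarchy contrastive alignment module significantly improve multi-modal image alignment and fusion quality.

Details Motivation: Existing registration methods, often based on image translation in Euclidean space, handle cross-modal misalignment poorly, which limits multi-source fusion quality. The authors therefore explore image alignment in non-Euclidean (hyperbolic) space.

Contribution: 1. Proposes Hy-CycleAlign, which the authors describe as the first image registration method based on hyperbolic space; 2. Designs a dual-path cyclic registration framework and the Hyperbolic Hierarchy Contrastive Alignment (H$^{2}$CA) module; 3. Analyzes registration in Euclidean and hyperbolic spaces and shows that hyperbolic space enables more sensitive and effective multi-modal registration.

Method: 1. Dual-path cyclic registration framework: a forward registration network aligns the cross-modal inputs and a backward registration network reconstructs the original image, forming a closed loop with geometric consistency; 2. H$^{2}$CA module: maps images into hyperbolic space and imposes registration constraints, reducing interference from modality discrepancies.

Result: In experiments on misaligned multi-modal images, Hy-CycleAlign significantly outperforms existing methods in both alignment and fusion.

Insight: Hyperbolic space appears better suited than Euclidean space to handling the geometric and modality differences between cross-modal images, offering a new direction for multi-modal data alignment.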

Abstract: Image fusion synthesizes complementary information from multiple sources, mitigating the inherent limitations of unimodal imaging systems. Accurate image registration is essential for effective multi-source data fusion. However, existing registration methods, often based on image translation in Euclidean space, fail to handle cross-modal misalignment effectively, resulting in suboptimal alignment and fusion quality. To overcome this limitation, we explore image alignment in non-Euclidean space and propose a Hyperbolic Cycle Alignment Network (Hy-CycleAlign). To the best of our knowledge, Hy-CycleAlign is the first image registration method based on hyperbolic space. It introduces a dual-path cross-modal cyclic registration framework, in which a forward registration network aligns cross-modal inputs, while a backward registration network reconstructs the original image, forming a closed-loop registration structure with geometric consistency. Additionally, we design a Hyperbolic Hierarchy Contrastive Alignment (H$^{2}$CA) module, which maps images into hyperbolic space and imposes registration constraints, effectively reducing interference caused by modality discrepancies. We further analyze image registration in both Euclidean and hyperbolic spaces, demonstrating that hyperbolic space enables more sensitive and effective multi-modal image registration. Extensive experiments on misaligned multi-modal images demonstrate that our method significantly outperforms existing approaches in both image alignment and fusion. Our code will be publicly available.

[84] I Am Big, You Are Little; I Am Right, You Are Wrong

David A. Kelly,Akchunya Chanchal,Nathan Blake

Main category: cs.CV

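TL;DR: By analyzing the minimal sufficient pixel sets of image classifiers, the study shows that models with different architectures concentrate on different pixel regions when making decisions (ConvNext and EVA differ markedly from the others) and that misclassified images are associated with larger pixel sets.

Details Motivation: As image classifiers proliferate in size and architecture, choosing the right model matters more and more, yet understanding of how these models reach decisions is limited. The study uses minimal sufficient pixel sets to analyze a model's 'concentration'.

Contribution: Proposes minimal sufficient pixel sets as a way to gauge a model's 'concentration' and shows statistically significant differences between architectures in pixel-set size and position. It also finds that misclassified images are associated with larger pixel sets.

Method: Generates minimal sufficient pixel sets for images and compares their position, overlap, and size to gauge each model's 'concentration'.

Result: ConvNext and EVA models differ markedly from the others in pixel-set size and position, and misclassified images are associated with larger pixel sets than correct classifications.

Insight: The results suggest that architecture shapes which image regions a model relies on, and that misclassification may stem from the model needing more information to reach a decision. This offers a new perspective for model selection and optimization.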

Abstract: Machine learning for image classification is an active and rapidly developing field. With the proliferation of classifiers of different sizes and different architectures, the problem of choosing the right model becomes more and more important. While we can assess a model’s classification accuracy statistically, our understanding of the way these models work is unfortunately limited. In order to gain insight into the decision-making process of different vision models, we propose using minimal sufficient pixels sets to gauge a model’s 'concentration': the pixels that capture the essence of an image through the lens of the model. By comparing position, overlap, and size of sets of pixels, we identify that different architectures have statistically different concentration, in both size and position. In particular, ConvNext and EVA models differ markedly from the others. We also identify that images which are misclassified are associated with larger pixels sets than correct classifications.

[85] ART: Adaptive Relation Tuning for Generalized Relation Prediction

Gopika Sudhakaran,Hikaru Shindo,Patrick Schramowski,Simone Schaub-Meyer,Kristian Kersting,Stefan Roth

Main category: cs.CV

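TL;DR: ART is an Adaptive Relation Tuning framework that adapts vision-language models (VLMs) to visual relation detection (VRD) through instruction tuning and strategic instance selection, improving generalization.

Details Motivation: VRD models trained only on relation detection data struggle to generalize beyond the relations seen in training, and prompt tuning relies on handcrafted prompts and struggles with novel or complex relations. The authors argue that instruction tuning on diverse instructional data is a more effective alternative.

Contribution: Proposes ART, which uses instruction tuning and an adaptive sampling algorithm to focus the VLM on informative relations while maintaining generalizability; it can infer relation concepts unseen in training.

Method: Converts VRD datasets into an instruction-tuning format and selects informative instances with an adaptive sampling algorithm. The task is relation classification: given subject and object boxes, the model predicts the predicate between them. The model is tuned on a held-in set and evaluated on multiple held-out datasets of varying complexity.

Result: ART strongly improves over its baselines, can infer unseen relation concepts, and its predicted relations are used to segment complex scenes.

Insight: Instruction tuning is an effective tool for improving the generalization of VRD models, and adaptive sampling helps the model focus on informative relations while preserving generalizability.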

Abstract: Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it uses handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. Specifically, we focus on the relation classification, where subject-object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART’s practical value by using the predicted relations for segmenting complex scenes.

[86] Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation

Sobhan Asasi,Mohamed Ilyas Lakhal,Ozge Mercanoglu Sincan,Richard Bowden

Main category: cs.CV

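TL;DR: The paper proposes BeyondGloss, a gloss-free sign language translation framework that leverages the spatio-temporal reasoning of Video Large Language Models (VideoLLMs), combined with fine-grained textual descriptions of hand motion and a contrastive alignment module, and achieves state-of-the-art results on the Phoenix14T and CSL-Daily benchmarks.

Details Motivation: Sign Language Translation (SLT) must bridge the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements, and existing VideoLLMs struggle to model long videos in detail. This motivates generating fine-grained, temporally-aware textual descriptions.

Contribution: 1. Proposes BeyondGloss, a novel gloss-free SLT framework; 2. Designs a method for generating fine-grained, temporally-aware textual descriptions of hand motion; 3. Introduces a contrastive alignment module and distillation of HaMeR features to enrich hand-specific representations; 4. Reduces the modality gap with a contrastive loss between sign video representations and target language embeddings.

Method: 1. Uses the spatio-temporal reasoning of VideoLLMs to generate fine-grained textual descriptions of hand motion; 2. Aligns these descriptions with video features through a contrastive alignment module during pre-training; 3. Distills fine-grained features from HaMeR; 4. Applies a contrastive loss between sign video representations and target language embeddings in pre-training.

Result: State-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the framework.

Insight: Gloss-free methods are closer to practical application needs; fine-grained textual descriptions and contrastive alignment are the key ideas and may carry over to other temporal tasks.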

Abstract: Sign Language Translation (SLT) is a challenging task that requires bridging the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. To address these challenges, we introduce BeyondGloss, a novel gloss-free SLT framework that leverages the spatio-temporal reasoning capabilities of Video Large Language Models (VideoLLMs). Since existing VideoLLMs struggle to model long videos in detail, we propose a novel approach to generate fine-grained, temporally-aware textual descriptions of hand motion. A contrastive alignment module aligns these descriptions with video features during pre-training, encouraging the model to focus on hand-centric temporal dynamics and distinguish signs more effectively. To further enrich hand-specific representations, we distill fine-grained features from HaMeR. Additionally, we apply a contrastive loss between sign video representations and target language embeddings to reduce the modality gap in pre-training. BeyondGloss achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the proposed framework. We will release the code upon acceptance of the paper.

[87] Mamba-based Efficient Spatio-Frequency Motion Perception for Video Camouflaged Object Detection

Xin Li,Keren Fu,Qijun Zhao

Main category: cs.CV

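TL;DR: The paper proposes Vcamba, an efficient Mamba-based spatio-frequency motion perception method for video camouflaged object detection (VCOD). By combining spatial and frequency features, Vcamba improves detection accuracy and completeness.

Details Motivation: Existing VCOD methods rely mainly on spatial appearance features to perceive motion cues, but foreground and background are highly similar, so these features are weakly discriminative. Frequency features and the Mamba state space model can compensate for this and improve detection.

Contribution: 1. Proposes the Vcamba model, which integrates spatial and frequency features for efficient VCOD; 2. Designs the receptive field visual state space (RFVSS) module to extract multi-scale spatial features; 3. Introduces the adaptive frequency component enhancement (AFE) module for frequency learning; 4. Proposes the SLMP and FLMP modules to model long-range motion in the spatial and frequency domains, respectively; 5. Fuses the dual-domain features with the SFMF module.

Method: 1. The RFVSS module extracts multi-scale spatial features after sequence modeling; 2. The AFE module enhances frequency features with a frequency-domain sequential scanning strategy; 3. The SLMP and FLMP modules model motion in the spatial and frequency domains, respectively; 4. The SFMF module fuses the dual-domain features.

Result: Vcamba outperforms state-of-the-art methods across 6 evaluation metrics on 2 datasets, at lower computation cost.

Insight: Combining frequency features with Mamba addresses the weak discriminability of spatial features in VCOD while providing efficient long-sequence modeling.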

Abstract: Existing video camouflaged object detection (VCOD) methods primarily rely on spatial appearance features to perceive motion cues for breaking camouflage. However, the high similarity between foreground and background in VCOD results in limited discriminability of spatial appearance features (e.g., color and texture), restricting detection accuracy and completeness. Recent studies demonstrate that frequency features can not only enhance feature representation to compensate for appearance limitations but also perceive motion through dynamic variations in frequency energy. Furthermore, the emerging state space model called Mamba, enables efficient perception of motion cues in frame sequences due to its linear-time long-sequence modeling capability. Motivated by this, we propose a novel visual camouflage Mamba (Vcamba) based on spatio-frequency motion perception that integrates frequency and spatial features for efficient and accurate VCOD. Specifically, we propose a receptive field visual state space (RFVSS) module to extract multi-scale spatial features after sequence modeling. For frequency learning, we introduce an adaptive frequency component enhancement (AFE) module with a novel frequency-domain sequential scanning strategy to maintain semantic consistency. Then we propose a space-based long-range motion perception (SLMP) module and a frequency-based long-range motion perception (FLMP) module to model spatio-temporal and frequency-temporal sequences in spatial and frequency phase domains. Finally, the space and frequency motion fusion module (SFMF) integrates dual-domain features for unified motion representation. Experimental results show that our Vcamba outperforms state-of-the-art methods across 6 evaluation metrics on 2 datasets with lower computation cost, confirming the superiority of Vcamba. Our code is available at: https://github.com/BoydeLi/Vcamba.

[88] Medical Image De-Identification Benchmark Challenge

Linmin Pei,Granger Sutton,Michael Rutherford,Ulrike Wagner,Tracy Nolan,Kirk Smith,Phillip Farmer,Peter Gu,Ambar Rana,Kailing Chen,Thomas Ferleman,Brian Park,Ye Wu,Jordan Kojouharov,Gargi Singh,Jon Lemon,Tyler Willis,Milos Vukadinovic,Grant Duffy,Bryan He,David Ouyang,Marco Pereanez,Daniel Samber,Derek A. Smith,Christopher Cannistraci,Zahi Fayad,David S. Mendelson,Michele Bufano,Elmar Kotter,Hamideh Haghiri,Rajesh Baidya,Stefan Dvoretskii,Klaus H. Maier-Hein,Marco Nolden,Christopher Ablett,Silvia Siggillino,Sandeep Kaushik,Hongzhu Jiang,Sihan Xie,Zhiyu Wan,Alex Michie,Simon J Doran,Angeline Aurelia Waly,Felix A. Nathaniel Liang,Humam Arshad Mustagfirin,Michelle Grace Felicia,Kuo Po Chih,Rahul Krish,Ghulam Rasool,Nidhal Bouaynaya,Nikolas Koutsoubis,Kyle Naddeo,Kartik Pandit,Tony O’Sullivan,Raj Krish,Qinyan Pan,Scott Gustafson,Benjamin Kopchick,Laura Opsahl-Ong,Andrea Olvera-Morales,Jonathan Pinney,Kathryn Johnson,Theresa Do,Juergen Klenk,Maria Diaz,Arti Singh,Rong Chai,David A. Clunie,Fred Prior,Keyvan Farahani

Main category: cs.CV

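TL;DR: The paper reports on the Medical Image De-Identification Benchmark (MIDI-B) Challenge, a standardized platform for benchmarking DICOM de-identification tools against rules conformant to HIPAA Safe Harbor, using a multi-center dataset with synthetic PHI/PII inserted; participating teams scored highly.

Details Motivation: Sharing medical images must comply with patient privacy laws while preserving the non-PHI metadata that supports AI research. MIDI-B aims to provide a standardized platform for evaluating de-identification tools.

Contribution: 1. Designed a de-identification benchmark challenge conformant to HIPAA Safe Harbor and the DICOM Attribute Confidentiality Profiles; 2. Provided a multi-center, multi-modality dataset with synthetic PHI/PII; 3. Reports the performance of de-identification tools and the lessons learned.

Method: 1. The challenge had three phases: training, validation, and test; 2. It used real de-identified radiology images with synthetic identifiers inserted (216 subjects for validation and 322 for test); 3. Participants used open-source and proprietary tools with customized configurations, large language models, and OCR.

Result: Ten teams completed the test phase, with scores ranging from 97.91% to 99.93%, where a score is the percentage of correct actions out of all required actions.

Insight: 1. A standardized benchmark is valuable for evaluating de-identification tools; 2. A variety of techniques, including OCR, performed well on the task; 3. The challenge provides a reference point for future privacy-protection research.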

Abstract: The de-identification (deID) of protected health information (PHI) and personally identifiable information (PII) is a fundamental requirement for sharing medical images, particularly through public repositories, to ensure compliance with patient privacy laws. In addition, preservation of non-PHI metadata to inform and enable downstream development of imaging artificial intelligence (AI) is an important consideration in biomedical research. The goal of MIDI-B was to provide a standardized platform for benchmarking of DICOM image deID tools based on a set of rules conformant to the HIPAA Safe Harbor regulation, the DICOM Attribute Confidentiality Profiles, and best practices in preservation of research-critical metadata, as defined by The Cancer Imaging Archive (TCIA). The challenge employed a large, diverse, multi-center, and multi-modality set of real de-identified radiology images with synthetic PHI/PII inserted. The MIDI-B Challenge consisted of three phases: training, validation, and test. Eighty individuals registered for the challenge. In the training phase, we encouraged participants to tune their algorithms using their in-house or public data. The validation and test phases utilized the DICOM images containing synthetic identifiers (of 216 and 322 subjects, respectively). Ten teams successfully completed the test phase of the challenge. To measure success of a rule-based approach to image deID, scores were computed as the percentage of correct actions from the total number of required actions. The scores ranged from 97.91% to 99.93%. Participants employed a variety of open-source and proprietary tools with customized configurations, large language models, and optical character recognition (OCR). In this paper we provide a comprehensive report on the MIDI-B Challenge’s design, implementation, results, and lessons learned.
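The scoring rule stated in the abstract (percentage of correct actions out of all required actions) can be sketched as below. The attribute names are real DICOM keywords, but the action vocabulary and the example answer key are hypothetical; the abstract does not describe the challenge's actual answer-key format.

```python
def deid_score(required_actions, performed_actions):
    """Percentage of required actions that a tool performed correctly.

    Both arguments map an item key (here a DICOM attribute keyword) to an
    action label. The labels used below are illustrative assumptions.
    """
    if not required_actions:
        raise ValueError("answer key must contain at least one required action")
    correct = sum(
        1
        for key, action in required_actions.items()
        if performed_actions.get(key) == action
    )
    return 100.0 * correct / len(required_actions)


# Hypothetical answer key and tool output for a single image.
required = {
    "PatientName": "remove",
    "PatientID": "replace",
    "StudyDate": "shift",
    "Modality": "keep",
}
performed = {
    "PatientName": "remove",
    "PatientID": "replace",
    "StudyDate": "keep",   # the tool failed to shift the date
    "Modality": "keep",
}
score = deid_score(required, performed)
```

Because preserving research-critical metadata counts as a required action too, a tool that simply strips every attribute would lose points under this kind of rule, just as a tool that leaks identifiers would.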

[89] Consistent Point Matching

Halid Ziya Yerebakan,Gerardo Hermosillo Valadez

Main category: cs.CV

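TL;DR: The paper incorporates a consistency heuristic into a point-matching algorithm, improving robustness in matching anatomical locations across medical images, and reports results that surpass the state of the art on the Deep Lesion Tracking dataset.

Details Motivation: Accurately matching anatomical locations across medical images matters for clinical decision-making, and existing point-matching methods leave room for improvement in robustness and efficiency.

Contribution: 1. Proposes a point-matching algorithm with a consistency heuristic that improves matching robustness. 2. Validates the method on longitudinal internal and public datasets spanning CT and MRI, including Deep Lesion Tracking. 3. Shows that the method enables high-precision navigation between medical images without a machine learning model or training data.

Method: The algorithm adds a consistency heuristic to the point-matching process, runs efficiently on standard CPU hardware, and lets users trade off speed against robustness.

Result: The method surpasses state-of-the-art results on the Deep Lesion Tracking dataset and is validated on datasets spanning multiple modalities.

Insight: The consistency heuristic improves point-matching robustness without relying on machine learning, which suits resource-constrained environments.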

Abstract: This study demonstrates that incorporating a consistency heuristic into the point-matching algorithm \cite{yerebakan2023hierarchical} improves robustness in matching anatomical locations across pairs of medical images. We validated our approach on diverse longitudinal internal and public datasets spanning CT and MRI modalities. Notably, it surpasses state-of-the-art results on the Deep Lesion Tracking dataset. Additionally, we show that the method effectively addresses landmark localization. The algorithm operates efficiently on standard CPU hardware and allows configurable trade-offs between speed and robustness. The method enables high-precision navigation between medical images without requiring a machine learning model or training data.

[90] DivControl: Knowledge Diversion for Controllable Image Generation

Yucheng Xie,Fu Feng,Ruixiao Shi,Jing Wang,Yong Rui,Xin Geng

Main category: cs.CV

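TL;DR: DivControl is a decomposable pretraining framework based on knowledge diversion for unified controllable image generation and efficient adaptation. It factorizes ControlNet via SVD and uses dynamic gating to achieve zero-shot generalization and parameter-efficient adaptation.

Details Motivation: Existing controllable image generation methods either train a separate model for each condition or rely on unified architectures with entangled representations, resulting in poor generalization and high adaptation costs for novel conditions. DivControl aims to address this.

Contribution: Proposes DivControl, a decomposable pretraining framework that uses SVD factorization and knowledge diversion to disentangle ControlNet into modular components, supporting zero-shot generalization and efficient adaptation.

Method: Factorizes ControlNet via SVD into basic components, applies knowledge diversion through a dynamic gate, and adds a representation alignment loss to improve condition fidelity and training efficiency.

Result: DivControl achieves state-of-the-art controllability with 36.4$\times$ less training cost and delivers strong zero-shot and few-shot performance on unseen conditions.

Insight: Knowledge diversion and modular disentanglement are key to improving the generalization and adaptation efficiency of controllable generation models.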

Abstract: Diffusion models have advanced from text-to-image (T2I) to image-to-image (I2I) generation by incorporating structured inputs such as depth maps, enabling fine-grained spatial control. However, existing methods either train separate models for each condition or rely on unified architectures with entangled representations, resulting in poor generalization and high adaptation costs for novel conditions. To this end, we propose DivControl, a decomposable pretraining framework for unified controllable generation and efficient adaptation. DivControl factorizes ControlNet via SVD into basic components (pairs of singular vectors), which are disentangled into condition-agnostic learngenes and condition-specific tailors through knowledge diversion during multi-condition training. Knowledge diversion is implemented via a dynamic gate that performs soft routing over tailors based on the semantics of condition instructions, enabling zero-shot generalization and parameter-efficient adaptation to novel conditions. To further improve condition fidelity and training efficiency, we introduce a representation alignment loss that aligns condition embeddings with early diffusion features. Extensive experiments demonstrate that DivControl achieves state-of-the-art controllability with 36.4$\times$ less training cost, while simultaneously improving average performance on basic conditions. It also delivers strong zero-shot and few-shot performance on unseen conditions, demonstrating superior scalability, modularity, and transferability.

[91] SAMSA: Segment Anything Model Enhanced with Spectral Angles for Hyperspectral Interactive Medical Image Segmentation

Alfie Roddan,Tobias Czempiel,Chi Xu,Daniel S. Elson,Stamatia Giannarou

Main category: cs.CV

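TL;DR: SAMSA is an interactive segmentation framework that combines an RGB foundation model with spectral analysis, addressing data limitations and hardware variations in hyperspectral medical image segmentation.

Details Motivation: Hyperspectral imaging provides rich spectral information for medical imaging, but limited data and hardware variations make segmentation highly challenging.

Contribution: Proposes the SAMSA framework, in which user clicks guide both RGB segmentation and spectral similarity computation, with a spectral feature fusion strategy.

Method: Combines an RGB foundation model with spectral angle analysis, refines the segmentation through user clicks, and uses a spectral feature fusion strategy that is independent of spectral band count and resolution.

Result: On publicly available datasets, SAMSA reaches 81.0% 1-click and 93.4% 5-click DICE on a neurosurgical dataset, and 81.1% 1-click and 89.2% 5-click DICE on an intraoperative porcine dataset.

Insight: SAMSA is effective in few-shot and zero-shot scenarios and works across datasets with different spectral characteristics, providing a flexible framework for hyperspectral medical image analysis.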

Abstract: Hyperspectral imaging (HSI) provides rich spectral information for medical imaging, yet encounters significant challenges due to data limitations and hardware variations. We introduce SAMSA, a novel interactive segmentation framework that combines an RGB foundation model with spectral analysis. SAMSA efficiently utilizes user clicks to guide both RGB segmentation and spectral similarity computations. The method addresses key limitations in HSI segmentation through a unique spectral feature fusion strategy that operates independently of spectral band count and resolution. Performance evaluation on publicly available datasets has shown 81.0% 1-click and 93.4% 5-click DICE on a neurosurgical and 81.1% 1-click and 89.2% 5-click DICE on an intraoperative porcine hyperspectral dataset. Experimental results demonstrate SAMSA’s effectiveness in few-shot and zero-shot learning scenarios and using minimal training examples. Our approach enables seamless integration of datasets with different spectral characteristics, providing a flexible framework for hyperspectral medical image analysis.
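The spectral-angle idea in the title can be illustrated with a minimal sketch: the spectrum at a clicked pixel is compared with every other pixel by the angle between their spectra. This is the generic spectral angle measure, not SAMSA's exact formulation, which the abstract does not give; the function names and example spectra are made up.

```python
import math


def spectral_angle(a, b):
    """Angle in radians between two spectra with the same number of bands."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return math.pi / 2  # treat an all-zero spectrum as maximally dissimilar
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def similarity_map(image, click_row, click_col):
    """Spectral angle of every pixel to the clicked pixel's spectrum.

    `image` is a nested list indexed as image[row][col][band].
    """
    reference = image[click_row][click_col]
    return [[spectral_angle(pixel, reference) for pixel in row] for row in image]


# Made-up 1x3 image with 3 bands per pixel.
image = [[[0.2, 0.4, 0.6], [0.4, 0.8, 1.2], [0.6, 0.4, 0.2]]]
angles = similarity_map(image, 0, 0)
```

The angle ignores overall brightness (the second pixel is the first scaled by 2 and gets an angle of about 0) and is defined for any number of bands, which is consistent with the abstract's point about operating independently of spectral band count.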

[92] I2V-GS: Infrastructure-to-Vehicle View Transformation with Gaussian Splatting for Autonomous Driving Data Generation

Jialei Chen,Wuhao Xu,Sipeng He,Baoru Huang,Dongchun Ren

Main category: cs.CV

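TL;DR: The paper proposes I2V-GS, which uses Gaussian Splatting to transform infrastructure views into vehicle views for generating autonomous driving data, and introduces the RoadSight dataset. Experiments show significant gains in synthesis quality over StreetGaussian.

Details Motivation: Autonomous driving systems need large amounts of high-quality data, but collecting it with vehicles is expensive and inefficient. Synthesizing vehicle-view data from infrastructure views is a potential alternative.

Contribution: 1. Proposes I2V-GS, which the authors describe as the first framework to generate autonomous driving datasets with infrastructure-to-vehicle view transformation; 2. Introduces RoadSight, a multi-modality, multi-view dataset from real infrastructure-view scenarios; 3. Significantly improves synthesis quality under the vehicle view.

Method: Uses adaptive depth warping to generate dense training views, a cascade strategy to inpaint the warped images consistently across views, and cross-view information for confidence-guided optimization.

Result: I2V-GS outperforms StreetGaussian in NTA-Iou, NTL-Iou, and FID by 45.7%, 34.2%, and 14.9%, respectively.

Insight: Infrastructure-view data can be used to synthesize vehicle-view data efficiently, offering a new route to autonomous driving data generation.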

Abstract: Vast and high-quality data are essential for end-to-end autonomous driving systems. However, current driving data is mainly collected by vehicles, which is expensive and inefficient. A potential solution lies in synthesizing data from real-world images. Recent advancements in 3D reconstruction demonstrate photorealistic novel view synthesis, highlighting the potential of generating driving data from images captured on the road. This paper introduces a novel method, I2V-GS, to transfer the Infrastructure view To the Vehicle view with Gaussian Splatting. Reconstruction from sparse infrastructure viewpoints and rendering under large view transformations is a challenging problem. We adopt the adaptive depth warp to generate dense training views. To further expand the range of views, we employ a cascade strategy to inpaint warped images, which also ensures inpainting content is consistent across views. To further ensure the reliability of the diffusion model, we utilize the cross-view information to perform a confidence-guided optimization. Moreover, we introduce RoadSight, a multi-modality, multi-view dataset from real scenarios in infrastructure views. To our knowledge, I2V-GS is the first framework to generate autonomous driving datasets with infrastructure-vehicle view transformation. Experimental results demonstrate that I2V-GS significantly improves synthesis quality under vehicle view, outperforming StreetGaussian in NTA-Iou, NTL-Iou, and FID by 45.7%, 34.2%, and 14.9%, respectively.

[93] Enhanced Velocity Field Modeling for Gaussian Video Reconstruction

Zhenyang Li,Xiaoyang Bai,Tongchen Zhang,Pengfei Shen,Weiwei Xu,Yifan Peng

Main category: cs.CV

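TL;DR: FlowGaussian-VR is an enhanced velocity field modeling scheme for Gaussian video reconstruction. Optical flow-based optimization and adaptive adjustment of the Gaussians improve visual quality and trajectory tracking in dynamic scenes.

Details Motivation: Current deformation-field approaches to Gaussian reconstruction perform poorly on videos with complex motion and large scale variations: deformation networks overfit to irregular Gaussian trajectories, and the gradient-based densification strategy designed for static scenes cannot handle missing dynamic content.

Contribution: 1. Proposes the FlowGaussian-VR framework, which combines velocity field rendering (VFR) with flow-assisted adaptive densification (FAD); 2. Enables optical flow-based optimization and adaptive adjustment of the Gaussians in dynamic regions.

Method: 1. The VFR pipeline enables optical flow-based optimization; 2. The FAD strategy adjusts the number and size of Gaussians in dynamic regions.

Result: On multi-view dynamic reconstruction and novel view synthesis, PSNR improves by over 2.5 dB, dynamic textures show fewer blurry artifacts, and per-Gaussian trajectories become more regular and trackable.

Insight: Combining optical flow with adaptive densification of Gaussians addresses trajectory overfitting and missing content under complex motion, improving video reconstruction quality.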

Abstract: High-fidelity 3D video reconstruction is essential for enabling real-time rendering of dynamic scenes with realistic motion in virtual and augmented reality (VR/AR). The deformation field paradigm of 3D Gaussian splatting has achieved near-photorealistic results in video reconstruction due to the great representation capability of deep deformation networks. However, in videos with complex motion and significant scale variations, deformation networks often overfit to irregular Gaussian trajectories, leading to suboptimal visual quality. Moreover, the gradient-based densification strategy designed for static scene reconstruction proves inadequate to address the absence of dynamic content. In light of these challenges, we propose a flow-empowered velocity field modeling scheme tailored for Gaussian video reconstruction, dubbed FlowGaussian-VR. It consists of two core components: a velocity field rendering (VFR) pipeline which enables optical flow-based optimization, and a flow-assisted adaptive densification (FAD) strategy that adjusts the number and size of Gaussians in dynamic regions. We validate our model’s effectiveness on multi-view dynamic reconstruction and novel view synthesis with multiple real-world datasets containing challenging motion scenarios, demonstrating not only notable visual improvements (over 2.5 dB gain in PSNR) and less blurry artifacts in dynamic textures, but also regularized and trackable per-Gaussian trajectories.
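For readers less familiar with the metric behind the reported gain, PSNR can be sketched as below. This is the standard definition for signals scaled to a known maximum value; the pixel values are made up and the function name is chosen here for illustration.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up two-pixel example: every pixel is off by 0.1, so MSE = 0.01.
value_db = psnr([0.0, 1.0], [0.1, 0.9])

# A gain of 2.5 dB corresponds to the MSE shrinking by a factor of 10**0.25.
mse_ratio_for_2_5_db = 10 ** (2.5 / 10)
```

Because PSNR is logarithmic in the mean squared error, a gain of 2.5 dB means the error is roughly 1.78 times smaller.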

[94] DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching

Emery Pierson,Lei Li,Angela Dai,Maks Ovsjanikov

Main category: cs.CV

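TL;DR: DiffuMatch is a data-driven method for non-rigid shape matching based on spectral diffusion priors. It trains a generative model of functional maps in the spectral domain, replacing traditional axiomatic regularization strategies.

Details Motivation: Traditional non-rigid shape matching methods rely on axiomatic modeling of functional maps, which limits their accuracy and applicability. The paper aims to learn a prior over functional maps in the spectral domain in a data-driven way to make matching more robust.

Contribution: Shows for the first time that both in-network regularization and functional map training can be replaced with data-driven methods; a generative model in the spectral domain learns the structural properties of high-quality functional maps and enables category-agnostic matching.

Method: Trains a score-based generative model of functional maps in the spectral domain, and uses a novel distillation strategy from diffusion models in the spectral domain to promote the structural properties of functional maps.

Result: Experiments show that the learned regularization outperforms traditional axiomatic approaches on zero-shot non-rigid shape matching.

Insight: Learning a spectral-domain prior over functional maps from data removes the dependence on axiomatic models and improves the generality and accuracy of matching.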

Abstract: Deep functional maps have recently emerged as a powerful tool for solving non-rigid shape correspondence tasks. Methods that use this approach combine the power and flexibility of the functional map framework, with data-driven learning for improved accuracy and generality. However, most existing methods in this area restrict the learning aspect only to the feature functions and still rely on axiomatic modeling for formulating the training loss or for functional map regularization inside the networks. This limits both the accuracy and the applicability of the resulting approaches only to scenarios where assumptions of the axiomatic models hold. In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. For this, we first train a generative model of functional maps in the spectral domain using score-based generative modeling, built from a large collection of high-quality maps. We then exploit the resulting model to promote the structural properties of ground truth functional maps on new shape collections. Remarkably, we demonstrate that the learned models are category-agnostic, and can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality of functional maps. Our key technical contribution is a novel distillation strategy from diffusion models in the spectral domain. Experiments demonstrate that our learned regularization leads to better results than axiomatic approaches for zero-shot non-rigid shape matching. Our code is available at: https://github.com/daidedou/diffumatch/
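As background for the axiomatic strategies the abstract says can be replaced, the two named regularizers (Laplacian commutativity and orthogonality) can be sketched for a small functional map matrix. These are the commonly used penalties, not DiffuMatch's learned prior; the matrix, the eigenvalues, and the function names are made up for illustration.

```python
def commutativity_penalty(C, evals_src, evals_tgt):
    """Squared Frobenius norm of C @ diag(evals_src) - diag(evals_tgt) @ C.

    C maps spectral coefficients on the source shape to the target shape.
    With diagonal eigenvalue matrices, entry (i, j) of the residual is
    C[i][j] * (evals_src[j] - evals_tgt[i]).
    """
    return sum(
        (C[i][j] * (evals_src[j] - evals_tgt[i])) ** 2
        for i in range(len(C))
        for j in range(len(C[0]))
    )


def orthogonality_penalty(C):
    """Squared Frobenius norm of C^T C - I."""
    k = len(C[0])
    total = 0.0
    for a in range(k):
        for b in range(k):
            gram = sum(row[a] * row[b] for row in C)
            total += (gram - (1.0 if a == b else 0.0)) ** 2
    return total


# Made-up Laplacian eigenvalues and functional maps.
evals = [0.0, 1.5]
identity_map = [[1.0, 0.0], [0.0, 1.0]]
perturbed_map = [[1.0, 0.2], [0.0, 1.0]]

ideal = (commutativity_penalty(identity_map, evals, evals),
         orthogonality_penalty(identity_map))
noisy = (commutativity_penalty(perturbed_map, evals, evals),
         orthogonality_penalty(perturbed_map))
```

Both penalties vanish for the identity map and grow once off-diagonal leakage is introduced; the paper's claim is that a learned diffusion prior can supply this kind of structure without hand-specifying such terms.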

[95] RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping

Dongming Wu,Yanping Fu,Saike Huang,Yingfei Liu,Fan Jia,Nian Liu,Feng Dai,Tiancai Wang,Rao Muhammad Anwer,Fahad Shahbaz Khan,Jianbing Shen

Main category: cs.CV

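TL;DR: RAGNet is a large-scale, reasoning-based, grasping-oriented affordance segmentation benchmark with 273k images and 26k reasoning instructions. The accompanying AffordanceNet framework combines a vision-language model with a grasping network and improves open-world generalization.

Details Motivation: Current robotic grasping research lacks large-scale, reasoning-based affordance prediction data, which limits open-world applicability. A benchmark covering diverse scenes and human-like instructions is needed.

Contribution: 1) Builds the RAGNet benchmark with 273k images, 180 categories, and 26k reasoning instructions; 2) Proposes the AffordanceNet framework, which combines a vision-language model with a grasping network.

Method: AffordanceNet has two parts: 1) a vision-language model (VLM) pre-trained on the affordance data; 2) a grasping network conditioned on the affordance map.

Result: Strong results on affordance segmentation benchmarks and real-robot manipulation tasks, demonstrating powerful open-world generalization.

Insight: Combining language instructions with affordance maps can substantially improve the open-world adaptability of robotic grasping systems.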

Abstract: General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies lack reasoning-based large-scale affordance prediction data, leading to considerable concern about open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. They are carefully annotated with an affordance map, while the difficulty of language instructions is largely increased by removing their category name and only providing functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that conditions on an affordance map to grasp the target. Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has a powerful open-world generalization ability. Our data and code are available at https://github.com/wudongming97/AffordanceNet.

[96] Slot Attention with Re-Initialization and Self-Distillation

Rongzhen Zhao,Yi Zhao,Juho Kannala,Joni Pajarinen

Main category: cs.CV

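TL;DR: The paper proposes DIAS, which improves Slot Attention through re-initialization and self-distillation, reducing redundancy and improving object representations. It reaches SOTA on object discovery and recognition tasks.

Details Motivation: In existing Object-Centric Learning (OCL) methods, Slot Attention reuses slots directly after initialization, so redundant slots compete with informative ones and objects are wrongly split into parts. In addition, supervision comes only from decoding slots to reconstruct the input, which ignores potential supervision from internal information.

Contribution: 1) Proposes a re-initialization mechanism that reduces redundant slots; 2) introduces a self-distillation method that improves attention maps; 3) achieves SOTA performance on object discovery and recognition tasks.

Method: DIAS combines slot re-initialization with self-distillation. Re-initialization reduces redundancy and updates the remaining slots; self-distillation drives the attention map of the first iteration toward the attention map of the last iteration.

Result: DIAS performs strongly on object discovery and recognition tasks and also improves advanced visual prediction and reasoning.

Insight: Dynamically updating slots and applying self-distillation can significantly improve object-centric learning, reducing redundancy and erroneous segmentation.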

Abstract: Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input’s reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): (i) We reduce redundancy in the aggregated slots and re-initialize extra aggregation to update the remaining slots; (ii) We drive the bad attention map at the first aggregation iteration to approximate the good one at the last iteration to enable self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art on OCL tasks like object discovery and recognition, while also improving advanced visual prediction and reasoning. Our code is available on https://github.com/Genera1Z/DIAS.

[97] SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting

Di Li,Jie Feng,Jiahao Chen,Weisheng Dong,Guanbin Li,Yuhui Zheng,Mingtao Feng,Guangming Shi

Main category: cs.CV

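TL;DR: This paper presents SeqAffordSplat, a large-scale benchmark for long-horizon affordance reasoning in 3D Gaussian Splatting (3DGS) environments, and SeqSplatNet, a framework that combines a large language model with a conditional decoder to predict affordance masks for multi-step tasks.

Details Motivation: Existing 3D affordance reasoning methods are limited to single-object, single-step interactions and cannot handle the multi-object, long-horizon demands of complex real-world tasks.

Contribution: 1. Introduces the new task of Sequential 3D Gaussian Affordance Reasoning; 2. builds the SeqAffordSplat benchmark (1800+ scenes); 3. proposes the SeqSplatNet framework, which combines an LLM with a conditional decoder; 4. introduces a pre-training strategy and a semantic feature fusion mechanism.

Method: SeqSplatNet uses a large language model to autoregressively generate text and segmentation tokens, which guide a conditional decoder to produce 3D masks. It also includes a pre-training strategy (Conditional Geometric Reconstruction) and an injection mechanism for semantic features extracted from 2D vision foundation models.

Result: Experiments show that the method achieves state-of-the-art performance on the proposed benchmark and extends affordance reasoning from single-step interactions to scene-level multi-step tasks.

Insight: Combining LLMs with 3D vision models can handle long-horizon affordance reasoning in complex scenes; pre-training and semantic fusion are key to the performance gains.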

Abstract: 3D affordance reasoning, the task of associating human instructions with the functional regions of 3D objects, is a critical capability for embodied agents. Current methods based on 3D Gaussian Splatting (3DGS) are fundamentally limited to single-object, single-step interactions, a paradigm that falls short of addressing the long-horizon, multi-object tasks required for complex real-world applications. To bridge this gap, we introduce the novel task of Sequential 3D Gaussian Affordance Reasoning and establish SeqAffordSplat, a large-scale benchmark featuring 1800+ scenes to support research on long-horizon affordance understanding in complex 3DGS environments. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. SeqSplatNet employs a large language model that autoregressively generates text interleaved with special segmentation tokens, guiding a conditional decoder to produce the corresponding 3D mask. To handle complex scene geometry, we introduce a pre-training strategy, Conditional Geometric Reconstruction, where the model learns to reconstruct complete affordance region masks from known geometric observations, thereby building a robust geometric prior. Furthermore, to resolve semantic ambiguities, we design a feature injection mechanism that lifts rich semantic features from 2D Vision Foundation Models (VFM) and fuses them into the 3D decoder at multiple scales. Extensive experiments demonstrate that our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.

[98] Half-Physics: Enabling Kinematic 3D Human Model with Physical Interactions

Li Siyao,Yao Feng,Omid Tehari,Chen Change Loy,Michael J. Black

Main category: cs.CV

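TL;DR: The paper proposes a "half-physics" method that embeds the SMPL-X human model into dynamic physical interactions. It addresses the inability of conventional kinematic models to interact realistically with objects while avoiding penetration and unrealistic object dynamics.

Details Motivation: Current general-purpose 3D human models (such as SMPL-X) represent shape and pose efficiently but lack the ability to interact physically, which leads to penetration and unrealistic dynamics during interaction.

Contribution: Proposes a "half-physics" mechanism that transforms kinematic motion into a physics simulation while keeping kinematic pose control, enabling physically plausible interactions.

Method: The method embeds SMPL-X as an entity capable of dynamic interaction and couples it with physics simulation, ensuring physically plausible interactions without any additional learning.

Result: The method runs in real time, adapts to arbitrary body shapes and motions, and preserves the fidelity of the original kinematic motion.

Insight: The work offers a lightweight, training-free approach to physical interaction and a new way of combining kinematic models with physics simulation.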

Abstract: While current general-purpose 3D human models (e.g., SMPL-X) efficiently represent accurate human shape and pose, they lack the ability to physically interact with the environment due to their kinematic nature. As a result, kinematic-based interaction models often suffer from issues such as interpenetration and unrealistic object dynamics. To address this limitation, we introduce a novel approach that embeds SMPL-X into a tangible entity capable of dynamic physical interactions with its surroundings. Specifically, we propose a “half-physics” mechanism that transforms 3D kinematic motion into a physics simulation. Our approach maintains kinematic control over inherent SMPL-X poses while ensuring physically plausible interactions with scenes and objects, effectively eliminating penetration and unrealistic object dynamics. Unlike reinforcement learning-based methods, which demand extensive and complex training, our half-physics method is learning-free and generalizes to any body shape and motion; meanwhile, it operates in real time. Moreover, it preserves the fidelity of the original kinematic motion while seamlessly integrating physical interactions.

[99] Phi-Ground Tech Report: Advancing Perception in GUI Grounding

Miaosen Zhang,Ziqiang Xu,Jialiang Zhu,Qi Dai,Kai Qiu,Yifan Yang,Chong Luo,Tianyi Chen,Justin Wagle,Tim Franklin,Baining Guo

Main category: cs.CV

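TL;DR: The paper presents an empirical study of training GUI grounding models and introduces the Phi-Ground model family, which achieves the best performance on several benchmarks. It also shares training details and lessons learned.

Details Motivation: Current end-to-end grounding models reach less than 65% accuracy on challenging benchmarks, far below what real deployment requires, so better training methods are needed.

Contribution: Introduces the Phi-Ground model family, which achieves SOTA performance on five grounding benchmarks and stands out particularly in the end-to-end setting.

Method: Through a detailed empirical study covering data collection through model training, the authors optimize the training process for grounding models and develop the Phi-Ground series.

Result: Phi-Ground scores 43.2 on ScreenSpot-pro and 27.2 on UI-Vision, the best results among the compared models.

Insight: The training details and failure cases reported in the paper are relevant not only to GUI grounding but also to other perception tasks.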

Abstract: With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from “Iron Man”, are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical control in robotics, and it directly leads to the success or failure of the system. It determines actions such as clicking and typing, as well as related parameters like the coordinates for clicks. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from being ready for deployment. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: https://zhangmiaosen2000.github.io/Phi-Ground/

[100] MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion

Zihan Wang,Jeff Tan,Tarasha Khurana,Neehar Peri,Deva Ramanan

Main category: cs.CV

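TL;DR: MonoFusion is a method for dynamic scene reconstruction from sparse views. By fusing monocular reconstructions from individual cameras, it avoids the high cost and limitations of dense multi-view camera systems.

Details Motivation: Dense multi-view camera systems (such as Panoptic Studio) are expensive and cannot be used in the wild. The authors aim to reconstruct dynamic scenes, such as repairing a bike or dancing, from a sparse set of cameras (for example, four static cameras).

Contribution: Proposes a method for aligning independent monocular reconstructions to obtain time- and view-consistent dynamic scene reconstructions, significantly improving reconstruction quality under sparse views.

Method: The monocular reconstructions from each camera are carefully aligned to produce a consistent multi-view dynamic reconstruction. Experiments show that the method outperforms prior work in the sparse-view setting.

Result: Experiments on PanopticStudio and Ego-Exo4D show that MonoFusion yields higher reconstruction quality when rendering novel views. Code and data are open source.

Insight: Sparse-view dynamic reconstruction can be achieved by fusing monocular results, which offers a feasible route to low-cost capture in the wild.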

Abstract: We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on PanopticStudio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available on https://github.com/ImNotPrepared/MonoFusion.

[101] Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

Bowen Zhang,Sicheng Xu,Chuxin Wang,Jiaolong Yang,Feng Zhao,Dong Chen,Baining Guo

Main category: cs.CV

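TL;DR: This paper proposes a video-to-4D generation framework that directly encodes Gaussian Splats (GS) and their temporal variations and pairs this with a temporal-aware Diffusion Transformer to generate high-quality dynamic 3D content.

Details Motivation: Existing methods struggle to generate high-quality dynamic 3D content from a single video input because data construction is costly and jointly representing 3D shape, appearance, and motion is high-dimensional.

Contribution: Proposes a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical GS and their temporal variations, and a Gaussian Variation Field diffusion model that enables efficient 4D generation.

Method: A VAE directly encodes canonical GS and their temporal variations; a model with a temporal-aware Diffusion Transformer is then trained on this representation to generate high-quality 4D content.

Result: The model, trained on the Objaverse dataset, shows superior generation quality and generalizes well to real video inputs that it never saw during training.

Insight: Compressing high-dimensional animation data into a compact latent space and combining it with a diffusion model offers a new approach to generating high-quality dynamic 3D content.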

Abstract: In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully-curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/.

cs.LG [Back]

[102] SequenceLayers: Sequence Processing and Streaming Neural Networks Made Easy

RJ Skerry-Ryan,Julian Salazar,Soroosh Mariooryad,David Kao,Daisy Stanton,Eric Battenberg,Matt Shannon,Ron J. Weiss,Robin Scheibler,Jonas Rothfuss,Tom Bagby

Main category: cs.LG

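TL;DR: SequenceLayers is a neural network layer API and library for sequence modeling. It simplifies building sequence models that can run both layer-by-layer (e.g., teacher-forced training) and step-by-step (e.g., autoregressive sampling). Explicit state representations and a step method enable efficient streaming, reduce common bugs, and allow implementations across libraries.

Details Motivation: Conventional sequence models often suffer from complicated state management and frequent bugs in streaming and parallel processing. SequenceLayers addresses these problems with a unified state management mechanism and API design, simplifying model development and deployment.

Contribution: 1. Proposes a neural network layer API with explicit state representation that supports streaming and parallel sequence processing. 2. Uses a step method and a state management mechanism to keep layer-wise and step-wise execution results consistent, reducing bugs. 3. Provides a composable, declarative API that simplifies building production-scale models.

Method: 1. Layers define an explicit state representation (e.g., a Transformer KV cache, a convolution buffer, an RNN hidden state). 2. A step method updates that state one step at a time and stays consistent with layer-wise execution. 3. A composable API with a rich set of layers and combinators makes model construction easier.

Result: SequenceLayers achieves efficient streaming sequence processing and resolves state management and execution consistency issues. It is implemented in JAX and TensorFlow 2 and is open source.

Insight: Explicit state management and a unified execution mechanism are central to streaming sequence processing; a high-level API design can significantly simplify the development and maintenance of complex models.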

Abstract: We introduce a neural network layer API and library for sequence modeling, designed for easy creation of sequence models that can be executed both layer-by-layer (e.g., teacher-forced training) and step-by-step (e.g., autoregressive sampling). To achieve this, layers define an explicit representation of their state over time (e.g., a Transformer KV cache, a convolution buffer, an RNN hidden state), and a step method that evolves that state, tested to give identical results to a stateless layer-wise invocation. This and other aspects of the SequenceLayers contract enable complex models to be immediately streamable, mitigate a wide range of common bugs arising in both streaming and parallel sequence processing, and can be implemented in any deep learning library. A composable and declarative API, along with a comprehensive suite of layers and combinators, streamlines the construction of production-scale models from simple streamable components while preserving strong correctness guarantees. Our current implementations of SequenceLayers (JAX, TensorFlow 2) are available at https://github.com/google/sequence-layers.
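
The layer-wise versus step-wise contract described in the abstract can be illustrated with a toy causal convolution. This is a minimal sketch in plain Python, not the SequenceLayers API: the class `CausalConv1D` and its methods `initial_state`, `layer`, and `step` are hypothetical names chosen for illustration. The state is the "convolution buffer" the abstract mentions, and the final assertion checks that streaming one sample at a time reproduces the whole-sequence result.

```python
class CausalConv1D:
    """Toy causal 1-D convolution with an explicit streaming state (illustrative only)."""

    def __init__(self, kernel):
        # kernel[-1] multiplies the current sample, kernel[0] the oldest one.
        self.kernel = list(kernel)

    def initial_state(self):
        # The state buffers the last (K - 1) inputs; it starts as zero padding.
        return [0.0] * (len(self.kernel) - 1)

    def layer(self, xs):
        """Process a whole sequence at once (stateless, layer-wise call)."""
        k = len(self.kernel)
        padded = self.initial_state() + list(xs)
        return [
            sum(w * v for w, v in zip(self.kernel, padded[t:t + k]))
            for t in range(len(xs))
        ]

    def step(self, x, state):
        """Process one timestep and return (output, new_state)."""
        window = state + [x]
        y = sum(w * v for w, v in zip(self.kernel, window))
        return y, window[1:]  # drop the oldest sample


conv = CausalConv1D([0.5, 0.25, 1.0])
xs = [1.0, 2.0, 3.0, 4.0]

# Stream the sequence one sample at a time, threading the state through.
state = conv.initial_state()
streamed = []
for x in xs:
    y, state = conv.step(x, state)
    streamed.append(y)

# Step-wise execution matches the layer-wise call.
assert streamed == conv.layer(xs)
```

Keeping the state as an explicit value passed into and returned from `step`, rather than hidden inside the object, is what makes the equivalence easy to test in this sketch.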

[103] Planning for Cooler Cities: A Multimodal AI Framework for Predicting and Mitigating Urban Heat Stress through Urban Landscape Transformation

Shengao Yi,Xiaojiang Li,Wei Tu,Tianhong Zhao

Main category: cs.LG

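TL;DR: GSM-UTCI is a multimodal deep learning framework for predicting urban heat stress. By dynamically fusing surface morphology with meteorological data, it achieves accuracy close to that of physical models at much lower cost and provides decision support for urban landscape transformation.

Details Motivation: As climate change and urbanization intensify, urban heat stress is becoming more severe. Traditional physical models are computationally expensive, which limits their use in large-scale urban planning.

Contribution: Proposes the GSM-UTCI framework, which fuses surface morphology and meteorological data to predict heat stress accurately and efficiently, and provides a scientific basis for urban transformation.

Method: Uses multimodal deep learning with a FiLM architecture to fuse nDSM, high-resolution land cover data, and meteorological conditions, and trains a UTCI prediction model on SOLWEIG-derived maps.

Result: The model reaches an R² of 0.9151 and an MAE of 0.41°C and reduces inference time from hours to under five minutes. In simulated landscape transformations, replacing impervious surfaces with tree canopy produces the strongest cooling effect.

Insight: GSM-UTCI provides a scalable, fine-grained decision tool for urban climate adaptation and reveals the cooling potential of different landscape transformation strategies.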

Abstract: As extreme heat events intensify due to climate change and urbanization, cities face increasing challenges in mitigating outdoor heat stress. While traditional physical models such as SOLWEIG and ENVI-met provide detailed assessments of human-perceived heat exposure, their computational demands limit scalability for city-wide planning. In this study, we propose GSM-UTCI, a multimodal deep learning framework designed to predict daytime average Universal Thermal Climate Index (UTCI) at 1-meter hyperlocal resolution. The model fuses surface morphology (nDSM), high-resolution land cover data, and hourly meteorological conditions using a feature-wise linear modulation (FiLM) architecture that dynamically conditions spatial features on atmospheric context. Trained on SOLWEIG-derived UTCI maps, GSM-UTCI achieves near-physical accuracy, with an R² of 0.9151 and a mean absolute error (MAE) of 0.41°C, while reducing inference time from hours to under five minutes for an entire city. To demonstrate its planning relevance, we apply GSM-UTCI to simulate systematic landscape transformation scenarios in Philadelphia, replacing bare earth, grass, and impervious surfaces with tree canopy. Results show spatially heterogeneous but consistently strong cooling effects, with impervious-to-tree conversion producing the highest aggregated benefit (-4.18°C average change in UTCI across 270.7 km²). Tract-level bivariate analysis further reveals strong alignment between thermal reduction potential and land cover proportions. These findings underscore the utility of GSM-UTCI as a scalable, fine-grained decision support tool for urban climate adaptation, enabling scenario-based evaluation of greening strategies across diverse urban environments.
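
Feature-wise linear modulation (FiLM), which the abstract names as the fusion mechanism, scales and shifts each feature channel with values computed from a conditioning vector. The sketch below shows that generic operation in plain Python. It is not the GSM-UTCI architecture: the function names `linear` and `film`, the weights, and the two-element "meteorological" vector are all made-up assumptions for illustration.

```python
def linear(weights, bias, vec):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM: out[c] = gamma[c] * features[c] + beta[c].

    features: list of channels, each a 2-D grid (list of rows).
    cond:     conditioning vector (here standing in for weather inputs).
    """
    gamma = linear(w_gamma, b_gamma, cond)  # per-channel scale
    beta = linear(w_beta, b_beta, cond)     # per-channel shift
    return [
        [[g * px + b for px in row] for row in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Two feature channels on a 1x2 grid, and a two-element conditioning vector.
features = [[[1.0, 2.0]], [[3.0, 4.0]]]
cond = [1.0, 2.0]

modulated = film(
    features,
    cond,
    w_gamma=[[1.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 0.0],  # gamma = [1.0, 2.0]
    w_beta=[[0.0, 0.0], [1.0, 0.0]], b_beta=[0.5, 0.0],    # beta  = [0.5, 1.0]
)
print(modulated)  # [[[1.5, 2.5]], [[7.0, 9.0]]]
```

The same scale and shift are applied at every spatial location of a channel, so the conditioning vector changes how spatial features are weighted without itself needing a spatial layout.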

[104] Investigating the Invertibility of Multimodal Latent Spaces: Limitations of Optimization-Based Methods

Siwoo Park

Main category: cs.LG

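TL;DR: This paper studies the invertibility of multimodal latent spaces and finds that optimization-based methods are limited for inverse mapping, producing semantically incoherent and perceptually poor results.

Details Motivation: Multimodal models excel at forward tasks (such as text-to-image generation), but their inverse mapping capabilities have not been fully explored. The paper tests whether these latent spaces support meaningful and coherent inverse mappings.

Contribution: Proposes an optimization-based inverse mapping framework and demonstrates the limitations of multimodal latent spaces on inverse tasks, showing that they lack semantic interpretability and perceptual coherence.

Method: Applies the optimization framework bidirectionally to text-image and text-audio models (e.g., BLIP, Chatterbox-TTS) and evaluates the semantic and perceptual quality of the results.

Result: Experiments show that optimization can produce outputs that align textually with the targets, but the inversions are perceptually chaotic and semantically incoherent, and the latent space embeddings lack interpretability.

Insight: Current multimodal latent spaces are optimized mainly for forward tasks and lack the structure needed for robust inverse mapping. Further research is needed to develop semantically rich, truly invertible latent spaces.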

Abstract: This paper investigates the inverse capabilities and broader utility of multimodal latent spaces within task-specific AI (Artificial Intelligence) models. While these models excel at their designed forward tasks (e.g., text-to-image generation, audio-to-text transcription), their potential for inverse mappings remains largely unexplored. We propose an optimization-based framework to infer input characteristics from desired outputs, applying it bidirectionally across Text-Image (BLIP, Flux.1-dev) and Text-Audio (Whisper-Large-V3, Chatterbox-TTS) modalities. Our central hypothesis posits that while optimization can guide models towards inverse tasks, their multimodal latent spaces will not consistently support semantically meaningful and perceptually coherent inverse mappings. Experimental results consistently validate this hypothesis. We demonstrate that while optimization can force models to produce outputs that align textually with targets (e.g., a text-to-image model generating an image that an image captioning model describes correctly, or an ASR model transcribing optimized audio accurately), the perceptual quality of these inversions is chaotic and incoherent. Furthermore, when attempting to infer the original semantic input from generative models, the reconstructed latent space embeddings frequently lack semantic interpretability, aligning with nonsensical vocabulary tokens. These findings highlight a critical limitation: multimodal latent spaces, primarily optimized for specific forward tasks, do not inherently possess the structure required for robust and interpretable inverse mappings. Our work underscores the need for further research into developing truly semantically rich and invertible multimodal latent spaces.

[105] FuseTen: A Generative Model for Daily 10 m Land Surface Temperature Estimation from Spatio-Temporal Satellite Observations

Sofiane Bouaziz,Adel Hafiane,Raphael Canals,Rachid Nedjai

Main category: cs.LG

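TL;DR: FuseTen is a generative model that fuses satellite observations from Sentinel-2, Landsat 8, and Terra MODIS to produce daily land surface temperature (LST) estimates at 10 m spatial resolution.

Details Motivation: Under climate change, urban heatwaves, droughts, and land degradation are growing problems, and studying them requires accurate spatio-temporal LST data. Existing satellite data, however, trade off spatial against temporal resolution; FuseTen aims to close this gap.

Contribution: Proposes the first non-linear method that generates daily LST estimates at 10 m spatial resolution, significantly outperforming baselines on quantitative and visual metrics.

Method: Uses a generative architecture with attention and normalization modules, an averaging-based supervision strategy grounded in physical principles, and a PatchGAN discriminator to enforce realism.

Result: Experiments show that FuseTen improves quantitative metrics by 32.06% on average and visual fidelity by 31.42%.

Insight: Fusing multi-source satellite data with a generative model can significantly raise the spatial resolution of LST, giving climate change research finer-grained data.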

Abstract: Urban heatwaves, droughts, and land degradation are pressing and growing challenges in the context of climate change. A valuable approach to studying them requires accurate spatio-temporal information on land surface conditions. One of the most important variables for assessing and understanding these phenomena is Land Surface Temperature (LST), which is derived from satellites and provides essential information about the thermal state of the Earth’s surface. However, satellite platforms inherently face a trade-off between spatial and temporal resolutions. To bridge this gap, we propose FuseTen, a novel generative framework that produces daily LST observations at a fine 10 m spatial resolution by fusing spatio-temporal observations derived from Sentinel-2, Landsat 8, and Terra MODIS. FuseTen employs a generative architecture trained using an averaging-based supervision strategy grounded in physical principles. It incorporates attention and normalization modules within the fusion process and uses a PatchGAN discriminator to enforce realism. Experiments across multiple dates show that FuseTen outperforms linear baselines, with an average 32.06% improvement in quantitative metrics and 31.42% in visual fidelity. To the best of our knowledge, this is the first non-linear method to generate daily LST estimates at such fine spatial resolution.

[106] DepMicroDiff: Diffusion-Based Dependency-Aware Multimodal Imputation for Microbiome Data

Rabeya Tus Sadia,Qiang Cheng

Main category: cs.LG

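TL;DR: DepMicroDiff is a multimodal imputation framework for microbiome data that combines a diffusion model with a dependency-aware Transformer and significantly improves imputation performance.

Details Motivation: The sparsity and noise of microbiome data seriously hamper analysis and downstream tasks, and existing methods struggle to capture complex dependencies between microbes and contextual metadata.

Contribution: 1. Proposes the DepMicroDiff framework, combining a diffusion model with a dependency-aware Transformer; 2. uses VAE pretraining and multimodal metadata (encoded by an LLM) to improve performance; 3. outperforms existing methods on TCGA datasets.

Method: 1. A diffusion generative model performs the imputation; 2. a dependency-aware Transformer captures pairwise dependencies and autoregressive relationships between microbes; 3. VAE pretraining is combined with LLM-encoded metadata.

Result: Across multiple cancer types, Pearson correlation and cosine similarity improve significantly (up to 0.712 and 0.812), while RMSE and MAE decrease.

Insight: 1. Combining dependency-aware modeling with multimodal metadata is crucial for microbiome imputation; 2. diffusion models are well suited to modeling complex biological data.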

Abstract: Microbiome data analysis is essential for understanding host health and disease, yet its inherent sparsity and noise pose major challenges for accurate imputation, hindering downstream tasks such as biomarker discovery. Existing imputation methods, including recent diffusion-based models, often fail to capture the complex interdependencies between microbial taxa and overlook contextual metadata that can inform imputation. We introduce DepMicroDiff, a novel framework that combines diffusion-based generative modeling with a Dependency-Aware Transformer (DAT) to explicitly capture both mutual pairwise dependencies and autoregressive relationships. DepMicroDiff is further enhanced by VAE-based pretraining across diverse cancer datasets and conditioning on patient metadata encoded via a large language model (LLM). Experiments on TCGA microbiome datasets show that DepMicroDiff substantially outperforms state-of-the-art baselines, achieving higher Pearson correlation (up to 0.712), cosine similarity (up to 0.812), and lower RMSE and MAE across multiple cancer types, demonstrating its robustness and generalizability for microbiome imputation.
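
The four metrics reported in the abstract (Pearson correlation, cosine similarity, RMSE, MAE) have standard definitions, sketched below in plain Python for two equal-length vectors. The function names and the toy `truth` and `imputed` vectors are illustrative assumptions, not the paper's evaluation code or data. The sketch does not handle constant or all-zero vectors, for which Pearson correlation and cosine similarity are undefined.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cosine(x, y):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def rmse(x, y):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def mae(x, y):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Toy abundance vector and an imputation that is off by 1.0 in the last entry.
truth = [1.0, 2.0, 3.0, 4.0]
imputed = [1.0, 2.0, 3.0, 5.0]

print(rmse(truth, imputed))  # 0.5
print(mae(truth, imputed))   # 0.25
print(pearson(truth, imputed), cosine(truth, imputed))
```

Higher is better for the first two metrics and lower is better for the last two, which is why the abstract reports them in opposite directions.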

[107] Consensus-Driven Active Model Selection

Justin Kay,Grant Van Horn,Subhransu Maji,Daniel Sheldon,Sara Beery

Main category: cs.LG

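TL;DR: CODA is an active model selection method that uses the predictions of candidate models to prioritize labeling the data points that most efficiently identify the best model, significantly reducing annotation effort.

Details Motivation: Conventional model selection requires a large annotated validation set, which is slow and expensive to build. CODA aims to streamline this process by actively selecting the key data points.

Contribution: Proposes CODA, a consensus-driven active model selection framework that uses agreement and disagreement between models to guide labeling and Bayesian inference to update beliefs about model performance.

Method: CODA models the relationships between classifiers, categories, and data points in a probabilistic framework and uses consensus and disagreement among models to set labeling priorities.

Result: On 26 benchmark tasks, CODA significantly outperforms existing methods and reduces the annotation effort needed to find the best model by more than 70%.

Insight: Consensus and disagreement between models can guide labeling efficiently and greatly improve the efficiency of model selection.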

Abstract: The widespread availability of off-the-shelf machine learning models poses a challenge: which model, of the many available candidates, should be chosen for a given data analysis task? This question of model selection is traditionally answered by collecting and annotating a validation dataset – a costly and time-intensive process. We propose a method for active model selection, using predictions from candidate models to prioritize the labeling of test data points that efficiently differentiate the best candidate. Our method, CODA, performs consensus-driven active model selection by modeling relationships between classifiers, categories, and data points within a probabilistic framework. The framework uses the consensus and disagreement between models in the candidate pool to guide the label acquisition process, and Bayesian inference to update beliefs about which model is best as more information is collected. We validate our approach by curating a collection of 26 benchmark tasks capturing a range of model selection scenarios. CODA outperforms existing methods for active model selection significantly, reducing the annotation effort required to discover the best model by upwards of 70% compared to the previous state-of-the-art. Code and data are available at https://github.com/justinkay/coda.
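
The idea of using disagreement among candidate models to decide which point to label next can be sketched with a simple vote-entropy heuristic. This is a deliberately simplified stand-in, not CODA itself: the abstract describes a probabilistic framework with Bayesian updates, which this sketch does not implement. The functions `vote_entropy` and `most_contested` and the toy predictions are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the class votes cast by the candidate models for one point."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def most_contested(predictions, labeled):
    """Return the unlabeled point index on which the models disagree most.

    predictions[m][i] is model m's predicted class for data point i.
    labeled is the set of point indices that already have a label.
    """
    n_points = len(predictions[0])
    candidates = [i for i in range(n_points) if i not in labeled]
    return max(candidates, key=lambda i: vote_entropy([p[i] for p in predictions]))


# Three candidate models, three data points.
predictions = [
    ["cat", "dog", "cat"],   # model 0
    ["cat", "cat", "cat"],   # model 1
    ["cat", "bird", "cat"],  # model 2
]

# All models agree on points 0 and 2, so labeling point 1 is the most informative.
print(most_contested(predictions, labeled=set()))  # 1
```

Points on which every model agrees cannot separate the candidates whatever their true label is, which is the intuition behind spending annotation effort on contested points first.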

eess.AS [Back]

[108] MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

Yadong Niu,Tianzi Wang,Heinrich Dinkel,Xingwei Sun,Jiahao Zhou,Gang Li,Jizhong Liu,Xunying Liu,Junbo Zhang,Jian Luan

Main category: eess.AS

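TL;DR: MECAT is the first multi-expert constructed benchmark for fine-grained audio understanding. It combines expert model analysis with Chain-of-Thought LLM reasoning to generate multi-perspective captions and question-answering pairs, and it introduces a new evaluation metric, DATE.

Details Motivation: Existing audio understanding models fall well short of human comprehension, mainly because the data annotations and evaluation metrics of current benchmarks cannot distinguish generic outputs from detailed descriptions.

Contribution: 1) Proposes the MECAT benchmark, with fine-grained annotations produced by expert models and a Chain-of-Thought LLM; 2) proposes the DATE metric, which combines semantic similarity with cross-sample discriminability; 3) offers new insights into the capabilities and limitations of current audio models.

Method: Expert model analysis is combined with LLM reasoning to generate multi-perspective annotations (captions and question-answering pairs), and the DATE metric is designed to evaluate model outputs.

Result: The MECAT benchmark provides a comprehensive evaluation of SOTA audio models and reveals their shortcomings on fine-grained tasks.

Insight: Current audio models still fall short in capturing and discriminating detail; finer-grained annotation and evaluation are needed to drive progress.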

Abstract: While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at https://github.com/xiaomi-research/mecat

cs.AI [Back]

[109] DSBC : Data Science task Benchmarking with Context engineering

Ram Mohan Rao Kadiyala,Siddhant Gupta,Jebish Purbey,Giulio Martini,Suman Debnath,Hamza Farooq

Main category: cs.AI

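TL;DR: The paper introduces DSBC, a benchmark for data science tasks designed around real user interactions. It evaluates three large language models under different approaches and highlights the factors that matter for practical deployment.

Details Motivation: Evaluation of data science agents is currently unsystematic, and their effectiveness in real applications remains unclear.

Contribution: 1) Designs a benchmark that reflects real user interactions; 2) evaluates three LLMs under different approaches; 3) analyzes model sensitivity to prompting issues and to the temperature parameter.

Method: Three LLMs are evaluated on eight categories of data science tasks under three approaches: zero-shot with context engineering, multi-step with context engineering, and SmolAgent.

Result: Performance differs markedly between models and approaches, revealing key factors that affect practical deployment.

Insight: Context engineering and the temperature parameter have a significant effect on model performance; future work should attend to these factors to make data science agents more robust and effective.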

Abstract: Recent advances in large language models (LLMs) have significantly impacted data science workflows, giving rise to specialized data science agents designed to automate analytical tasks. Despite rapid adoption, systematic benchmarks evaluating the efficacy and limitations of these agents remain scarce. In this paper, we introduce a comprehensive benchmark specifically crafted to reflect real-world user interactions with data science agents by observing usage of our commercial applications. We evaluate three LLMs: Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini across three approaches: zero-shot with context engineering, multi-step with context engineering, and with SmolAgent. Our benchmark assesses performance across a diverse set of eight data science task categories, additionally exploring the sensitivity of models to common prompting issues, such as data leakage and slightly ambiguous instructions. We further investigate the influence of temperature parameters on overall and task-specific outcomes for each model and approach. Our findings reveal distinct performance disparities among the evaluated models and methodologies, highlighting critical factors that affect practical deployment. The benchmark dataset and evaluation framework introduced herein aim to provide a foundation for future research of more robust and effective data science agents.

[110] TextQuests: How Good are LLMs at Text-Based Video Games?

Long Phan,Mantas Mazeika,Andy Zou,Dan Hendrycks

Main category: cs.AI

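TL;DR: The paper introduces TextQuests, a benchmark based on Infocom interactive fiction games for evaluating the long-context reasoning of AI agents in environments they must explore autonomously.

Details Motivation: Existing agent benchmarks do not adequately assess how agents perform in exploratory environments that require sustained, self-directed reasoning, so a new evaluation tool is needed.

Contribution: Proposes the TextQuests benchmark, which focuses on an LLM agent's capacity for autonomous problem-solving and long-context reasoning without external tools.

Method: The benchmark is built on Infocom interactive fiction games. Agents must keep solving problems over long sessions and are not allowed to use external tools.

Result: TextQuests provides a new way to evaluate agents in complex exploratory environments and supports the development of agents with stronger reasoning abilities.

Insight: Interactive fiction games are an effective tool for evaluating the long-horizon autonomous reasoning of AI agents and underline the importance of long-context reasoning.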

Abstract: Evaluating AI agents within complex, interactive environments that mirror real-world challenges is critical for understanding their practical capabilities. While existing agent benchmarks effectively assess skills like tool use or performance on structured tasks, they often do not fully capture an agent’s ability to operate autonomously in exploratory environments that demand sustained, self-directed reasoning over a long and growing context. To spur the development of agents capable of more robust intrinsic reasoning over long horizons, we introduce TextQuests, a benchmark based on the Infocom suite of interactive fiction games. These text-based adventures, which can take human players over 30 hours and require hundreds of precise actions to solve, serve as an effective proxy for evaluating AI agents on focused, stateful tasks. The benchmark is specifically designed to assess an LLM agent’s capacity for self-contained problem-solving by precluding the use of external tools, thereby focusing on intrinsic long-context reasoning capabilities in an exploratory environment characterized by the need for trial-and-error learning and sustained problem-solving within a single interactive session. We release TextQuests at https://textquests.ai.

[111] Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

Luoxin Chen,Jinming Gu,Liankai Huang,Wenhao Huang,Zhicheng Jiang,Allan Jie,Xiaoran Jin,Xing Jin,Chenggang Li,Kaijing Ma,Cheng Ren,Jiawei Shen,Wenlei Shi,Tong Sun,He Sun,Jiahui Wang,Siran Wang,Zhihong Wang,Chenrui Wei,Shufa Wei,Yonghui Wu,Yuchen Wu,Yihang Xia,Huajian Xin,Fan Yang,Huaiyuan Ying,Hongyi Yuan,Zheng Yuan,Tianyang Zhan,Chi Zhang,Yue Zhang,Ge Zhang,Tianyun Zhao,Jianqiu Zhao,Yichi Zhou,Thomas Hanwen Zhu

Main category: cs.AI

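TL;DR: Seed-Prover is an automated theorem-proving model built on Lean formal verification. By iteratively refining proofs and adding the geometry engine Seed-Geometry, it performs strongly on IMO problems and substantially advances automated mathematical reasoning.

Details Motivation: Existing large language models (LLMs) perform well at mathematical reasoning but struggle with theorem proving because natural language alone lacks clear supervision signals. Seed-Prover aims to address this through formal verification and long chain-of-thought reasoning.

Contribution: 1. Proposes Seed-Prover, a model that iteratively refines proofs using Lean formal verification. 2. Designs three inference strategies that enable both deep and broad reasoning. 3. Introduces the Seed-Geometry engine to address the lack of geometry support in Lean.

Method: Seed-Prover iteratively refines its proofs using Lean feedback, proved lemmas, and self-summarization. Three test-time inference strategies target IMO-level problems. Seed-Geometry serves as a complementary engine for geometry reasoning.

Result: Seed-Prover proves 78.1% of formalized past IMO problems, far surpassing the previous state of the art, and fully proves 5 of 6 problems at IMO 2025.

Insight: Combining formal verification with long chain-of-thought reasoning substantially improves automated theorem proving, and a geometry engine extends the system's reach.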

Abstract: LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose Seed-Prover, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine Seed-Geometry, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.

[112] CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

Ping Yu,Jack Lanchantin,Tianlu Wang,Weizhe Yuan,Olga Golovneva,Ilia Kulikov,Sainbayar Sukhbaatar,Jason Weston,Jing Xu

Main category: cs.AI

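TL;DR: CoT-Self-Instruct is a synthetic data generation method that uses Chain-of-Thought (CoT) to guide LLMs toward high-quality, complex prompts, significantly improving performance on both reasoning and non-reasoning tasks.

Details Motivation: Existing synthetic data generation methods yield insufficient quality for reasoning and non-reasoning tasks, so a new method that automatically generates high-quality prompts is needed.

Contribution: Proposes CoT-Self-Instruct, which combines CoT with automatic filtering and significantly improves the performance of synthetic data across a range of tasks.

Method: LLMs first reason and plan based on seed tasks, then generate new prompts; automatic metrics filter for high-quality data.

Result: The method significantly outperforms existing approaches on reasoning benchmarks such as MATH500 and non-reasoning benchmarks such as AlpacaEval 2.0.

Insight: CoT-based automatic prompt generation combined with filtering is an effective way to raise the quality of LLM training data.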

Abstract: We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on the given seed tasks, and then to generate a new synthetic prompt of similar quality and complexity for use in LLM training, followed by filtering for high-quality data with automatic metrics. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning, across MATH500, AMC23, AIME24 and GPQA-Diamond. For non-verifiable instruction-following tasks, our method surpasses the performance of human or standard self-instruct prompts on both AlpacaEval 2.0 and Arena-Hard.

[113] SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model

Mingkai Deng,Jinyu Hou,Yilin Shen,Hongxia Jin,Graham Neubig,Zhiting Hu,Eric Xing

Main category: cs.AI

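TL;DR: SimuRA is a general goal-oriented agent architecture built on an LLM-based world model. It overcomes the limitations of autoregressive models through simulative reasoning, and experiments show significant gains on complex tasks.

Details Motivation: Existing LLM-based agents are typically designed for a single task and lack generality and scalability, whereas humans act as general agents by mentally simulating outcomes. SimuRA is proposed to close this gap.

Contribution: 1. Proposes the SimuRA architecture, which implements a general world model using an LLM; 2. Shows that simulative reasoning has a clear advantage over autoregressive reasoning on complex tasks such as web browsing.

Method: SimuRA uses an LLM to build a world model that supports flexible planning. It reasons toward goals by simulating the outcomes of actions and plans.

Result: On flight search, the success rate rises from 0% to 32.2%; world-model-based planning outperforms autoregressive planning by up to 124%.

Insight: The simulative reasoning paradigm (world models) is a promising direction for general agents and may enable training a single general agent model.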

Abstract: AI agents built on large language models (LLMs) hold enormous promise, but current practice focuses on a one-task-one-agent approach, which not only falls short of scalability and generality, but also suffers from the fundamental limitations of autoregressive LLMs. On the other hand, humans are general agents who reason by mentally simulating the outcomes of their actions and plans. Moving towards a more general and powerful AI agent, we introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning. Based on a principled formulation of an optimal agent in any environment, SimuRA overcomes the limitations of autoregressive reasoning by introducing a world model for planning via simulation. The generalized world model is implemented using an LLM, which can flexibly plan in a wide range of environments using the concept-rich latent space of natural language. Experiments on difficult web browsing tasks show that SimuRA improves the success of flight search from 0% to 32.2%. World-model-based planning, in particular, shows a consistent advantage of up to 124% over autoregressive planning, demonstrating the advantage of world model simulation as a reasoning paradigm. We are excited about the possibility of training a single, general agent model based on LLMs that can act superintelligently in all environments. To start, we make a SimuRA-based web-browsing agent, built with pretrained LLMs, available as a research demo for public testing.
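
The idea of planning by simulating outcomes with a world model, rather than committing to the next action greedily, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: in SimuRA the world model is an LLM operating over natural-language states, whereas below it is a hand-written function over integers, and the names `plan_by_simulation`, `toy_world_model`, and `toy_score` are hypothetical.

```python
from itertools import product


def plan_by_simulation(state, actions, world_model, score, horizon=2):
    """Pick the first action of the best simulated plan.

    Every action sequence of length `horizon` is rolled out through
    `world_model` (a stand-in for a learned or LLM-based simulator),
    and the resulting state is rated by `score`.
    """
    best_plan, best_value = None, float("-inf")
    for plan in product(actions, repeat=horizon):
        simulated = state
        for action in plan:
            simulated = world_model(simulated, action)  # imagined, not executed
        value = score(simulated)
        if value > best_value:
            best_plan, best_value = plan, value
    return best_plan[0]


# Toy environment (hypothetical): the state is an integer, the goal is 10.
def toy_world_model(state, action):
    return {"inc": state + 1, "dec": state - 1, "double": state * 2}[action]


def toy_score(state, goal=10):
    return -abs(goal - state)


first_action = plan_by_simulation(4, ["inc", "dec", "double"], toy_world_model, toy_score)
```

From state 4, a one-step greedy choice would prefer `double` (reaching 8), while the two-step simulation finds that `inc` followed by `double` reaches the goal exactly, so the planner returns `inc`.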

cs.IR [Back]

[114] Holistic Evaluations of Topic Models

Thomas Compton

Main category: cs.IR

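TL;DR: This article evaluates topic models from a database perspective, analyzing 1140 BERTopic model runs to examine trade-offs in parameter optimization and their implications for the interpretation and responsible use of topic models.

Details Motivation: Topic models attract commercial and academic interest for their ability to summarize large volumes of unstructured text, but they risk becoming a ‘black box’ whose output users accept without scrutiny. The article aims to expose the trade-offs in parameter optimization to help users apply topic models more responsibly.

Contribution: Through a large-scale experiment (1140 BERTopic model runs), identifies key trade-offs in parameter optimization and offers a deeper understanding of how topic models should be interpreted and used.

Method: A database-perspective evaluation that analyzes how parameter changes affect results across many BERTopic runs and summarizes the patterns statistically.

Result: Parameter settings significantly affect topic model output; users must trade off model performance against interpretability.

Insight: Topic models should be used together with domain knowledge and user needs rather than by relying blindly on algorithmic output; parameter optimization is not only a technical matter but also bears on interpretability and practical usefulness.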

Abstract: Topic models are gaining increasing commercial and academic interest for their ability to summarize large volumes of unstructured text. As unsupervised machine learning methods, they enable researchers to explore data and help general users understand key themes in large text collections. However, they risk becoming a ‘black box’, where users input data and accept the output as an accurate summary without scrutiny. This article evaluates topic models from a database perspective, drawing insights from 1140 BERTopic model runs. The goal is to identify trade-offs in optimizing model parameters and to reflect on what these findings mean for the interpretation and responsible use of topic models.

cs.CR [Back]

[115] Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

Lijia Liu,Takumi Kondo,Kyohei Atarashi,Koh Takeuchi,Jiyi Li,Shigeru Saito,Hisashi Kashima

Main category: cs.CR

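TL;DR: The paper proposes a framework combining Standard Evaluation (SE) with Counterfactual Evaluation (CFE) to detect blind prompt-injection attacks on LLM-based evaluation systems. Experiments show that it significantly improves security with minimal performance loss.

Details Motivation: LLM-based evaluation systems are vulnerable to prompt injection, especially blind attacks, in which a candidate answer is crafted independently of the true answer to deceive the evaluator. Existing methods struggle to detect such attacks, so more effective defenses are needed.

Contribution: 1. Formalizes the blind-attack threat facing LLM-based evaluation systems; 2. Proposes a framework combining SE and CFE, using counterfactual evaluation to strengthen detection; 3. Validates the method's advantages in attack detection and performance experimentally.

Method: Combines SE with CFE, where CFE re-evaluates the submission against a deliberately false ground-truth answer. If an answer is accepted under both SE and CFE, an attack alert is raised.

Result: Standard evaluation is highly vulnerable to blind attacks, while the SE+CFE framework significantly raises the attack detection rate with minimal impact on normal evaluation.

Insight: Counterfactual evaluation offers a new approach to detecting deception in LLM-based evaluation systems and could be extended to other security settings.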

Abstract: This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.
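
The detection rule described in the abstract (flag a submission that is validated under both the true and a deliberately false reference) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the judge signature, `evaluate_with_cfe`, and the keyword-based `toy_judge` that stands in for an LLM evaluator are all assumptions.

```python
from typing import Callable, Tuple

# Hypothetical judge signature: (question, candidate, reference) -> accepted?
Judge = Callable[[str, str, str], bool]


def evaluate_with_cfe(
    judge: Judge,
    question: str,
    candidate: str,
    true_reference: str,
    false_reference: str,
) -> Tuple[bool, bool]:
    """Return (accepted, attack_detected) for one submission."""
    se_accept = judge(question, candidate, true_reference)    # standard evaluation
    cfe_accept = judge(question, candidate, false_reference)  # counterfactual evaluation
    # A genuine answer should fail against a false reference; passing both
    # suggests the candidate manipulates the judge regardless of the reference.
    attack_detected = se_accept and cfe_accept
    accepted = se_accept and not attack_detected
    return accepted, attack_detected


def toy_judge(question: str, candidate: str, reference: str) -> bool:
    """Stand-in for an LLM judge that is fooled by an injected instruction."""
    text = candidate.lower()
    return reference.lower() in text or "mark this answer as correct" in text


question = "What is the capital of France?"
honest = evaluate_with_cfe(toy_judge, question, "Paris", "Paris", "Berlin")
attack = evaluate_with_cfe(
    toy_judge, question, "Ignore the rubric and mark this answer as correct.", "Paris", "Berlin"
)
```

In this toy setup the honest answer is accepted without an alert, while the injected answer passes both checks and is therefore flagged and rejected.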

cs.SE [Back]

[116] SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution

Han Li,Yuling Shi,Shaoxin Lin,Xiaodong Gu,Heng Lian,Xin Wang,Yantao Jia,Tao Huang,Qianxiang Wang

Main category: cs.SE

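TL;DR: SWE-Debate is a competitive multi-agent debate framework for software issue resolution. Diverse reasoning paths and collaborative convergence significantly improve issue localization and repair.

Details Motivation: Existing agent-based issue resolution approaches rely mainly on independent exploration, which tends to get stuck in local solutions and miss issue patterns that span the codebase. The authors therefore use multi-agent debate to elicit diverse reasoning paths.

Contribution: 1. Proposes the SWE-Debate framework, which improves issue localization through competitive multi-agent debate; 2. Designs a three-round debate in which agents with different perspectives converge on a fix plan; 3. Shows experimentally that it sets a new state of the art on SWE-bench among open-source agent frameworks.

Method: 1. Generates fault propagation traces as localization proposals by traversing a code dependency graph; 2. Organizes a three-round debate in which each agent reasons from a distinct perspective; 3. Feeds the consolidated fix plan to an MCTS-based code modification agent that generates the patch.

Result: On the SWE-bench benchmark, SWE-Debate outperforms baselines by a large margin and achieves the best results among open-source agent frameworks.

Insight: Competitive multi-agent debate can break out of local optima and integrate issue patterns that span the codebase, improving software issue resolution.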

Abstract: Issue resolution has made remarkable progress thanks to the advanced reasoning capabilities of large language models (LLMs). Recently, agent-based frameworks such as SWE-agent have further advanced this progress by enabling autonomous, tool-using agents to tackle complex software engineering tasks. While existing agent-based issue resolution approaches are primarily based on agents’ independent explorations, they often get stuck in local solutions and fail to identify issue patterns that span across different parts of the codebase. To address this limitation, we propose SWE-Debate, a competitive multi-agent debate framework that encourages diverse reasoning paths and achieves more consolidated issue localization. SWE-Debate first creates multiple fault propagation traces as localization proposals by traversing a code dependency graph. Then, it organizes a three-round debate among specialized agents, each embodying distinct reasoning perspectives along the fault propagation trace. This structured competition enables agents to collaboratively converge on a consolidated fix plan. Finally, this consolidated fix plan is integrated into an MCTS-based code modification agent for patch generation. Experiments on the SWE-bench benchmark show that SWE-Debate achieves new state-of-the-art results in open-source agent frameworks and outperforms baselines by a large margin.

[117] SWE-Exp: Experience-Driven Software Issue Resolution

Silin Chen,Shaoxin Lin,Xiaodong Gu,Yuling Shi,Heng Lian,Longfei Yun,Dong Chen,Weiguo Sun,Lin Cao,Qianxiang Wang

Main category: cs.SE

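TL;DR: SWE-Exp is an experience-driven approach to software issue resolution that records and reuses prior repair experience, avoiding redundant exploration and enabling continuous learning.

Details Motivation: Current LLM agents for software issue resolution are memoryless: they cannot reuse prior repair experience, which leads to redundant exploration and inefficiency.

Contribution: Proposes SWE-Exp, which uses a multi-faceted experience bank covering both successful and failed repair attempts to enable continuous learning and knowledge transfer.

Method: Introduces a multi-faceted experience bank that extracts reusable knowledge at multiple levels, from high-level problem comprehension to specific code changes, and combines it with MCTS.

Result: Achieves a 41.6% Pass@1 resolution rate on SWE-bench-Verified, the leading result among open-source agent frameworks.

Insight: By systematically accumulating and reusing repair experience, software engineering agents can move from trial-and-error exploration to strategic issue resolution.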

Abstract: Recent advances in large language model (LLM) agents have shown remarkable progress in software issue resolution, leveraging advanced techniques such as multi-agent collaboration and Monte Carlo Tree Search (MCTS). However, current agents act as memoryless explorers, treating each problem separately without retaining or reusing knowledge from previous repair experiences. This leads to redundant exploration of failed trajectories and missed chances to adapt successful issue resolution methods to similar problems. To address this problem, we introduce SWE-Exp, an experience-enhanced approach that distills concise and actionable experience from prior agent trajectories, enabling continuous learning across issues. Our method introduces a multi-faceted experience bank that captures both successful and failed repair attempts. Specifically, it extracts reusable issue resolution knowledge at different levels, from high-level problem comprehension to specific code changes. Experiments show that SWE-Exp achieves state-of-the-art resolution rate (41.6% Pass@1) on SWE-bench-Verified under open-source agent frameworks. Our approach establishes a new paradigm in which automated software engineering agents systematically accumulate and leverage repair expertise, fundamentally shifting from trial-and-error exploration to strategic, experience-driven issue resolution.

eess.IV [Back]

[118] Rethink Domain Generalization in Heterogeneous Sequence MRI Segmentation

Zheyuan Zhang,Linkai Peng,Wanying Dou,Cuiling Sun,Halil Ertugrul Aktas,Andrea M. Bejar,Elif Keles,Gorkem Durak,Ulas Bagci

Main category: eess.IV

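TL;DR: The paper presents PancreasDG, a large-scale multi-center 3D MRI pancreas segmentation dataset for studying domain generalization in medical imaging. It addresses the cross-sequence variability that existing benchmarks overlook and proposes a semi-supervised method that significantly improves performance.

Details Motivation: Existing domain-generalization benchmarks focus mainly on cross-center shifts and overlook the pronounced differences between T1 and T2 MRI sequences. Pancreas segmentation is challenging and clinically important in abdominal imaging, yet existing methods segment the organ poorly.

Contribution: 1. Presents the PancreasDG dataset of 563 MRI scans from six institutions, supporting the study of both cross-center and cross-sequence variation. 2. Shows that limited sampling can be mistaken for distribution shift and that cross-sequence shifts require specialized solutions. 3. Proposes a semi-supervised method that significantly outperforms existing domain generalization techniques.

Method: A semi-supervised approach that leverages anatomical invariances. High-quality pancreas masks are created with a double-blind, two-pass protocol, and the method is validated on cross-sequence and cross-center segmentation tasks.

Result: The method significantly outperforms existing techniques on cross-sequence segmentation, with a 61.63% Dice score improvement and 87.00% Dice on two test centers.

Insight: 1. Limited sampling introduces significant variance. 2. Cross-sequence shifts are harder than cross-center shifts and need specialized solutions. 3. Anatomical invariances are an effective route to domain generalization.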

Abstract: Clinical magnetic-resonance (MR) protocols generate many T1 and T2 sequences whose appearance differs more than the acquisition sites that produce them. Existing domain-generalization benchmarks focus almost exclusively on cross-center shifts and overlook this dominant source of variability. Pancreas segmentation remains a major challenge in abdominal imaging: the gland is small, irregular, surrounded by organs and fat, and often suffers from low T1 contrast. State-of-the-art deep networks that already achieve >90% Dice on the liver or kidneys still miss 20-30% of the pancreas. The organ is also systematically under-represented in public cross-domain benchmarks, despite its clinical importance in early cancer detection, surgery, and diabetes research. To close this gap, we present PancreasDG, a large-scale multi-center 3D MRI pancreas segmentation dataset for investigating domain generalization in medical imaging. The dataset comprises 563 MRI scans from six institutions, spanning both venous phase and out-of-phase sequences, enabling study of both cross-center and cross-sequence variations with pixel-accurate pancreas masks created by a double-blind, two-pass protocol. Through comprehensive analysis, we reveal three insights: (i) limited sampling introduces significant variance that may be mistaken for distribution shifts, (ii) cross-center performance correlates with source domain performance for identical sequences, and (iii) cross-sequence shifts require specialized solutions. We also propose a semi-supervised approach that leverages anatomical invariances, significantly outperforming state-of-the-art domain generalization techniques with 61.63% Dice score improvements and 87.00% on two test centers for cross-sequence segmentation. PancreasDG sets a new benchmark for domain generalization in medical imaging. Dataset, code, and models will be available at https://pancreasdg.netlify.app.
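
For readers unfamiliar with the Dice figures quoted above, the score is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between a predicted mask A and a reference mask B. The sketch below is a minimal illustration assuming binary masks flattened into equal-length sequences; the function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Two of the predicted voxels are foreground, one of which overlaps the reference.
example = dice_score([1, 1, 0, 0], [1, 0, 0, 0])  # 2 * 1 / (2 + 1)
```

A Dice of 0.87 therefore means that twice the overlapping volume equals 87% of the summed volumes of the prediction and the reference.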

[119] Learning Arbitrary-Scale RAW Image Downscaling with Wavelet-based Recurrent Reconstruction

Yang Ren,Hai Jiang,Wei Li,Menglong Yang,Heng Zhang,Zehua Sheng,Qingsheng Ye,Shuaicheng Liu

Main category: eess.IV

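TL;DR: The paper proposes a wavelet-based recurrent reconstruction framework for arbitrary-scale RAW image downscaling. Low-frequency and high-frequency modules preserve structural and textural integrity, and with a new dataset and loss function the method significantly outperforms existing approaches.

Details Motivation: Existing image downscaling methods mainly target the sRGB domain. RAW images offer greater flexibility because they retain unprocessed sensor information, but they lack a dedicated downscaling framework. The work aims to fill this gap.

Contribution: 1. Proposes an arbitrary-scale RAW image downscaling framework; 2. Designs low-frequency and high-frequency modules to preserve image quality; 3. Introduces the new Real-NIRD dataset.

Method: A wavelet-based recurrent reconstruction framework that combines the Low-Frequency Arbitrary-Scale Downscaling Module (LASDM) and the High-Frequency Prediction Module (HFPM), with an energy-maximization loss to align high-frequency energy.

Result: Experiments show the method outperforms existing techniques both quantitatively and visually.

Insight: The lossless property of the wavelet transform allows more flexible, high-quality downscaling, with broad potential in RAW image processing.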

Abstract: Image downscaling is critical for efficient storage and transmission of high-resolution (HR) images. Existing learning-based methods focus on performing downscaling within the sRGB domain, which typically suffers from blurred details and unexpected artifacts. RAW images, with their unprocessed photonic information, offer greater flexibility but lack specialized downscaling frameworks. In this paper, we propose a wavelet-based recurrent reconstruction framework that leverages the information-lossless attribute of wavelet transformation to perform arbitrary-scale RAW image downscaling in a coarse-to-fine manner, in which the Low-Frequency Arbitrary-Scale Downscaling Module (LASDM) and the High-Frequency Prediction Module (HFPM) are proposed to preserve structural and textural integrity of the reconstructed low-resolution (LR) RAW images, alongside an energy-maximization loss to align high-frequency energy between the HR and LR domains. Furthermore, we introduce the Realistic Non-Integer RAW Downscaling (Real-NIRD) dataset, featuring a non-integer downscaling factor of 1.3×, and combine it with publicly available datasets with integer factors (2×, 3×, 4×) for comprehensive benchmarking of arbitrary-scale image downscaling. Extensive experiments demonstrate that our method outperforms existing state-of-the-art competitors both quantitatively and visually. The code and dataset will be released at https://github.com/RenYangSCU/ASRD.

[120] EMedNeXt: An Enhanced Brain Tumor Segmentation Framework for Sub-Saharan Africa using MedNeXt V2 with Deep Supervision

Ahmed Jaheen,Abdelrahman Elsayed,Damir Kim,Daniil Tikhonov,Matheus Scatolin,Mohor Banerjee,Qiankun Ji,Mostafa Salem,Hu Wang,Sarim Hashmi,Mohammad Yaqub

Main category: eess.IV

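TL;DR: EMedNeXt is an enhanced brain tumor segmentation framework optimized for low-resource settings in sub-Saharan Africa. With a larger region of interest, an improved nnU-Net v2-based architectural skeleton, and a model ensembling system, it performs strongly on the hidden validation set.

Details Motivation: In sub-Saharan Africa, MRI scanners are of lower quality and radiology expertise is scarce, which makes brain tumor segmentation and quantification difficult. EMedNeXt aims to address these problems and improve segmentation performance.

Contribution: 1. A larger region of interest; 2. An improved nnU-Net v2-based architectural skeleton; 3. A robust model ensembling system.

Method: Builds on MedNeXt V2 with deep supervision and optimized post-processing pipelines, combined with an nnU-Net v2-based architecture, tailored to low-resource settings.

Result: On the hidden validation set, the average LesionWise DSC is 0.897, and the average LesionWise NSD is 0.541 and 0.84 at tolerances of 0.5 mm and 1.0 mm, respectively.

Insight: In low-resource regions, improved network architecture and model ensembling can significantly raise the accuracy and robustness of brain tumor segmentation.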

Abstract: Brain cancer affects millions worldwide, and in nearly every clinical setting, doctors rely on magnetic resonance imaging (MRI) to diagnose and monitor gliomas. However, the current standard for tumor quantification through manual segmentation of multi-parametric MRI is time-consuming, requires expert radiologists, and is often infeasible in under-resourced healthcare systems. This problem is especially pronounced in low-income regions, where MRI scanners are of lower quality and radiology expertise is scarce, leading to incorrect segmentation and quantification. In addition, the number of acquired MRI scans in Africa is typically small. To address these challenges, the BraTS-Lighthouse 2025 Challenge focuses on robust tumor segmentation in sub-Saharan Africa (SSA), where resource constraints and image quality degradation introduce significant shifts. In this study, we present EMedNeXt – an enhanced brain tumor segmentation framework based on MedNeXt V2 with deep supervision and optimized post-processing pipelines tailored for SSA. EMedNeXt introduces three key contributions: a larger region of interest, an improved nnU-Net v2-based architectural skeleton, and a robust model ensembling system. Evaluated on the hidden validation set, our solution achieved an average LesionWise DSC of 0.897 with an average LesionWise NSD of 0.541 and 0.84 at a tolerance of 0.5 mm and 1.0 mm, respectively.

[121] Pixel Embedding Method for Tubular Neurite Segmentation

Huayu Fu,Jiamin Li,Haozhi Qu,Xiaolin Hu,Zengcai Guo

Main category: eess.IV

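TL;DR: Proposes a pixel-embedding method for tubular neurite segmentation that combines a deep network with an end-to-end pipeline, significantly reducing the error rate of neuronal topology reconstruction, and introduces a new topological evaluation metric.

Details Motivation: The intricate morphology of neuronal branches and occlusions among fibers pose challenges for deep-learning-based segmentation, so more effective methods are needed to improve segmentation accuracy and reconstruction quality.

Contribution: 1) Pixel-level embedding vectors and a corresponding loss function; 2) An end-to-end pipeline that generates SWC-format neuron structure trees directly from images; 3) A new topological evaluation metric.

Method: A deep network outputs pixel embedding vectors, a loss function is designed to distinguish neuronal connections in occluded regions, and an end-to-end pipeline produces the neuron structure tree.

Result: On the fMOST imaging dataset, the method significantly reduces the error rate of neuronal topology reconstruction.

Insight: Pixel embeddings and the topological evaluation metric provide more precise tools for segmenting complex neural structures.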

Abstract: Automatic segmentation of neuronal topology is critical for handling large scale neuroimaging data, as it can greatly accelerate neuron annotation and analysis. However, the intricate morphology of neuronal branches and the occlusions among fibers pose significant challenges for deep learning based segmentation. To address these issues, we propose an improved framework: First, we introduce a deep network that outputs pixel level embedding vectors and design a corresponding loss function, enabling the learned features to effectively distinguish different neuronal connections within occluded regions. Second, building on this model, we develop an end to end pipeline that directly maps raw neuronal images to SWC formatted neuron structure trees. Finally, recognizing that existing evaluation metrics fail to fully capture segmentation accuracy, we propose a novel topological assessment metric to more appropriately quantify the quality of neuron segmentation and reconstruction. Experiments on our fMOST imaging dataset demonstrate that, compared to several classical methods, our approach significantly reduces the error rate in neuronal topology reconstruction.

[122] Smart Video Capsule Endoscopy: Raw Image-Based Localization for Enhanced GI Tract Investigation

Oliver Bause,Julia Werner,Paul Palomero Bernardo,Oliver Bringmann

Main category: eess.IV

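TL;DR: The paper presents a smart video capsule endoscopy system that works on raw Bayer images, using a lightweight CNN and Viterbi decoding for efficient classification and significantly reducing energy consumption.

Details Motivation: On resource-constrained edge devices such as video capsule endoscopes, conventional deep neural networks are unsuitable because of their large model size and the energy cost of RGB conversion, so a more efficient solution is needed.

Contribution: 1. Achieves 93.06% classification accuracy directly on Bayer images; 2. Proposes a lightweight CNN with only 63,000 parameters; 3. Adds time-series analysis via Viterbi decoding; 4. Demonstrates classification at 5.31 μJ per image on a customized SoC.

Method: 1. A lightweight CNN classifies Bayer images directly; 2. Viterbi decoding refines the time-series analysis; 3. A hardware accelerator is integrated on a PULPissimo SoC.

Result: The system saves an average of 89.9% of energy before entering the small intestine compared with classic video capsules, and classification requires just 5.31 μJ per image.

Insight: Skipping RGB conversion and using a lightweight model makes efficient AI feasible on edge devices, especially in low-power settings such as medical devices.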

Abstract: For many real-world applications involving low-power sensor edge devices, deep neural networks used for image classification might not be suitable. This is due to their typically large model size and requirement of operations often exceeding the capabilities of such resource-limited devices. Furthermore, camera sensors usually capture images with a Bayer color filter applied, which are subsequently converted to RGB images that are commonly used for neural network training. However, on resource-constrained devices, such conversions demand their share of energy and optimally should be skipped if possible. This work addresses the need for hardware-suitable AI targeting sensor edge devices by means of Video Capsule Endoscopy, an important medical procedure for the investigation of the small intestine, which is strongly limited by its battery lifetime. Accurate organ classification is performed with a final accuracy of 93.06% evaluated directly on Bayer images involving a CNN with only 63,000 parameters and time-series analysis in the form of Viterbi decoding. Finally, the process of capturing images with a camera and raw image processing is demonstrated with a customized PULPissimo System-on-Chip with a RISC-V core and an ultra-low power hardware accelerator providing an energy-efficient AI-based image classification approach requiring just 5.31 μJ per image. As a result, it is possible to save an average of 89.9% of energy before entering the small intestine compared to classic video capsules.
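
To illustrate how Viterbi decoding can turn noisy per-frame classifier outputs into a consistent organ sequence, here is a minimal sketch. It is not the paper's implementation: the three states (0 = stomach, 1 = small intestine, 2 = colon), the forward-only transition matrix, and the per-frame probabilities are hypothetical values chosen for illustration.

```python
import math


def viterbi(emission_probs, transition, initial):
    """Most likely state sequence given per-frame class probabilities.

    emission_probs[t][s]: classifier probability of state s at frame t.
    transition[p][s]:     probability of moving from state p to state s.
    initial[s]:           probability of starting in state s.
    """
    n_states = len(initial)

    def log(p):
        return math.log(max(p, 1e-12))  # floor avoids log(0)

    score = [log(initial[s]) + log(emission_probs[0][s]) for s in range(n_states)]
    backpointers = []
    for frame in emission_probs[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            prev = max(range(n_states), key=lambda p: score[p] + log(transition[p][s]))
            pointers.append(prev)
            new_score.append(score[prev] + log(transition[prev][s]) + log(frame[s]))
        score = new_score
        backpointers.append(pointers)

    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical setup: the capsule only moves forward through the GI tract.
transition = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]
initial = [1.0, 0.0, 0.0]
frames = [
    [0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.2, 0.7, 0.1],
    [0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.8, 0.1],
]
path = viterbi(frames, transition, initial)
```

Taking the per-frame argmax here would give the sequence 0, 0, 1, 0, 1, 1, which jumps backwards from small intestine to stomach. Because the transition matrix makes that jump essentially impossible, the decoded path is 0, 0, 0, 0, 1, 1.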

[123] JPEG Processing Neural Operator for Backward-Compatible Coding

Woo Kyoung Han,Yongjun Lee,Byeonghun Lee,Sang Hyun Park,Sunghoon Im,Kyong Hwan Jin

Main category: eess.IV

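TL;DR: JPNeO is a next-generation algorithm compatible with the current JPEG standard. It uses neural operators to improve chroma preservation and reconstruction quality while reducing memory use and parameter count.

Details Motivation: Learning-based compression algorithms are hard to standardize and lack backward compatibility; JPNeO aims to solve these problems.

Contribution: Proposes JPNeO, which combines JPEG compatibility with the performance of neural networks, and validates the existence of a space with high mutual information.

Method: Incorporates neural operators in both the encoding and decoding stages to improve chroma preservation and reconstruction quality.

Result: JPNeO improves compression efficiency and reconstruction quality while retaining compatibility.

Insight: Neural operators can be embedded in a traditional coding protocol without changing it, yielding performance gains.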

Abstract: Despite significant advances in learning-based lossy compression algorithms, standardizing codecs remains a critical challenge. In this paper, we present the JPEG Processing Neural Operator (JPNeO), a next-generation JPEG algorithm that maintains full backward compatibility with the current JPEG format. Our JPNeO improves chroma component preservation and enhances reconstruction fidelity compared to existing artifact removal methods by incorporating neural operators in both the encoding and decoding stages. JPNeO achieves practical benefits in terms of reduced memory usage and parameter count. We further validate our hypothesis about the existence of a space with high mutual information through empirical evidence. In summary, the JPNeO functions as a high-performance out-of-the-box image compression pipeline without changing source coding’s protocol. Our source code is available at https://github.com/WooKyoungHan/JPNeO.

[124] Towards Field-Ready AI-based Malaria Diagnosis: A Continual Learning Approach

Louise Guillon,Soheib Biga,Yendoube E. Kantchire,Mouhamadou Lamine Sane,Grégoire Pasquier,Kossi Yakpa,Stéphane E. Sossou,Marc Thellier,Laurent Bonnardot,Laurence Lachaud,Renaud Piarroux,Ameyo M. Dorkenoo

Main category: eess.IV

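TL;DR: The paper examines how continual learning (CL) can improve the cross-domain generalization of deep-learning-based computer-aided diagnosis (CAD) systems for malaria.

Details Motivation: Malaria remains a major global health challenge, particularly in low-resource settings where expert microscopy is hard to access. Existing deep learning CAD systems generalize poorly across domains, which limits clinical deployment.

Contribution: The main contribution is using continual learning to improve a YOLO-based object detector's adaptation to new acquisition sites while retaining performance on previously seen domains.

Method: Four CL strategies (two rehearsal-based and two regularization-based) are evaluated on a multi-site clinical dataset.

Result: The results suggest that continual learning, and rehearsal-based methods in particular, can significantly improve performance.

Insight: Continual learning has the potential to support the development of deployable malaria CAD tools.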

Abstract: Malaria remains a major global health challenge, particularly in low-resource settings where access to expert microscopy may be limited. Deep learning-based computer-aided diagnosis (CAD) systems have been developed and demonstrate promising performance on thin blood smear images. However, their clinical deployment may be hindered by limited generalization across sites with varying conditions. Yet very few practical solutions have been proposed. In this work, we investigate continual learning (CL) as a strategy to enhance the robustness of malaria CAD models to domain shifts. We frame the problem as a domain-incremental learning scenario, where a YOLO-based object detector must adapt to new acquisition sites while retaining performance on previously seen domains. We evaluate four CL strategies, two rehearsal-based and two regularization-based methods, on real-life conditions thanks to a multi-site clinical dataset of thin blood smear images. Our results suggest that CL, and rehearsal-based methods in particular, can significantly improve performance. These findings highlight the potential of continual learning to support the development of deployable, field-ready CAD tools for malaria.
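
The abstract does not say which rehearsal-based methods were evaluated, so the following is only a generic sketch of the rehearsal idea: keep a small memory of samples from earlier sites and mix them into training batches for a new site. The class name `RehearsalBuffer`, its methods, and the use of reservoir sampling are assumptions for illustration.

```python
import random


class RehearsalBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        """Offer one sample; each sample seen so far is kept with equal probability."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.samples[slot] = sample

    def mixed_batch(self, new_samples, replay_count):
        """Combine new-site samples with replayed samples from earlier sites."""
        k = min(replay_count, len(self.samples))
        return list(new_samples) + self.rng.sample(self.samples, k)


# Hypothetical usage: images from site A are stored, then replayed while
# the detector is fine-tuned on site B.
buffer = RehearsalBuffer(capacity=4)
for i in range(10):
    buffer.add(("site_A", i))
batch = buffer.mixed_batch([("site_B", 0), ("site_B", 1)], replay_count=2)
```

Each training batch for the new site then contains a few stored examples from earlier sites, which is what counteracts forgetting in rehearsal-based approaches.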

cs.RO [Back]

[125] H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation

Hongzhe Bi,Lingxuan Wu,Tianwei Lin,Hengkai Tan,Zhizhong Su,Hang Su,Jun Zhu

Main category: cs.RO

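TL;DR: H-RDT is a new approach that uses human manipulation data to enhance robot manipulation. With a diffusion transformer architecture and a two-stage training paradigm, it significantly outperforms existing methods in simulation and real-world experiments.

Details Motivation: Imitation learning suffers from a scarcity of large-scale, high-quality robot demonstration data, while the diversity of cross-embodiment robot datasets makes unified training difficult.

Contribution: Proposes H-RDT, which uses behavioral priors from human manipulation videos to improve robot policy learning and enables cross-embodiment fine-tuning through modular action encoders and decoders.

Method: Two-stage training: 1) pre-training on human manipulation data, 2) fine-tuning on robot data with modular action encoders and decoders. The model uses a diffusion transformer architecture and flow matching to model complex action distributions.

Result: Improves over training from scratch by 13.9% in simulation and 40.5% in real-world experiments, and also outperforms existing methods such as Pi0 and RDT.

Insight: Human manipulation data can serve as a powerful foundation for learning bimanual robotic manipulation policies.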

Abstract: Imitation learning for robotic manipulation faces a fundamental challenge: the scarcity of large-scale, high-quality robot demonstration data. Recent robotic foundation models often pre-train on cross-embodiment robot datasets to increase data scale, while they face significant limitations as the diverse morphologies and action spaces across different robot embodiments make unified training challenging. In this paper, we present H-RDT (Human to Robotics Diffusion Transformer), a novel approach that leverages human manipulation data to enhance robot manipulation capabilities. Our key insight is that large-scale egocentric human manipulation videos with paired 3D hand pose annotations provide rich behavioral priors that capture natural manipulation strategies and can benefit robotic policy learning. We introduce a two-stage training paradigm: (1) pre-training on large-scale egocentric human manipulation data, and (2) cross-embodiment fine-tuning on robot-specific data with modular action encoders and decoders. Built on a diffusion transformer architecture with 2B parameters, H-RDT uses flow matching to model complex action distributions. Extensive evaluations encompassing both simulation and real-world experiments, single-task and multitask scenarios, as well as few-shot learning and robustness assessments, demonstrate that H-RDT outperforms training from scratch and existing state-of-the-art methods, including Pi0 and RDT, achieving significant improvements of 13.9% and 40.5% over training from scratch in simulation and real-world experiments, respectively. The results validate our core hypothesis that human manipulation data can serve as a powerful foundation for learning bimanual robotic manipulation policies.

[126] A Unified Perception-Language-Action Framework for Adaptive Autonomous Driving

Yi Zhang,Erik Leo Haß,Kuo-Yi Chao,Nenad Petrovic,Yinglei Song,Chengdong Wu,Alois Knoll

Main category: cs.RO

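TL;DR: The paper proposes a unified Perception-Language-Action (PLA) framework that combines multi-sensor fusion with a large language model (GPT-4.1) to make autonomous driving systems more adaptive, robust, and interpretable.

Details Motivation: Current autonomous driving systems fall short in adaptability, robustness, and interpretability in complex open-world environments, and their fragmented architectures struggle with novel scenarios.

Contribution: Proposes the PLA framework, which combines multiple sensors with a language model, tightly couples perception with semantic understanding, and improves the system's context awareness.

Method: Multi-sensor fusion (cameras, LiDAR, radar) is integrated with a GPT-4.1-powered VLA architecture, unifying low-level perception with high-level semantic reasoning.

Result: In an urban intersection scenario with a construction zone, the framework performs well in trajectory tracking, speed prediction, and adaptive planning.

Insight: Language-augmented cognitive frameworks could advance the safety, interpretability, and scalability of autonomous driving systems.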

Abstract: Autonomous driving systems face significant challenges in achieving human-like adaptability, robustness, and interpretability in complex, open-world environments. These challenges stem from fragmented architectures, limited generalization to novel scenarios, and insufficient semantic extraction from perception. To address these limitations, we propose a unified Perception-Language-Action (PLA) framework that integrates multi-sensor fusion (cameras, LiDAR, radar) with a large language model (LLM)-augmented Vision-Language-Action (VLA) architecture, specifically a GPT-4.1-powered reasoning core. This framework unifies low-level sensory processing with high-level contextual reasoning, tightly coupling perception with natural language-based semantic understanding and decision-making to enable context-aware, explainable, and safety-bounded autonomous driving. Evaluations on an urban intersection scenario with a construction zone demonstrate superior performance in trajectory tracking, speed prediction, and adaptive planning. The results highlight the potential of language-augmented cognitive frameworks for advancing the safety, interpretability, and scalability of autonomous driving systems.

[127] User Experience Estimation in Human-Robot Interaction Via Multi-Instance Learning of Multimodal Social Signals

Ryo Miyoshi,Yuki Okafuji,Takuya Iwamoto,Junya Nakanishi,Jun Baba

Main category: cs.RO

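TL;DR: The paper proposes a user experience (UX) estimation method based on multimodal social signals. Using a Transformer model and a multi-instance learning framework over facial expressions and voice, it captures short- and long-term interaction patterns and outperforms human evaluators.

Details Motivation: As demand for social robots grows, they need to adapt their behavior to users' states. Existing UX assessment methods usually focus on sentiment or engagement individually and lack a multi-faceted assessment.

Contribution: 1. Constructs a UX dataset; 2. Proposes a method based on a Transformer and multi-instance learning that combines multimodal signals (face and voice); 3. Captures short- and long-term interaction patterns dynamically.

Method: 1. A Transformer model processes facial expressions and voice; 2. A multi-instance learning framework integrates short- and long-term interaction data; 3. UX is estimated dynamically.

Result: Experiments show the method outperforms third-party human evaluators at UX estimation.

Insight: Multimodal signals and multi-instance learning capture the temporal dynamics of user experience more fully, giving a more precise basis for adapting behavior in HRI.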

Abstract: In recent years, the demand for social robots has grown, requiring them to adapt their behaviors based on users’ states. Accurately assessing user experience (UX) in human-robot interaction (HRI) is crucial for achieving this adaptability. UX is a multi-faceted measure encompassing aspects such as sentiment and engagement, yet existing methods often focus on these individually. This study proposes a UX estimation method for HRI by leveraging multimodal social signals. We construct a UX dataset and develop a Transformer-based model that utilizes facial expressions and voice for estimation. Unlike conventional models that rely on momentary observations, our approach captures both short- and long-term interaction patterns using a multi-instance learning framework. This enables the model to capture temporal dynamics in UX, providing a more holistic representation. Experimental results demonstrate that our method outperforms third-party human evaluators in UX estimation.

cs.HC [Back]

[128] Hybrid EEG–Driven Brain–Computer Interface: A Large Language Model Framework for Personalized Language Rehabilitation

Ismail Hossain,Mridul Banik

Main category: cs.HC

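TL;DR: The paper proposes a personalized language rehabilitation framework that pairs an electroencephalography (EEG)-based BCI with a large language model (LLM). It combines the low fatigue of BCIs with the contextual generation ability of LLMs to support language rehabilitation for patients with severe speech or motor impairments.

Details Motivation: Conventional augmentative and alternative communication (AAC) systems and language-learning platforms struggle to adapt in real time to users' cognitive and linguistic needs, especially in neurological conditions such as post-stroke aphasia or amyotrophic lateral sclerosis.

Contribution: Proposes a novel hybrid framework that combines an EEG-driven BCI with an LLM to build a personalized language rehabilitation assistant.

Method: Real-time EEG signals drive the LLM, enabling navigation of language-learning modules, dynamic generation of personalized language content, and adjustment of task difficulty.

Result: The system is designed to let users carry out language learning via mental commands and to adjust difficulty dynamically based on neural markers of cognitive effort.

Insight: Combining EEG with LLMs offers a new, personalized approach to language rehabilitation, particularly for people with severe motor or speech impairments.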

Abstract: Conventional augmentative and alternative communication (AAC) systems and language-learning platforms often fail to adapt in real time to the user’s cognitive and linguistic needs, especially in neurological conditions such as post-stroke aphasia or amyotrophic lateral sclerosis. Recent advances in noninvasive electroencephalography (EEG)–based brain-computer interfaces (BCIs) and transformer–based large language models (LLMs) offer complementary strengths: BCIs capture users’ neural intent with low fatigue, while LLMs generate contextually tailored language content. We propose and evaluate a novel hybrid framework that leverages real-time EEG signals to drive an LLM-powered language rehabilitation assistant. This system aims to: (1) enable users with severe speech or motor impairments to navigate language-learning modules via mental commands; (2) dynamically personalize vocabulary, sentence-construction exercises, and corrective feedback; and (3) monitor neural markers of cognitive effort to adjust task difficulty on the fly.

[129] Voice-guided Orchestrated Intelligence for Clinical Evaluation (VOICE): A Voice AI Agent System for Prehospital Stroke Assessment

Julian Acosta,Scott Adams,Julius Kernbach,Romain Hardy,Sung Eun Kim,Luyang Luo,Xiaoman Zhang,Shreya Johri,Mohammed Baharoon,Pranav Rajpurkar

Main category: cs.HC

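TL;DR: The paper develops a voice-based AI system (VOICE) that guides non-experts through prehospital stroke assessment using natural conversation, with smartphone video capture of key examination components. In a simulation study it identified 84% of individual stroke signs in just over 6 minutes per evaluation, but it still requires human oversight.

Details Motivation: Stroke recognition by first responders is inconsistent and often inaccurate, with sensitivity as low as 58%, which delays treatment. VOICE aims to close this gap by using a voice AI system to guide expert-level assessment.

Contribution: 1. Develops a voice-driven AI system that guides non-experts through expert-level stroke evaluation; 2. Adds video recording to support later expert review; 3. In simulated testing, identifies 84% of individual stroke signs, with high user-reported confidence and ease of use.

Method: A voice AI system built on natural conversation walks the user through the stroke evaluation step by step and records key examination components on a smartphone. Three non-medical volunteers used the system to assess simulated stroke patients; the study measured diagnostic accuracy, completion time, user confidence, and expert physician review of the AI-generated reports.

Result: The system correctly identified 84% of individual stroke signs and 75% of likely large vessel occlusions (LVOs), with evaluations taking just over 6 minutes. Users reported high confidence (median 4.5/5) and ease of use (mean 4.67/5). It identified 86% of actual strokes but incorrectly flagged 2 of 3 non-stroke cases as strokes. An expert reviewing the reports with videos reached the correct diagnosis in 100% of cases, but because of observed AI errors felt confident enough to make preliminary treatment decisions in only 40% of cases.

Insight: The current system needs human oversight, but the authors see clear potential. Rapid progress in speech-to-speech AI may enable highly accurate assessments, putting expert-informed assessment in everyone's hands and changing emergency medical care.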

Abstract: We developed a voice-driven artificial intelligence (AI) system that guides anyone - from paramedics to family members - through expert-level stroke evaluations using natural conversation, while also enabling smartphone video capture of key examination components for documentation and potential expert review. This addresses a critical gap in emergency care: current stroke recognition by first responders is inconsistent and often inaccurate, with sensitivity for stroke detection as low as 58%, causing life-threatening delays in treatment. Three non-medical volunteers used our AI system to assess ten simulated stroke patients, including cases with likely large vessel occlusion (LVO) strokes and stroke-like conditions, while we measured diagnostic accuracy, completion times, user confidence, and expert physician review of the AI-generated reports. The AI system correctly identified 84% of individual stroke signs and detected 75% of likely LVOs, completing evaluations in just over 6 minutes. Users reported high confidence (median 4.5/5) and ease of use (mean 4.67/5). The system successfully identified 86% of actual strokes but also incorrectly flagged 2 of 3 non-stroke cases as strokes. When an expert physician reviewed the AI reports with videos, they identified the correct diagnosis in 100% of cases, but felt confident enough to make preliminary treatment decisions in only 40% of cases due to observed AI errors including incorrect scoring and false information. While the current system’s limitations necessitate human oversight, ongoing rapid advancements in speech-to-speech AI models suggest that future versions are poised to enable highly accurate assessments. Achieving human-level voice interaction could transform emergency medical care, putting expert-informed assessment capabilities in everyone’s hands.
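
The case-level figures in the abstract can be turned into standard screening metrics with a little arithmetic. The sketch below assumes ten assessed cases, of which three were non-stroke and seven were strokes, and that "86% of actual strokes" corresponds to 6 of 7; the abstract states only the percentages and the "2 of 3" count, so these underlying counts are an inference, not reported data.

```python
from fractions import Fraction

# Counts inferred from the abstract (assumption: 10 cases, 3 of them non-stroke).
n_cases = 10
n_non_stroke = 3
n_stroke = n_cases - n_non_stroke                  # 7 strokes (inferred)

true_positives = 6                                 # 6/7 is about 86% (inferred)
false_positives = 2                                # "2 of 3 non-stroke cases" flagged
true_negatives = n_non_stroke - false_positives    # 1

sensitivity = Fraction(true_positives, n_stroke)               # strokes caught
specificity = Fraction(true_negatives, n_non_stroke)           # non-strokes cleared
ppv = Fraction(true_positives, true_positives + false_positives)  # flagged cases that are strokes

for name, value in [("sensitivity", sensitivity),
                    ("specificity", specificity),
                    ("positive predictive value", ppv)]:
    print(f"{name}: {value} = {float(value):.1%}")
```

Under these assumed counts, high sensitivity comes with a specificity of only one in three, which is consistent with the abstract's conclusion that the current system needs human oversight.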

[130] iLearnRobot: An Interactive Learning-Based Multi-Modal Robot with Continuous Improvement

Kohou Wang,ZhaoXiang Liu,Lin Bai,Kun Fan,Xiang Liu,Huan Hu,Kai Wang,Shiguo Lian

Main category: cs.HC

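TL;DR: The paper proposes an interactive-learning robot system based on a multi-modal large language model (MLLM) that keeps improving its performance through natural dialogue with users.

Details Motivation: After deployment, robots are likely to encounter novel scenarios they have never seen, so they need a way to keep learning and improving in real use. Current mainstream MLLM-based robot systems lack this interactive learning ability and cannot avoid repeating mistakes.

Contribution: 1) Proposes an MLLM-based interactive-learning robot system that learns from natural dialogue with non-expert users; 2) Introduces a chain-of-question mechanism that clarifies the user's intent before answering; 3) Designs dual-modality retrieval modules that use interaction events to avoid repeating mistakes.

Method: 1) Natural-dialogue interaction supported by an MLLM; 2) A chain-of-question mechanism to clarify user intent; 3) Dual-modality retrieval modules that store and leverage interaction events.

Result: Experiments demonstrate, both quantitatively and qualitatively, the system's effectiveness and its ability to improve continuously.

Insight: Interactive learning gives robots more flexible adaptability, with broad application prospects in multi-modal and human-robot interaction settings.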

Abstract: It is crucial that robots’ performance can be improved after deployment, as they are inherently likely to encounter novel scenarios never seen before. This paper presents an innovative solution: an interactive learning-based robot system powered by a Multi-modal Large Language Model (MLLM). A key feature of our system is its ability to learn from natural dialogues with non-expert users. We also propose chain of question, which clarifies the exact intent of a question before providing an answer, and dual-modality retrieval modules, which leverage these interaction events to avoid repeating the same mistakes. Together they ensure a seamless user experience before model updates, in contrast to current mainstream MLLM-based robotic systems. Our system marks a novel approach in robotics by integrating interactive learning, paving the way for superior adaptability and performance in diverse environments. We demonstrate the effectiveness and improvement of our method through experiments, both quantitatively and qualitatively.

cs.GR [Back]

[131] Noise-Coded Illumination for Forensic and Photometric Video Analysis

Peter F. Michael,Zekun Hao,Serge Belongie,Abe Davis

Main category: cs.GR

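TL;DR: By coding subtle, noise-like modulations into scene illumination, the method adds a temporal watermark to video, helping to counter video manipulation and protect high-stakes content.

Details Motivation: As video manipulation tools proliferate, fake videos are increasingly hard to distinguish from real ones. The paper uses coded illumination to create an information asymmetry that favors verification.

Contribution: Proposes a novel noise-coded illumination technique that adds a temporal watermark to video; even an adversary who knows the technique is in use finds a plausible fake hard to create.

Method: Subtle, noise-like modulations are coded into the scene illumination to produce a temporal watermark. The watermark encodes an image of the unmanipulated scene as it would appear lit only by the coded illumination.

Result: The authors show that creating a plausible coded fake amounts to solving a second, harder version of the original forgery problem at an information disadvantage, which makes the technique promising for high-stakes settings such as public events and interviews.

Insight: Controlling illumination to create an information asymmetry offers a new approach to video authentication, suited to settings where the illumination can be controlled but the cameras cannot.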

Abstract: The proliferation of advanced tools for manipulating video has led to an arms race, pitting those who wish to sow disinformation against those who want to detect and expose it. Unfortunately, time favors the ill-intentioned in this race, with fake videos growing increasingly difficult to distinguish from real ones. At the root of this trend is a fundamental advantage held by those manipulating media: equal access to a distribution of what we consider authentic (i.e., “natural”) video. In this paper, we show how coding very subtle, noise-like modulations into the illumination of a scene can help combat this advantage by creating an information asymmetry that favors verification. Our approach effectively adds a temporal watermark to any video recorded under coded illumination. However, rather than encoding a specific message, this watermark encodes an image of the unmanipulated scene as it would appear lit only by the coded illumination. We show that even when an adversary knows that our technique is being used, creating a plausible coded fake video amounts to solving a second, more difficult version of the original adversarial content creation problem at an information disadvantage. This is a promising avenue for protecting high-stakes settings like public events and interviews, where the content on display is a likely target for manipulation, and while the illumination can be controlled, the cameras capturing video cannot.
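
A toy simulation can make the watermark idea concrete: if one light source is modulated by a known noise-like code, correlating each pixel's time series with that code recovers an image of the scene as lit by that source alone. The sketch below is a simplified linear model with invented parameters (`amp`, the noise level, a 64-pixel "image") and a crude stand-in for tampering; it is not the authors' pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_pixels = 2000, 64

# Hypothetical scene: per-pixel brightness under ambient light and under the coded light.
ambient_img = rng.uniform(0.3, 1.0, n_pixels)
code_img = rng.uniform(0.2, 1.0, n_pixels)

# Noise-like code signal, made exactly zero-mean so constant ambient light cancels.
code = rng.normal(size=n_frames)
code -= code.mean()
amp = 0.02  # assumed modulation depth, small enough to be subtle

def record(coded_pixels):
    """Simulate frames; only pixels flagged in coded_pixels carry the code."""
    sensor_noise = rng.normal(scale=0.01, size=(n_frames, n_pixels))
    modulation = amp * np.outer(code, code_img * coded_pixels)
    return ambient_img + modulation + sensor_noise

def recover_code_image(video):
    """Correlate every pixel's time series with the known code."""
    return code @ video / (amp * (code @ code))

genuine = record(np.ones(n_pixels))
recovered = recover_code_image(genuine)

# Crude stand-in for a fake: the first 8 pixels show content that lacks the code.
mask = np.ones(n_pixels)
mask[:8] = 0.0
recovered_tampered = recover_code_image(record(mask))

print("max error, genuine video:", np.abs(recovered - code_img).max())
print("recovered values in replaced region:", recovered_tampered[:8].round(3))
```

In this toy model the recovered image matches the coded-light image for the genuine recording, while the replaced region decodes to roughly zero. A convincing fake would have to reproduce the coded-light image as well as the visible content, which is the extra burden the abstract describes.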