cs.CL

[1] No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes

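Iván Vicente Moreno Cencerrado, Arnau Padrés Masdemont, Anton Gonzalvez Hawthorne, David Demitri Africa, Lorenzo Pacchiardi

Main category: cs.CL

TL;DR: The paper studies whether linear probes trained on question-only activations can predict if a large language model (LLM) will answer correctly, without generating the answer.

Details Motivation: To test whether an LLM anticipates the correctness of its own answer before producing it, and thereby shed light on its internal mechanisms.

Contribution: Proposes a linear-probe method that predicts answer correctness from question activations alone, and validates its generalisation across several model families.

Method: Extract model activations after the question has been read but before any token is generated, train linear probes to predict whether the forthcoming answer is correct, and compare performance across layers.

Result: The probes perform well on general-knowledge questions, in distribution and out of distribution, but generalise poorly to questions requiring mathematical reasoning. Predictive power saturates in intermediate layers.

Insight: Self-assessment appears to emerge mid-computation, and "I don't know" responses correlate strongly with the probe score, suggesting that the same direction also captures confidence.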

Abstract: Do large language models (LLMs) anticipate when they will answer correctly? To study this, we extract activations after a question is read but before any tokens are generated, and train linear probes to predict whether the model’s forthcoming answer will be correct. Across three open-source model families ranging from 7 to 70 billion parameters, projections on this “in-advance correctness direction” trained on generic trivia questions predict success in distribution and on diverse out-of-distribution knowledge datasets, outperforming black-box baselines and verbalised predicted confidence. Predictive power saturates in intermediate layers, suggesting that self-assessment emerges mid-computation. Notably, generalisation falters on questions requiring mathematical reasoning. Moreover, for models responding “I don’t know”, doing so strongly correlates with the probe score, indicating that the same direction also captures confidence. By complementing previous results on truthfulness and other behaviours obtained with probes and sparse auto-encoders, our work contributes essential findings to elucidate LLM internals.
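To make the probing setup concrete, the following is a minimal sketch rather than the authors' implementation. It assumes synthetic 8-dimensional vectors as stand-ins for real question activations, a hypothetical hidden "correctness direction" planted in the data, and plain logistic regression fitted by gradient descent as the linear probe. All function names and hyperparameters are illustrative.

```python
import math
import random


def sigmoid(z):
    # Clamp the logit so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def probe_score(w, b, x):
    """Projection of an activation vector onto the probe direction, plus bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_probe(activations, labels, epochs=200, lr=0.5):
    """Fit logistic regression (a linear probe) with batch gradient descent."""
    dim = len(activations[0])
    n = len(activations)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(activations, labels):
            err = sigmoid(probe_score(w, b, x)) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_split(n, rng):
    """Synthetic (activation, answer-was-correct) pairs; NOT real model data."""
    direction = [0.5] * 4 + [0.0] * 4  # hypothetical planted direction
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        xs.append([sign * d + rng.gauss(0.0, 0.5) for d in direction])
        ys.append(y)
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_split(200, rng)
test_x, test_y = make_split(200, rng)

w, b = train_linear_probe(train_x, train_y)
predictions = [1 if probe_score(w, b, x) > 0 else 0 for x in test_x]
accuracy = sum(p == y for p, y in zip(predictions, test_y)) / len(test_y)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

With real models, the synthetic vectors would be replaced by hidden states taken at a chosen layer after the question is read, and the labels by whether the generated answer was judged correct.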

[2] Context Copying Modulation: The Role of Entropy Neurons in Managing Parametric and Contextual Knowledge Conflicts

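Zineddine Tighidet, Andrea Mogini, Hedi Ben-younes, Jiali Mei, Patrick Gallinari, Benjamin Piwowarski

Main category: cs.CL

TL;DR: The paper examines how entropy neurons suppress context copying in large language models, and shows that they play a key role when parametric knowledge conflicts with contextual information.

Details Motivation: LLMs behave inconsistently when the context conflicts with their internal parametric knowledge, and no accepted explanation exists. The authors analyse entropy neurons to better understand these internal dynamics.

Contribution: Shows that entropy neurons are responsible for suppressing context copying, offering a new perspective on how LLMs handle conflicting information.

Method: Analyse entropy neurons across several LLMs, study their behaviour when context and parametric knowledge conflict, and verify their effect through ablation experiments.

Result: Ablating entropy neurons significantly changes the generation process, confirming their role in suppressing context copying.

Insight: Entropy neurons help regulate conflicting information in LLMs, which offers a potential explanation for the models' inconsistent behaviour.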

Abstract: The behavior of Large Language Models (LLMs) when facing contextual information that conflicts with their internal parametric knowledge is inconsistent, with no generally accepted explanation for the expected outcome distribution. Recent work has identified in autoregressive transformer models a class of neurons – called entropy neurons – that produce a significant effect on the model output entropy while having an overall moderate impact on the ranking of the predicted tokens. In this paper, we investigate the preliminary claim that these neurons are involved in inhibiting context copying behavior in transformers by looking at their role in resolving conflicts between contextual and parametric information. We show that entropy neurons are responsible for suppressing context copying across a range of LLMs, and that ablating them leads to a significant change in the generation process. These results enhance our understanding of the internal dynamics of LLMs when handling conflicting information.

[3] A Survey on Retrieval And Structuring Augmented Generation with Large Language Models

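Pengcheng Jiang, Siru Ouyang, Yizhu Jiao, Ming Zhong, Runchu Tian, Jiawei Han

Main category: cs.CL

TL;DR: A survey of Retrieval And Structuring (RAS) Augmented Generation for large language models (LLMs). It discusses how dynamic information retrieval and structured knowledge representations address challenges LLMs face in real-world use, including hallucination, outdated knowledge, and limited domain expertise.

Details Motivation: Despite strong text generation and reasoning, LLMs still suffer from hallucination, outdated knowledge, and insufficient domain knowledge in practice. The authors examine RAS as a way to overcome these limitations.

Contribution: (1) A systematic review of retrieval mechanisms (sparse, dense, and hybrid) and text structuring techniques; (2) a discussion of how structured representations are integrated with LLMs; (3) identification of technical challenges and future research directions.

Method: A literature survey covering retrieval mechanisms (sparse, dense, and hybrid), text structuring techniques (taxonomy construction, hierarchical classification, and information extraction), and their integration with LLMs (prompt-based methods, reasoning frameworks, and knowledge embedding techniques).

Result: The survey summarises the strengths and limitations of existing RAS methods and points to research opportunities in multimodal retrieval, cross-lingual structures, and interactive systems.

Insight: By combining dynamic retrieval with structured knowledge representations, RAS methods improve generation quality and knowledge accuracy in LLMs, but retrieval efficiency and structure quality still need further work.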

Abstract: Large Language Models (LLMs) have revolutionized natural language processing with their remarkable capabilities in text generation and reasoning. However, these models face critical challenges when deployed in real-world applications, including hallucination generation, outdated knowledge, and limited domain expertise. Retrieval And Structuring (RAS) Augmented Generation addresses these limitations by integrating dynamic information retrieval with structured knowledge representations. This survey (1) examines retrieval mechanisms including sparse, dense, and hybrid approaches for accessing external knowledge; (2) explores text structuring techniques such as taxonomy construction, hierarchical classification, and information extraction that transform unstructured text into organized representations; and (3) investigates how these structured representations integrate with LLMs through prompt-based methods, reasoning frameworks, and knowledge embedding techniques. It also identifies technical challenges in retrieval efficiency, structure quality, and knowledge integration, while highlighting research opportunities in multimodal retrieval, cross-lingual structures, and interactive systems. This comprehensive overview provides researchers and practitioners with insights into RAS methods, applications, and future directions.
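As an illustration of how sparse, dense, and hybrid retrieval relate, here is a small sketch under stated assumptions. It is not taken from the survey: the sparse score is simple term overlap (a stand-in for methods such as BM25), the dense vectors are hand-written toy embeddings rather than output of a real encoder, and the mixing weight `alpha` is a hypothetical parameter.

```python
import math


def sparse_score(query, doc):
    """Fraction of query terms that also appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_rank(query, query_vec, docs, doc_vecs, alpha=0.5):
    """Rank documents by a weighted mix of sparse and dense scores."""
    scored = []
    for doc, vec in zip(docs, doc_vecs):
        score = alpha * sparse_score(query, doc) + (1 - alpha) * cosine(query_vec, vec)
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


docs = [
    "insulin regulates blood glucose",
    "the stock market fell today",
    "glucose metabolism and diabetes",
]
# Toy 2-d "embeddings" (assumed): first axis ~ biomedical, second ~ finance.
doc_vecs = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
query = "blood glucose regulation"
query_vec = [1.0, 0.0]

ranking = hybrid_rank(query, query_vec, docs, doc_vecs)
print(ranking)
```

Setting `alpha=1.0` reduces this to purely sparse retrieval and `alpha=0.0` to purely dense retrieval, which is one simple way to see the hybrid approach as an interpolation between the two.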

[4] SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation

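Iman Barati, Mostafa Amiri, Heshaam Faili

Main category: cs.CL

TL;DR: SearchInstruct is a retrieval-based method for building high-quality instruction datasets for domain adaptation. It generates answers by dynamically retrieving relevant resources, which improves the quality and diversity of SFT datasets and in turn LLM performance in specialised domains.

Details Motivation: Creating training datasets suitable for SFT in a specific domain is hard because of data scarcity and domain constraints. SearchInstruct addresses this by combining human-written questions with retrieval to build instruction datasets efficiently.

Contribution: 1. Proposes SearchInstruct, a retrieval-based method for constructing instruction datasets; 2. Generates high-quality answers by dynamically retrieving domain resources; 3. Validates the method for both SFT and model editing, and provides an open-source implementation.

Method: 1. Start from a small set of domain-specific, human-written questions; 2. Expand the questions with a large language model; 3. Dynamically retrieve domain-relevant resources to generate an answer for each question.

Result: Experiments show that SearchInstruct improves the quality and diversity of SFT datasets, improves LLM performance on domain-specific tasks, and enables efficient model editing.

Insight: Combining dynamic retrieval with LLMs can efficiently produce domain-adaptation datasets, offering a new approach to data scarcity and showing potential for model editing.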

Abstract: Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs), significantly enhancing critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets tailored for specific domains remains challenging due to unique domain constraints and data scarcity. In this paper, we propose SearchInstruct, an innovative method explicitly designed to construct high quality instruction datasets for SFT. Our approach begins with a limited set of domain specific, human generated questions, which are systematically expanded using a large language model. Subsequently, domain relevant resources are dynamically retrieved to generate accurate and contextually appropriate answers for each augmented question. Experimental evaluation demonstrates that SearchInstruct enhances both the diversity and quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. Additionally, we show that beyond dataset generation, the proposed method can also effectively facilitate tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction response pairs, and the source code in a publicly accessible Git repository: https://github.com/mostafaamiri/SearchInstruct

[5] Reasoning Under Uncertainty: Exploring Probabilistic Reasoning Capabilities of LLMs

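Mobina Pournemat, Keivan Rezaei, Gaurang Sriramanan, Arman Zarei, Jiaxiang Fu, Yang Wang, Hamid Eghbalzadeh, Soheil Feizi

Main category: cs.CL

TL;DR: The paper studies how large language models (LLMs) perform on probabilistic reasoning tasks, revealing differences in ability and limitations across mode identification, maximum likelihood estimation, and sample generation.

Details Motivation: Although LLMs excel at language understanding and generation, their behaviour on tasks requiring probabilistic reasoning is unclear and often inconsistent, so a systematic evaluation is needed.

Contribution: Presents the first comprehensive study of LLM reasoning over explicit discrete probability distributions, with three tasks (mode identification, maximum likelihood estimation, and sample generation) designed to evaluate model performance.

Method: Given observations from a distribution, prompt LLMs to answer queries about the joint distribution or its conditionals, testing frequency analysis, marginalization, and generative behaviour.

Result: Larger models perform better at inference and sample generation, but models are sensitive to the notation used for probabilistic outcomes, and performance degrades by more than 60% as context length increases.

Insight: LLMs differ markedly in probabilistic reasoning ability. Future improvements should target robustness to notation and to longer contexts.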

Abstract: Despite widespread success in language understanding and generation, large language models (LLMs) exhibit unclear and often inconsistent behavior when faced with tasks that require probabilistic reasoning. In this work, we present the first comprehensive study of the reasoning capabilities of LLMs over explicit discrete probability distributions. Given observations from a probability distribution, we evaluate models on three carefully designed tasks, mode identification, maximum likelihood estimation, and sample generation, by prompting them to provide responses to queries about either the joint distribution or its conditionals. These tasks thus probe a range of probabilistic skills, including frequency analysis, marginalization, and generative behavior. Through comprehensive empirical evaluations, we demonstrate that there exists a clear performance gap between smaller and larger models, with the latter demonstrating stronger inference and surprising capabilities in sample generation. Furthermore, our investigations reveal notable limitations, including sensitivity to variations in the notation utilized to represent probabilistic outcomes and performance degradation of over 60% as context length increases. Together, our results provide a detailed understanding of the probabilistic reasoning abilities of LLMs and identify key directions for future improvement.
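For readers unfamiliar with the three skills named above, the following sketch computes the reference answers that such tasks would be scored against. It is an illustration only: the toy (weather, transport) observations and all variable names are assumptions, not data or code from the paper.

```python
from collections import Counter

# Hypothetical observations from a joint distribution over (weather, transport).
observations = (
    [("sunny", "walk")] * 4
    + [("sunny", "bike")] * 1
    + [("rainy", "bus")] * 2
    + [("rainy", "walk")] * 1
)

n = len(observations)
joint_counts = Counter(observations)

# Maximum likelihood estimate of the joint: relative frequencies.
joint_mle = {outcome: count / n for outcome, count in joint_counts.items()}

# Mode identification: the most frequent joint outcome.
joint_mode = max(joint_counts, key=joint_counts.get)

# Marginalization: sum the joint counts over the transport variable.
weather_counts = Counter(weather for weather, _ in observations)
weather_marginal = {w: c / n for w, c in weather_counts.items()}

# Conditional distribution P(transport | weather = "rainy").
rainy_counts = Counter(t for w, t in observations if w == "rainy")
rainy_total = sum(rainy_counts.values())
transport_given_rainy = {t: c / rainy_total for t, c in rainy_counts.items()}
conditional_mode = max(rainy_counts, key=rainy_counts.get)

print(joint_mode, joint_mle[joint_mode])
print(weather_marginal)
print(conditional_mode, transport_given_rainy)
```

An LLM evaluated on these tasks would be shown the observations in its prompt and asked for the same quantities, which can then be compared with the counts computed here.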

[6] Automated MCQA Benchmarking at Scale: Evaluating Reasoning Traces as Retrieval Sources for Domain Adaptation of Small Language Models

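Ozan Gokdemir, Neil Getty, Robert Underwood, Sandeep Madireddy, Franck Cappello, Arvind Ramanathan, Ian T. Foster, Rick L. Stevens

Main category: cs.CL

TL;DR: The paper proposes an automated framework for generating multiple-choice question-answering (MCQA) benchmarks from scientific-paper PDFs, and shows that retrieving reasoning traces improves the performance of small language models.

Details Motivation: Scientific knowledge is growing rapidly, so evaluation benchmarks must evolve to test language models on current, diverse literature.

Contribution: 1. A modular, scalable MCQA generation framework. 2. A large-scale MCQA benchmark in radiation and cancer biology. 3. Evidence that reasoning-trace retrieval improves small language models.

Method: 1. A fully automated pipeline covering PDF parsing, semantic chunking, question generation, and model evaluation. 2. Retrieval-augmented generation (RAG) from paper-derived semantic chunks and from reasoning traces distilled from GPT-4.1.

Result: Reasoning-trace retrieval consistently improves small language models, and several of them surpass GPT-4 on the 2023 Astro Radiation and Cancer Biology exam.

Insight: Reasoning traces can serve as an effective retrieval source for domain adaptation of small language models, offering a strong option for resource-constrained settings.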

Abstract: As scientific knowledge grows at an unprecedented pace, evaluation benchmarks must evolve to reflect new discoveries and ensure language models are tested on current, diverse literature. We propose a scalable, modular framework for generating multiple-choice question-answering (MCQA) benchmarks directly from large corpora of scientific papers. Our pipeline automates every stage of MCQA creation, including PDF parsing, semantic chunking, question generation, and model evaluation. As a case study, we generate more than 16,000 MCQs from 22,000 open-access articles in radiation and cancer biology. We then evaluate a suite of small language models (1.1B-14B parameters) on these questions, comparing baseline accuracy with retrieval-augmented generation (RAG) from paper-derived semantic chunks and from reasoning traces distilled from GPT-4.1. We find that reasoning-trace retrieval consistently improves performance on both synthetic and expert-annotated benchmarks, enabling several small models to surpass GPT-4 on the 2023 Astro Radiation and Cancer Biology exam.

[7] RECAP: Transparent Inference-Time Emotion Alignment for Medical Dialogue Systems

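Adarsh Srinivasan, Jacob Dineen, Muhammad Umar Afzal, Muhammad Uzair Sarfraz, Irbaz B. Riaz, Ben Zhou

Main category: cs.CL

TL;DR: RECAP is an inference-time framework that requires no retraining. It uses transparent appraisal-theoretic stages to improve emotional understanding and expression in medical dialogue systems.

Details Motivation: Patients in medical conversations are often vulnerable and need empathic communication to support safety and trust. Existing large language models often miss emotional cues and respond flatly. RECAP aims to close this gap.

Contribution: Proposes the RECAP framework, which decomposes empathy into transparent appraisal stages to produce more emotionally nuanced responses while remaining auditable.

Method: RECAP has five steps: Reflect, Extract, Calibrate, Align, and Produce. It applies structured emotional reasoning and exposes per-dimension Likert signals to shape the model output.

Result: Across EmoBench, SECEU, and EQ-Bench, RECAP improves emotional reasoning over zero-shot baselines by 22-28% on 8B models and 10-13% on larger models. Clinician evaluations also confirm more empathetic communication.

Insight: Modular, theory-grounded prompting can systematically enhance emotional intelligence in AI while preserving the auditability required for deployment.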

Abstract: Large language models in healthcare often miss critical emotional cues, delivering medically sound but emotionally flat advice. This is especially problematic in clinical contexts where patients are distressed and vulnerable, and require empathic communication to support safety, adherence, and trust. We present RECAP (Reflect-Extract-Calibrate-Align-Produce), an inference-time framework that adds structured emotional reasoning without retraining. By decomposing empathy into transparent appraisal-theoretic stages and exposing per-dimension Likert signals, RECAP produces nuanced, auditable responses. Across EmoBench, SECEU, and EQ-Bench, RECAP improves emotional reasoning by 22-28% on 8B models and 10-13% on larger models over zero-shot baselines. Clinician evaluations further confirm superior empathetic communication. RECAP shows that modular, theory-grounded prompting can systematically enhance emotional intelligence in medical AI while preserving the accountability required for deployment.

[8] A funny companion: Distinct neural responses to perceived AI- versus human- generated humor

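Xiaohui Rao, Hanlin Wu, Zhenguang G. Cai

Main category: cs.CL

TL;DR: Using electroencephalography (EEG), the paper studies cognitive and emotional responses to humor attributed to AI versus humans. Although behavioural funniness ratings are similar, neurophysiological data show that AI humor requires less cognitive effort and elicits a stronger emotional response, challenging "algorithm aversion".

Details Motivation: As AI companions become capable of human-like communication, including telling jokes, understanding how people respond cognitively and emotionally to AI humor is increasingly important.

Contribution: Reveals neurophysiological differences between the processing of AI humor and human humor, and shows how cognitive adaptation to AI humor over time can intensify emotional reward.

Method: EEG recordings compare participants' brain responses (N400 and LPP) to AI and human humor, combined with behavioural ratings and measures of attitudes toward AI.

Result: Behaviourally, AI and human humor are rated as comparably funny. Neurally, AI humor elicits a smaller N400 (lower cognitive effort) and a larger LPP (stronger emotional response). Over time, responses to AI humor show increasing processing efficiency and emotional reward, whereas human humor shows habituation.

Insight: The findings challenge "algorithm aversion" in humor: the brain can dynamically update its predictive model of AI capabilities, and cumulative reinforcement may foster genuine engagement in human-AI social interaction. Participants' social attitudes, such as perceived AI trustworthiness, also modulate the neural responses.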

Abstract: As AI companions become capable of human-like communication, including telling jokes, understanding how people cognitively and emotionally respond to AI humor becomes increasingly important. This study used electroencephalography (EEG) to compare how people process humor from AI versus human sources. Behavioral analysis revealed that participants rated AI and human humor as comparably funny. However, neurophysiological data showed that AI humor elicited a smaller N400 effect, suggesting reduced cognitive effort during the processing of incongruity. This was accompanied by a larger Late Positive Potential (LPP), indicating a greater degree of surprise and emotional response. This enhanced LPP likely stems from the violation of low initial expectations regarding AI’s comedic capabilities. Furthermore, a key temporal dynamic emerged: human humor showed habituation effects, marked by an increasing N400 and a decreasing LPP over time. In contrast, AI humor demonstrated increasing processing efficiency and emotional reward, with a decreasing N400 and an increasing LPP. This trajectory reveals how the brain can dynamically update its predictive model of AI capabilities. This process of cumulative reinforcement challenges “algorithm aversion” in humor, as it demonstrates how cognitive adaptation to AI’s language patterns can lead to an intensified emotional reward. Additionally, participants’ social attitudes toward AI modulated these neural responses, with higher perceived AI trustworthiness correlating with enhanced emotional engagement. These findings indicate that the brain responds to AI humor with surprisingly positive and intense reactions, highlighting humor’s potential for fostering genuine engagement in human-AI social interaction.

[9] Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue

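Sangyeop Kim, Yohan Lee, Sanghwa Kim, Hyunjong Kim, Sungzoon Cho

Main category: cs.CL

TL;DR: PREMem moves complex reasoning from response generation to memory construction. Pre-storage reasoning improves long-term memory in conversational AI and reduces the computational burden at inference time.

Details Motivation: Long-term memory in conversational AI requires synthesizing information across multiple sessions, but current systems place too much of the reasoning burden on response generation, which makes performance depend heavily on model size.

Contribution: Proposes PREMem, which uses pre-storage reasoning to extract fine-grained memory fragments (factual, experiential, and subjective information) and to establish relationships between memories across sessions.

Method: Perform the complex reasoning during memory construction: extract and categorise memory fragments, then establish explicit relationships and evolution patterns among memory items.

Result: Experiments show significant performance gains. Smaller models reach results comparable to much larger baselines, and the method remains effective under constrained token budgets.

Insight: Moving reasoning forward into memory construction balances computational cost against performance and strengthens smaller models.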

Abstract: Effective long-term memory in conversational AI requires synthesizing information across multiple sessions. However, current systems place excessive reasoning burden on response generation, making performance significantly dependent on model sizes. We introduce PREMem (Pre-storage Reasoning for Episodic Memory), a novel approach that shifts complex reasoning processes from inference to memory construction. PREMem extracts fine-grained memory fragments categorized into factual, experiential, and subjective information; it then establishes explicit relationships between memory items across sessions, capturing evolution patterns like extensions, transformations, and implications. By performing this reasoning during pre-storage rather than when generating a response, PREMem creates enriched representations while reducing computational demands during interactions. Experiments show significant performance improvements across all model sizes, with smaller models achieving results comparable to much larger baselines while maintaining effectiveness even with constrained token budgets. Code and dataset are available at https://github.com/sangyeop-kim/PREMem.

[10] Quantifier Scope Interpretation in Language Learners and LLMs

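Shaohua Fang, Yue Li, Yan Cong

Main category: cs.CL

TL;DR: The paper studies how large language models (LLMs) interpret quantifier scope in English and Chinese. Most LLMs prefer surface-scope readings, as humans do, but model architecture and the language background of the pre-training data significantly affect how closely they match human behaviour.

Details Motivation: Quantifiers can make a sentence ambiguous, and the ambiguity varies across languages. The study asks whether LLMs reproduce human quantifier scope preferences across languages.

Contribution: Shows that LLMs tend toward the same surface-scope readings as humans, and identifies model architecture and pre-training data language background as key influences on alignment with human behaviour.

Method: A cross-linguistic approach that uses probabilities to assess interpretive likelihood and human similarity (HS) scores to quantify how closely LLMs approximate human performance.

Result: Most LLMs prefer surface-scope interpretations. Only some differentiate between English and Chinese in their inverse-scope preferences. Model architecture, scale, and pre-training data language background significantly influence the results.

Insight: LLMs show potential to align with humans on quantifier interpretation, but model design and data background are key variables.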

Abstract: Sentences with multiple quantifiers often lead to interpretive ambiguities, which can vary across languages. This study adopts a cross-linguistic approach to examine how large language models (LLMs) handle quantifier scope interpretation in English and Chinese, using probabilities to assess interpretive likelihood. Human similarity (HS) scores were used to quantify the extent to which LLMs emulate human performance across language groups. Results reveal that most LLMs prefer the surface scope interpretations, aligning with human tendencies, while only some differentiate between English and Chinese in the inverse scope preferences, reflecting human-similar patterns. HS scores highlight variability in LLMs’ approximation of human behavior, but their overall potential to align with humans is notable. Differences in model architecture, scale, and particularly models’ pre-training data language background, significantly influence how closely LLMs approximate human quantifier scope interpretations.

[11] EmoBench-Reddit: A Hierarchical Benchmark for Evaluating the Emotional Intelligence of Multimodal Large Language Models

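Haokun Li, Yazhou Zhang, Jizhi Ding, Qiuchi Li, Peng Zhang

Main category: cs.CL

TL;DR: EmoBench-Reddit is a new hierarchical benchmark for evaluating emotion understanding in Multimodal Large Language Models (MLLMs). It contains 350 curated multimodal samples from Reddit and uses tasks of increasing difficulty to assess abilities ranging from basic perception to advanced cognition.

Details Motivation: Current multimodal benchmarks focus on objective tasks such as visual question answering or image captioning, and neglect a model's ability to understand complex, subjective emotions. EmoBench-Reddit fills this gap.

Contribution: Introduces EmoBench-Reddit, a hierarchical benchmark focused on emotion understanding; designs a progressive task framework from perception to cognition; ensures high-quality annotation.

Method: The dataset contains 350 Reddit samples (image, text, and emotion label). Each sample has six multiple-choice questions and one open-ended question of increasing difficulty. Annotation quality is ensured through AI assistance (Claude 4) and manual verification.

Result: The hierarchical task framework allows a comprehensive evaluation of emotion understanding, covering emotion recognition, scene reasoning, and higher-level abilities such as empathy.

Insight: Emotion understanding is an important dimension of multimodal model intelligence, and future benchmarks should pay more attention to subjective and complex human emotions.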

Abstract: With the rapid advancement of Multimodal Large Language Models (MLLMs), they have demonstrated exceptional capabilities across a variety of vision-language tasks. However, current evaluation benchmarks predominantly focus on objective visual question answering or captioning, inadequately assessing the models’ ability to understand complex and subjective human emotions. To bridge this gap, we introduce EmoBench-Reddit, a novel, hierarchical benchmark for multimodal emotion understanding. The dataset comprises 350 meticulously curated samples from the social media platform Reddit, each containing an image, associated user-provided text, and an emotion category (sad, humor, sarcasm, happy) confirmed by user flairs. We designed a hierarchical task framework that progresses from basic perception to advanced cognition, with each data point featuring six multiple-choice questions and one open-ended question of increasing difficulty. Perception tasks evaluate the model’s ability to identify basic visual elements (e.g., colors, objects), while cognition tasks require scene reasoning, intent understanding, and deep empathy integrating textual context. We ensured annotation quality through a combination of AI assistance (Claude 4) and manual verification.

[12] We Argue to Agree: Towards Personality-Driven Argumentation-Based Negotiation Dialogue Systems for Tourism

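Priyanshu Priya, Saurav Dudhate, Desai Vishesh Yasheshbhai, Asif Ekbal

Main category: cs.CL

TL;DR: The paper proposes a new task, Personality-driven Argumentation-based Negotiation Dialogue Generation (PAN-DG), and introduces the PACT dataset, which incorporates different personality profiles to improve the adaptability and personalisation of negotiation dialogue systems.

Details Motivation: Existing negotiation dialogue systems lack personality adaptation and argumentation ability, which limits their usefulness in real scenarios.

Contribution: 1. Proposes the PAN-DG task; 2. Releases the PACT dataset, which simulates negotiation scenarios involving diverse personalities; 3. Shows that fine-tuned large language models generate personality-driven responses effectively.

Method: Use large language models to generate a negotiation dialogue dataset (PACT) with three personality profiles, then run comparative experiments between pre-trained and fine-tuned models.

Result: Fine-tuned models effectively generate personality-driven negotiation responses, and evaluations indicate that the PACT dialogues are of high quality.

Insight: Introducing personality attributes can improve the adaptability and personalisation of negotiation dialogue systems, laying a foundation for future research.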

Abstract: Integrating argumentation mechanisms into negotiation dialogue systems improves conflict resolution through exchanges of arguments and critiques. Moreover, incorporating personality attributes enhances adaptability by aligning interactions with individuals’ preferences and styles. To advance these capabilities in negotiation dialogue systems, we propose a novel Personality-driven Argumentation-based Negotiation Dialogue Generation (PAN-DG) task. To support this task, we introduce PACT, a dataset of Personality-driven Argumentation-based negotiation Conversations for Tourism sector. This dataset, generated using Large Language Models (LLMs), features three distinct personality profiles, viz. Argumentation Profile, Preference Profile, and Buying Style Profile to simulate a variety of negotiation scenarios involving diverse personalities. Thorough automatic and manual evaluations indicate that the dataset comprises high-quality dialogues. Further, we conduct comparative experiments between pre-trained and fine-tuned LLMs for the PAN-DG task. Multi-dimensional evaluation demonstrates that the fine-tuned LLMs effectively generate personality-driven rational responses during negotiations. This underscores the effectiveness of PACT in enhancing personalization and reasoning capabilities in negotiation dialogue systems, thereby establishing a foundation for future research in this domain.

[13] Joint Effects of Argumentation Theory, Audio Modality and Data Enrichment on LLM-Based Fallacy Classification

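Hongxu Zhou, Hylke Westerdijk, Khondoker Ittehadul Islam

Main category: cs.CL

TL;DR: The paper examines how context and emotional tone metadata affect large language model (LLM) performance on fallacy classification. Evaluating several prompting strategies and two theoretical frameworks, it finds that enriched inputs, such as emotional tone metadata, often lower performance.

Details Motivation: To understand whether theoretical frameworks and input enrichment (such as emotional tone metadata) improve LLM fallacy classification in political debate settings.

Contribution: Introduces two theoretically grounded prompting frameworks (Pragma-Dialectics and the Periodic Table of Arguments) and measures the actual effect of emotional tone metadata and context on classification performance.

Method: Using the Qwen-3 (8B) model, compare the theoretical prompting frameworks against a baseline prompt under three input settings: text only, text with context, and text with both context and emotional tone metadata.

Result: Theoretical prompting can improve interpretability and, in some cases, accuracy, but emotional tone metadata lowers performance and biases the model toward the "Appeal to Emotion" label. Basic prompts often outperform enriched ones.

Insight: Input enrichment may hurt LLM classification through attention dilution. Adding emotional tone metadata does not help and may weaken logical reasoning.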

Abstract: This study investigates how context and emotional tone metadata influence large language model (LLM) reasoning and performance in fallacy classification tasks, particularly within political debate settings. Using data from U.S. presidential debates, we classify six fallacy types through various prompting strategies applied to the Qwen-3 (8B) model. We introduce two theoretically grounded Chain-of-Thought frameworks: Pragma-Dialectics and the Periodic Table of Arguments, and evaluate their effectiveness against a baseline prompt under three input settings: text-only, text with context, and text with both context and audio-based emotional tone metadata. Results suggest that while theoretical prompting can improve interpretability and, in some cases, accuracy, the addition of context and especially emotional tone metadata often leads to lowered performance. Emotional tone metadata biases the model toward labeling statements as Appeal to Emotion, worsening logical reasoning. Overall, basic prompts often outperformed enhanced ones, suggesting that attention dilution from added inputs may worsen rather than improve fallacy classification in LLMs.

[14] When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs’ Toxicity

Shiyao Cui, Xijia Feng, Yingkang Wang, Junxiao Yang, Zhexin Zhang, Biplab Sikdar, Hongning Wang, Han Qiu, Minlie Huang

Main category: cs.CL

TL;DR: Emojis may unexpectedly trigger toxic content generation in large language models (LLMs), bypassing safety mechanisms. The study investigates this phenomenon and its semantic and corpus-level causes.

Details Motivation: Emojis are typically friendly, but observations show they can prompt toxic responses in LLMs, raising safety concerns.

Contribution: 1. Demonstrates emojis’ role in enhancing toxicity in LLMs. 2. Provides semantic and corpus-level interpretations for this behavior.

Method: Automated prompt construction with emojis, experiments across 5 languages and 7 LLMs, and semantic/tokenization analysis.

Result: Emojis easily induce toxicity in LLMs. Semantic channels and pre-training data pollution contribute to this behavior.

Insight: Emojis can bypass safety mechanisms due to their heterogeneous semantics, highlighting vulnerabilities in LLM safety design.

Abstract: Emojis are globally used non-verbal cues in digital communication, and extensive research has examined how large language models (LLMs) understand and utilize emojis across contexts. While usually associated with friendliness or playfulness, it is observed that emojis may trigger toxic content generation in LLMs. Motivated by such an observation, we aim to investigate: (1) whether emojis can clearly enhance the toxicity generation in LLMs and (2) how to interpret this phenomenon. We begin with a comprehensive exploration of emoji-triggered LLM toxicity generation by automating the construction of prompts with emojis to subtly express toxic intent. Experiments across 5 mainstream languages on 7 famous LLMs along with jailbreak tasks demonstrate that prompts with emojis could easily induce toxicity generation. To understand this phenomenon, we conduct model-level interpretations spanning semantic cognition, sequence generation and tokenization, suggesting that emojis can act as a heterogeneous semantic channel to bypass the safety mechanisms. To pursue deeper insights, we further probe the pre-training corpus and uncover a potential correlation between emoji-related data pollution and the toxicity generation behaviors. Supplementary materials provide our implementation code and data. (Warning: This paper contains potentially sensitive contents)

[15] The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences

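Valentin Romanov, Steven A Niederer

Main category: cs.CL

TL;DR: The paper distils 58 text-based prompt engineering techniques down to 6 core approaches, giving life sciences researchers an efficient, low-friction, systematic guide to prompt engineering that supports higher-quality research.

Details Motivation: Developing effective prompts demands significant cognitive investment, and the existing techniques are numerous and complex. The paper simplifies them into a practical guide for life sciences researchers.

Contribution: Distils 58 techniques into 6 core prompt engineering approaches (zero-shot, few-shot, thought generation, ensembling, self-criticism, and decomposition) and gives detailed, structured recommendations grounded in life sciences use cases.

Method: Analyse and summarise a range of prompt engineering techniques, select the 6 core approaches, and provide structured recommendations for common life sciences tasks such as literature summarization, data extraction, and editorial tasks.

Result: The paper illustrates how the core techniques apply in life sciences and discusses the limitations of current platforms (OpenAI, Google, Anthropic, and Perplexity) and their tools.

Insight: Prompt engineering can augment rather than replace existing data processing and document editing practices, helping researchers move from opportunistic prompting to an effective, low-friction systematic practice.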

Abstract: Developing effective prompts demands significant cognitive investment to generate reliable, high-quality responses from Large Language Models (LLMs). By deploying case-specific prompt engineering techniques that streamline frequently performed life sciences workflows, researchers could achieve substantial efficiency gains that far exceed the initial time investment required to master these techniques. The Prompt Report published in 2025 outlined 58 different text-based prompt engineering techniques, highlighting the numerous ways prompts could be constructed. To provide actionable guidelines and reduce the friction of navigating these various approaches, we distil this report to focus on 6 core techniques: zero-shot, few-shot approaches, thought generation, ensembling, self-criticism, and decomposition. We break down the significance of each approach and ground it in use cases relevant to life sciences, from literature summarization and data extraction to editorial tasks. We provide detailed recommendations for how prompts should and shouldn’t be structured, addressing common pitfalls including multi-turn conversation degradation, hallucinations, and distinctions between reasoning and non-reasoning models. We examine context window limitations and agentic tools like Claude Code, while analyzing the effectiveness of Deep Research tools across OpenAI, Google, Anthropic and Perplexity platforms and discussing current limitations. We demonstrate how prompt engineering can augment rather than replace existing established individual practices around data processing and document editing. Our aim is to provide actionable guidance on core prompt engineering principles, and to facilitate the transition from opportunistic prompting to an effective, low-friction systematic practice that contributes to higher quality research.

[16] Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context

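Dasol Choi, Jungwhan Kim, Guijin Son

Main category: cs.CL

TL;DR: The paper introduces Ko-PIQA, a Korean physical commonsense reasoning dataset with cultural context, addressing the lack of cultural diversity in existing datasets.

Details Motivation: Existing physical commonsense reasoning datasets such as PIQA are English-centric and lack cultural diversity. The paper aims to address this and to encourage more inclusive commonsense reasoning research.

Contribution: Ko-PIQA is a high-quality Korean physical commonsense reasoning dataset that includes culturally specific elements and provides a new benchmark for Korean language models.

Method: A multi-stage filtering approach combining language-model screening with GPT-4o refinement, followed by human validation, yields 441 high-quality question-answer pairs.

Result: Of the seven language models evaluated on Ko-PIQA, the best reaches 83.22% accuracy and the weakest 59.86%. Models struggle most with culturally specific scenarios.

Insight: Culturally specific commonsense reasoning is more demanding for language models, which underlines the importance of culturally diverse datasets. Ko-PIQA provides a foundation for research on Korean and on cultural diversity.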

Abstract: Physical commonsense reasoning datasets like PIQA are predominantly English-centric and lack cultural diversity. We introduce Ko-PIQA, a Korean physical commonsense reasoning dataset that incorporates cultural context. Starting from 3.01 million web-crawled questions, we employed a multi-stage filtering approach using three language models to identify 11,553 PIQA-style questions. Through GPT-4o refinement and human validation, we obtained 441 high-quality question-answer pairs. A key feature of Ko-PIQA is its cultural grounding: 19.7% of questions contain culturally specific elements like traditional Korean foods (kimchi), clothing (hanbok), and specialized appliances (kimchi refrigerators) that require culturally-aware reasoning beyond direct translation. We evaluate seven language models on Ko-PIQA, with the best model achieving 83.22% accuracy while the weakest reaches only 59.86%, demonstrating significant room for improvement. Models particularly struggle with culturally specific scenarios, highlighting the importance of culturally diverse datasets. Ko-PIQA serves as both a benchmark for Korean language models and a foundation for more inclusive commonsense reasoning research. The dataset and code will be publicly available.

[17] !MSA at AraHealthQA 2025 Shared Task: Enhancing LLM Performance for Arabic Clinical Question Answering through Prompt Engineering and Ensemble Learning

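Mohamed Tarek, Seif Ahmed, Mohamed Basem

Main category: cs.CL

TL;DR: The paper describes systems for the AraHealthQA-2025 shared task that use prompt engineering and ensemble learning to improve LLM performance on Arabic clinical question answering, placing 2nd in both sub-tasks of Track 2.

Details Motivation: To improve large language model (LLM) performance on Arabic clinical question answering, particularly multiple-choice and open-ended questions.

Contribution: 1) For Sub-Task 1, uses the Gemini 2.5 Flash model with few-shot prompting and ensembling to improve classification accuracy. 2) For Sub-Task 2, uses a unified prompt with role-playing and related techniques to generate concise Arabic medical answers.

Method: 1) Sub-Task 1: Gemini 2.5 Flash with few-shot prompting, dataset preprocessing, and an ensemble of three prompt configurations. 2) Sub-Task 2: the same model with a unified prompt, role-playing, few-shot examples, and post-processing.

Result: 2nd place in both sub-tasks of Track 2 of the AraHealthQA-2025 shared task.

Insight: Prompt engineering and ensemble learning can effectively improve LLM performance on tasks in a specific language and domain, such as Arabic medicine.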

Abstract: We present our systems for Track 2 (General Arabic Health QA, MedArabiQ) of the AraHealthQA-2025 shared task, where our methodology secured 2nd place in both Sub-Task 1 (multiple-choice question answering) and Sub-Task 2 (open-ended question answering) in Arabic clinical contexts. For Sub-Task 1, we leverage the Gemini 2.5 Flash model with few-shot prompting, dataset preprocessing, and an ensemble of three prompt configurations to improve classification accuracy on standard, biased, and fill-in-the-blank questions. For Sub-Task 2, we employ a unified prompt with the same model, incorporating role-playing as an Arabic medical expert, few-shot examples, and post-processing to generate concise responses across fill-in-the-blank, patient-doctor Q&A, GEC, and paraphrased variants.
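The abstract does not say how the outputs of the three prompt configurations are combined. One common option is a majority vote, sketched below under that assumption; the tie-breaking rule (prefer the earliest configuration) and the toy answers are likewise hypothetical, not the authors' setup.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the earliest prediction."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical answers from three prompt configurations for two questions.
per_question = [["B", "B", "C"], ["A", "D", "C"]]
ensembled = [majority_vote(answers) for answers in per_question]
```

With three voters, a full three-way disagreement falls back to the first configuration, so the order of configurations matters in that case.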

[18] Transformer Enhanced Relation Classification: A Comparative Analysis of Contextuality, Data Efficiency and Sequence Complexity

Bowen Jing,Yang Cui,Tianpeng Huang

Main category: cs.CL

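TL;DR: The paper systematically compares transformer-based and non-transformer deep learning methods for relation classification, finds that the former clearly outperform the latter, and discusses the role of LLMs in relation extraction.

Details Motivation: In the era of large language models, relation extraction (RE) is a core information extraction task, yet the advantage of transformer methods over non-transformer methods has not been systematically evaluated.

Contribution: 1. Systematically compares transformer and non-transformer methods on relation classification; 2. shows the advantage of transformer methods across scenarios and datasets; 3. discusses the current status of LLMs in relation extraction.

Method: Experiments with several non-transformer models (such as PA-LSTM, C-GCN, and AGGCN) and transformer models (such as BERT, RoBERTa, and R-BERT) on the TACRED, TACREV, and RE-TACRED datasets, evaluated with micro F1 and other measures.

Result: Transformer models clearly outperform non-transformer models, with micro F1 scores of 80-90% versus 64-67%.

Insight: Transformer methods offer higher performance and better generalization on relation classification, especially with long sequences and limited training data.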

Abstract: In the era of large language models, relation extraction (RE) plays an important role in information extraction through the transformation of unstructured raw text into structured data (Wadhwa et al., 2023). In this paper, we systematically compare the performance of deep supervised learning approaches without transformers and those with transformers. We used a series of non-transformer architectures such as PA-LSTM (Zhang et al., 2017), C-GCN (Zhang et al., 2018), and AGGCN (attention guided GCN) (Guo et al., 2019), and a series of transformer architectures such as BERT, RoBERTa, and R-BERT (Wu and He, 2019). Our comparison included traditional metrics like micro F1, as well as evaluations in different scenarios, varying sentence lengths, and different percentages of the dataset for training. Our experiments were conducted on TACRED, TACREV, and RE-TACRED. The results show that transformer-based models outperform non-transformer models, achieving micro F1 scores of 80-90% compared to 64-67% for non-transformer models. Additionally, we briefly review the research journey in supervised relation classification and discuss the role and current status of large language models (LLMs) in relation extraction.
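For readers unfamiliar with the headline metric, the sketch below computes micro F1 in the way it is conventionally reported on TACRED-style data, where the `no_relation` label is treated as the negative class and excluded from the counts. The helper name and the toy labels are illustrative assumptions; the abstract does not describe the evaluation script actually used.

```python
def micro_f1(gold, pred, negative="no_relation"):
    """Micro-averaged F1 over all relation labels except the negative class."""
    correct = sum(1 for g, p in zip(gold, pred) if g == p and p != negative)
    predicted = sum(1 for p in pred if p != negative)
    actual = sum(1 for g in gold if g != negative)
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: two correct relations, one spurious one, one missed one.
gold = ["per:title", "no_relation", "org:founded_by", "per:title"]
pred = ["per:title", "per:title", "no_relation", "per:title"]
score = micro_f1(gold, pred)  # precision = recall = 2/3
```

Because correct `no_relation` predictions earn no credit, a model cannot inflate this score by predicting the majority negative class.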

[19] CEMTM: Contextual Embedding-based Multimodal Topic Modeling

Amirhossein Abaskohi,Raymond Li,Chuyuan Li,Shafiq Joty,Giuseppe Carenini

Main category: cs.CL

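TL;DR: CEMTM is a contextual-embedding-based multimodal topic model that uses fine-tuned large vision language models and a distributional attention mechanism to model documents containing text and images with semantic consistency, and it performs strongly on multiple benchmarks.

Details Motivation: Existing topic modeling methods struggle to maintain semantic consistency on documents with multimodal data (such as text and images) and cannot efficiently handle multiple images per document.

Contribution: CEMTM is a context-enhanced multimodal topic model that achieves cross-modal semantic consistency through a distributional attention mechanism and a reconstruction objective, and supports multiple images per document.

Method: Contextual embeddings are produced by fine-tuned large vision language models; a distributional attention mechanism weights token-level contributions; a reconstruction objective aligns topic representations with the document embedding.

Result: On six multimodal benchmarks, CEMTM outperforms unimodal and multimodal baselines, reaching an average LLM score of 2.61, and performs well on few-shot retrieval.

Insight: Beyond strong topic modeling performance, CEMTM captures visually grounded semantics and suits complex domains such as scientific articles.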

Abstract: We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language models (LVLMs) to obtain contextualized embeddings, and employs a distributional attention mechanism to weight token-level contributions to topic inference. A reconstruction objective aligns topic-based representations with the document embedding, encouraging semantic consistency across modalities. Unlike existing approaches, CEMTM can process multiple images per document without repeated encoding and maintains interpretability through explicit word-topic and document-topic distributions. Extensive experiments on six multimodal benchmarks show that CEMTM consistently outperforms unimodal and multimodal baselines, achieving a remarkable average LLM score of 2.61. Further analysis shows its effectiveness in downstream few-shot retrieval and its ability to capture visually grounded semantics in complex domains such as scientific articles.

[20] Improving LLMs’ Learning for Coreference Resolution

Yujian Gan,Yuan Liang,Yanni Lin,Juntao Yu,Massimo Poesio

Main category: cs.CL

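TL;DR: The paper examines the limitations of large language models (LLMs) on coreference resolution (CR) and proposes two new techniques, Reversed Training with Joint Inference and Iterative Document Generation, which improve performance and reduce hallucination.

Details Motivation: Coreference resolution underpins many NLP tasks, but existing LLM approaches based on the QA Template and the Document Template suffer from hallucination and under-performance.

Contribution: Proposes Reversed Training with Joint Inference (which improves the QA Template method) and Iterative Document Generation (which reduces hallucination), and combines the two into a robust LLM-based coreference resolution solution.

Method: 1. Reversed Training with Joint Inference: improves how the QA Template method is trained; 2. Iterative Document Generation: reduces hallucination by generating the document over multiple iterations.

Result: Experiments show that Reversed Training improves the QA Template method, while Iterative Document Generation reduces hallucinations in the generated source text and boosts coreference resolution performance.

Insight: Combining Reversed Training with Iterative Document Generation addresses both hallucination and under-performance in LLM-based coreference resolution and suggests a direction for follow-up work.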

Abstract: Coreference Resolution (CR) is crucial for many NLP tasks, but existing LLMs struggle with hallucination and under-performance. In this paper, we investigate the limitations of existing LLM-based approaches to CR, specifically the Question-Answering (QA) Template and Document Template methods, and propose two novel techniques: Reversed Training with Joint Inference and Iterative Document Generation. Our experiments show that Reversed Training improves the QA Template method, while Iterative Document Generation eliminates hallucinations in the generated source text and boosts coreference resolution. Integrating these methods and techniques offers an effective and robust solution to LLM-based coreference resolution.

[21] LVLMs are Bad at Overhearing Human Referential Communication

Zhengxiang Wang,Weiling Li,Panagiotis Kaliosis,Owen Rambow,Susan E. Brennan

Main category: cs.CL

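TL;DR: The paper studies seven state-of-the-art LVLMs on understanding referring expressions in spontaneous human conversations and finds that they perform poorly on this iterative dialogue task.

Details Motivation: To assess how well LVLMs integrate language, vision, and conversational interaction in real-world tasks, in particular their ability to understand referring expressions as overhearers of human conversation.

Contribution: (1) Evaluates LVLMs on a complex human conversation task; (2) releases a new conversation corpus and code for future research.

Method: Tests seven LVLMs on referring expressions in spontaneous human conversations and compares how their performance changes over multiple rounds of dialogue.

Result: All tested LVLMs perform poorly on the task and show no consistent improvement as the number of overheard rounds increases.

Insight: Current LVLMs remain limited in understanding referring expressions in complex human conversation and need further work to better integrate multimodal input with conversational context.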

Abstract: During spontaneous conversations, speakers collaborate on novel referring expressions, which they can then re-use in subsequent conversations. Understanding such referring expressions is an important ability for an embodied agent, so that it can carry out tasks in the real world. This requires integrating and understanding language, vision, and conversational interaction. We study the capabilities of seven state-of-the-art Large Vision Language Models (LVLMs) as overhearers to a corpus of spontaneous conversations between pairs of human discourse participants engaged in a collaborative object-matching task. We find that such a task remains challenging for current LVLMs and they all fail to show a consistent performance improvement as they overhear more conversations from the same discourse participants repeating the same task for multiple rounds. We release our corpus and code for reproducibility and to facilitate future research.

[22] HARP: Hallucination Detection via Reasoning Subspace Projection

Junjie Hu,Gang Tu,ShengYu Cheng,Jinxin Li,Jinting Wang,Rui Chen,Zhilong Zhou,Dongbo Shan

Main category: cs.CL

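TL;DR: HARP is a hallucination detection framework that decomposes the LLM hidden state space into a semantic subspace and a reasoning subspace, disentangles the two using SVD, and improves the robustness and performance of hallucination detection.

Details Motivation: Hallucination is a major barrier to the reliable use of large language models; existing methods fall short in disentangling semantic and reasoning information and in maintaining robustness.

Contribution: 1. Proposes the HARP framework and establishes that the LLM hidden state space can be decomposed into semantic and reasoning subspaces; 2. disentangles the subspaces using the Unembedding layer and SVD; 3. detects hallucinations efficiently from features projected onto the reasoning subspace.

Method: 1. Decompose the hidden state space into semantic and reasoning subspaces; 2. apply SVD to the Unembedding layer parameters to extract the subspace basis vectors; 3. use the projections onto the reasoning subspace as detection input features.

Result: HARP achieves state-of-the-art performance on multiple datasets, for example an AUROC of 92.8% on TriviaQA, 7.5% above the previous best method.

Insight: Through subspace decomposition and disentanglement, HARP both reduces feature dimensionality and improves detection robustness, suggesting a route toward more reliable LLM applications.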

Abstract: Hallucinations in Large Language Models (LLMs) pose a major barrier to their reliable use in critical decision-making. Although existing hallucination detection methods have improved accuracy, they still struggle with disentangling semantic and reasoning information and maintaining robustness. To address these challenges, we propose HARP (Hallucination detection via reasoning subspace projection), a novel hallucination detection framework. HARP establishes that the hidden state space of LLMs can be decomposed into a direct sum of a semantic subspace and a reasoning subspace, where the former encodes linguistic expression and the latter captures internal reasoning processes. Moreover, we demonstrate that the Unembedding layer can disentangle these subspaces, and by applying Singular Value Decomposition (SVD) to its parameters, the basis vectors spanning the semantic and reasoning subspaces are obtained. Finally, HARP projects hidden states onto the basis vectors of the reasoning subspace, and the resulting projections are then used as input features for hallucination detection in LLMs. By using these projections, HARP reduces the dimension of the feature to approximately 5% of the original, filters out most noise, and achieves enhanced robustness. Experiments across multiple datasets show that HARP achieves state-of-the-art hallucination detection performance; in particular, it achieves an AUROC of 92.8% on TriviaQA, outperforming the previous best method by 7.5%.
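The projection step described above can be sketched with NumPy on a toy unembedding matrix. The abstract does not state which singular vectors span the reasoning subspace; this sketch assumes, purely for illustration, that it is spanned by the right singular vectors with the smallest singular values, sized to roughly 5% of the hidden dimension. All names and shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 100, 40

# Toy stand-in for an LLM's unembedding matrix (vocab_size x hidden_dim).
W_U = rng.normal(size=(vocab_size, hidden_dim))

# Rows of Vt are orthonormal directions in hidden-state space,
# ordered by decreasing singular value.
_, singular_values, Vt = np.linalg.svd(W_U, full_matrices=False)

# ASSUMPTION: take the k lowest-singular-value directions as the
# "reasoning" basis, with k about 5% of the hidden dimension.
k = max(1, round(0.05 * hidden_dim))
reasoning_basis = Vt[-k:]

# Project one hidden state onto that basis to get low-dimensional features.
hidden_state = rng.normal(size=hidden_dim)
features = reasoning_basis @ hidden_state
```

In a full pipeline these `features` would be fed to a downstream hallucination classifier, which is not shown here.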

[23] HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking

Wensheng Lu,Keyu Chen,Ruizhi Qiao,Xing Sun

Main category: cs.CL

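TL;DR: The paper proposes the HiCBench benchmark and the HiChunk framework to improve document chunking quality in RAG systems, and validates both experimentally.

Details Motivation: Existing RAG evaluation benchmarks are inadequate for assessing document chunking quality, mainly because of evidence sparsity; new tools and methods are needed to improve chunking.

Contribution: 1. HiCBench: a dataset with multi-level document chunking annotations and evidence-dense QA pairs; 2. HiChunk: a multi-level document structuring framework based on fine-tuned LLMs, combined with the Auto-Merge retrieval algorithm to improve retrieval quality.

Method: 1. HiCBench contains manually annotated chunking points and synthesized evidence-dense QA pairs; 2. HiChunk uses fine-tuned LLMs and the Auto-Merge algorithm to improve document chunking.

Result: Experiments show that HiCBench effectively evaluates the impact of chunking methods on RAG systems, and that HiChunk improves chunking quality and overall RAG performance within reasonable time.

Insight: Chunking quality is critical to RAG performance; multi-level chunking and evidence-dense evaluation are key to improving retrieval.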

Abstract: Retrieval-Augmented Generation (RAG) enhances the response capabilities of language models by integrating external knowledge sources. However, document chunking, an important part of RAG systems, often lacks effective evaluation tools. This paper first analyzes why existing RAG evaluation benchmarks are inadequate for assessing document chunking quality, specifically due to evidence sparsity. Based on this conclusion, we propose HiCBench, which includes manually annotated multi-level document chunking points, synthesized evidence-dense question-answer (QA) pairs, and their corresponding evidence sources. Additionally, we introduce the HiChunk framework, a multi-level document structuring framework based on fine-tuned LLMs, combined with the Auto-Merge retrieval algorithm to improve retrieval quality. Experiments demonstrate that HiCBench effectively evaluates the impact of different chunking methods across the entire RAG pipeline. Moreover, HiChunk achieves better chunking quality within reasonable time consumption, thereby enhancing the overall performance of RAG systems.

[24] D$^2$HScore: Reasoning-Aware Hallucination Detection via Semantic Breadth and Depth Analysis in LLMs

Yue Ding,Xiaofang Zhu,Tianze Xia,Junfei Wu,Xinlong Chen,Qiang Liu,Liang Wang

Main category: cs.CL

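TL;DR: The paper proposes D²HScore, a training-free and label-free hallucination detection framework that analyzes semantic breadth and depth during LLM generation; experiments show that it outperforms existing training-free baselines.

Details Motivation: Large language models (LLMs) often generate non-factual content (hallucinations) in practice; ensuring the reliability of their outputs is critical, especially in high-stakes domains such as finance, security, and healthcare.

Contribution: Proposes the D²HScore framework, which detects hallucinations without training or labels by quantifying semantic diversity within each layer (Intra-Layer Dispersion) and the evolution of key semantics across layers (Inter-Layer Drift).

Method: D²HScore combines two dynamic analyses: (1) intra-layer dispersion, which measures semantic diversity, and (2) inter-layer drift, which tracks the evolution of key semantics; key tokens are selected using attention signals.

Result: Across five open-source LLMs and five benchmarks, D²HScore outperforms existing training-free baselines.

Insight: Capturing both the horizontal and vertical dynamics of token representations during generation detects hallucinations more effectively and offers a lightweight, interpretable approach to LLM reliability.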

Abstract: Although Large Language Models (LLMs) have achieved remarkable success, their practical application is often hindered by the generation of non-factual content, which is called “hallucination”. Ensuring the reliability of LLMs’ outputs is a critical challenge, particularly in high-stakes domains such as finance, security, and healthcare. In this work, we revisit hallucination detection from the perspective of model architecture and generation dynamics. Leveraging the multi-layer structure and autoregressive decoding process of LLMs, we decompose hallucination signals into two complementary dimensions: the semantic breadth of token representations within each layer, and the semantic depth of core concepts as they evolve across layers. Based on this insight, we propose D$^2$HScore (Dispersion and Drift-based Hallucination Score), a training-free and label-free framework that jointly measures: (1) Intra-Layer Dispersion, which quantifies the semantic diversity of token representations within each layer; and (2) Inter-Layer Drift, which tracks the progressive transformation of key token representations across layers. To ensure drift reflects the evolution of meaningful semantics rather than noisy or redundant tokens, we guide token selection using attention signals. By capturing both the horizontal and vertical dynamics of representation during inference, D$^2$HScore provides an interpretable and lightweight proxy for hallucination detection. Extensive experiments across five open-source LLMs and five widely used benchmarks demonstrate that D$^2$HScore consistently outperforms existing training-free baselines.
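The two quantities named in the abstract can be illustrated on toy vectors. The abstract gives no formulas, so the definitions below are assumptions chosen for simplicity: dispersion as the mean Euclidean distance of a layer's token vectors from their centroid, drift as the mean cosine distance of key tokens between consecutive layers, and key tokens as the top-k by attention weight. How the paper combines them into a single score is not stated and is not attempted here.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def intra_layer_dispersion(layer):
    """Mean Euclidean distance of token vectors from their centroid."""
    c = centroid(layer)
    return sum(math.dist(v, c) for v in layer) / len(layer)


def cosine_distance(a, b):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def top_k_tokens(attention, k):
    """Indices of the k tokens with the largest attention weight."""
    order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return order[:k]


def inter_layer_drift(layers, key_tokens):
    """Mean cosine distance of key tokens between consecutive layers."""
    steps = [
        cosine_distance(lower[t], upper[t])
        for lower, upper in zip(layers, layers[1:])
        for t in key_tokens
    ]
    return sum(steps) / len(steps)


# Toy hidden states indexed as layers[layer][token] -> 2-d vector.
layers = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 1.0], [0.0, 2.0], [2.0, 2.0]],
]
attention = [0.7, 0.1, 0.2]  # hypothetical per-token attention weights

key_tokens = top_k_tokens(attention, k=1)
dispersion = intra_layer_dispersion(layers[0])
drift = inter_layer_drift(layers, key_tokens)
```

Both measures need only a forward pass and no labels, which matches the training-free framing of the method.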

[25] AesBiasBench: Evaluating Bias and Alignment in Multimodal Language Models for Personalized Image Aesthetic Assessment

Kun Li,Lai-Man Po,Hongzheng Yang,Xuyuan Xu,Kangcheng Liu,Yuzhi Zhao

Main category: cs.CL

TL;DR: The paper introduces AesBiasBench, a benchmark to evaluate bias and alignment in Multimodal Large Language Models (MLLMs) for Personalized Image Aesthetic Assessment, revealing that smaller models show more bias while larger models align better with human preferences.

Details Motivation: To address the lack of frameworks for evaluating bias and alignment in MLLMs used for subjective tasks like image aesthetic assessment, particularly focusing on demographic influences.

Contribution: AesBiasBench, a benchmark with three subtasks and structured metrics (IFD, NRD, AAS) to measure bias and alignment in MLLMs.

Method: Evaluates 19 MLLMs (proprietary and open-source) by quantifying stereotype bias across demographic groups and alignment with human aesthetic preferences.

Result: Smaller models exhibit stronger stereotype biases, while larger models align better with human preferences. Identity information often worsens bias, especially in emotional judgments.

Insight: Identity-aware evaluation is crucial for subjective vision-language tasks to mitigate bias and improve alignment with human preferences.

Abstract: Multimodal Large Language Models (MLLMs) are increasingly applied in Personalized Image Aesthetic Assessment (PIAA) as a scalable alternative to expert evaluations. However, their predictions may reflect subtle biases influenced by demographic factors such as gender, age, and education. In this work, we propose AesBiasBench, a benchmark designed to evaluate MLLMs along two complementary dimensions: (1) stereotype bias, quantified by measuring variations in aesthetic evaluations across demographic groups; and (2) alignment between model outputs and genuine human aesthetic preferences. Our benchmark covers three subtasks (Aesthetic Perception, Assessment, Empathy) and introduces structured metrics (IFD, NRD, AAS) to assess both bias and alignment. We evaluate 19 MLLMs, including proprietary models (e.g., GPT-4o, Claude-3.5-Sonnet) and open-source models (e.g., InternVL-2.5, Qwen2.5-VL). Results indicate that smaller models exhibit stronger stereotype biases, whereas larger models align more closely with human preferences. Incorporating identity information often exacerbates bias, particularly in emotional judgments. These findings underscore the importance of identity-aware evaluation frameworks in subjective vision-language tasks.

[26] EthicsMH: A Pilot Benchmark for Ethical Reasoning in Mental Health AI

Sai Kartheek Reddy Kasu

Main category: cs.CL

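TL;DR: The paper introduces EthicsMH, a pilot benchmark dataset of 125 scenarios for ethical reasoning in mental health AI, designed to evaluate how AI systems handle ethical dilemmas.

Details Motivation: Existing benchmarks do not adequately cover the ethical issues specific to mental health practice, such as confidentiality, autonomy, and bias, so a dedicated evaluation tool is needed.

Contribution: Introduces the EthicsMH dataset, which provides structured scenarios with multiple decision options, expert-aligned reasoning, and multi-stakeholder viewpoints, filling a gap in the ethical evaluation of mental health AI.

Method: Uses model-assisted generation to create 125 ethical scenarios, each with structured fields that support evaluating both decision accuracy and explanation quality.

Result: Although modest in scale, the dataset bridges AI ethics and mental health decision-making and offers a seed resource that the community and experts can expand.

Insight: Ethical decision-making by AI in mental health must draw on professional norms and multiple stakeholder perspectives; purely technical evaluation is not enough for complex ethical problems.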

Abstract: The deployment of large language models (LLMs) in mental health and other sensitive domains raises urgent questions about ethical reasoning, fairness, and responsible alignment. Yet, existing benchmarks for moral and clinical decision-making do not adequately capture the unique ethical dilemmas encountered in mental health practice, where confidentiality, autonomy, beneficence, and bias frequently intersect. To address this gap, we introduce Ethical Reasoning in Mental Health (EthicsMH), a pilot dataset of 125 scenarios designed to evaluate how AI systems navigate ethically charged situations in therapeutic and psychiatric contexts. Each scenario is enriched with structured fields, including multiple decision options, expert-aligned reasoning, expected model behavior, real-world impact, and multi-stakeholder viewpoints. This structure enables evaluation not only of decision accuracy but also of explanation quality and alignment with professional norms. Although modest in scale and developed with model-assisted generation, EthicsMH establishes a task framework that bridges AI ethics and mental health decision-making. By releasing this dataset, we aim to provide a seed resource that can be expanded through community and expert contributions, fostering the development of AI systems capable of responsibly handling some of society’s most delicate decisions.

[27] A Dynamic Knowledge Update-Driven Model with Large Language Models for Fake News Detection

Di Jin,Jun Yang,Xiaobao Wang,Junwei Zhang,Shuqi Li,Dongxiao He

Main category: cs.CL

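TL;DR: The paper proposes DYNAMO, a dynamic knowledge update-driven model for fake news detection that combines knowledge graphs with large language models to address the low credibility of retrieved content and interference from noise in existing methods.

Details Motivation: The explosion of online information and the suddenness and instability of news events mean that fake news detection needs dynamically updated knowledge, while existing retrieval-augmented generation methods suffer from low-credibility content and noisy information.

Contribution: DYNAMO builds a news-domain knowledge graph, uses Monte Carlo Tree Search to decompose and verify news, and updates its knowledge dynamically, performing both news authenticity detection and verification of new knowledge.

Method: 1) Construct a news-domain knowledge graph; 2) use Monte Carlo Tree Search to verify complex news step by step; 3) extract and update knowledge from verified real news.

Result: Experiments show that DYNAMO achieves the best performance on two real-world datasets.

Insight: Combining dynamic knowledge updating with knowledge graphs can effectively improve the accuracy and robustness of fake news detection.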

Abstract: As the Internet and social media evolve rapidly, distinguishing credible news from a vast amount of complex information poses a significant challenge. Due to the suddenness and instability of news events, the authenticity labels of news can potentially shift as events develop, making it crucial for fake news detection to obtain the latest event updates. Existing methods employ retrieval-augmented generation to fill knowledge gaps, but they suffer from issues such as insufficient credibility of retrieved content and interference from noisy information. We propose a dynamic knowledge update-driven model for fake news detection (DYNAMO), which leverages knowledge graphs to achieve continuous updating of new knowledge and integrates with large language models to fulfill dual functions: news authenticity detection and verification of new knowledge correctness, solving the two key problems of ensuring the authenticity of new knowledge and deeply mining news semantics. Specifically, we first construct a news-domain-specific knowledge graph. Then, we use Monte Carlo Tree Search to decompose complex news and verify them step by step. Finally, we extract and update new knowledge from verified real news texts and reasoning paths. Experimental results demonstrate that DYNAMO achieves the best performance on two real-world datasets.

[28] CoachMe: Decoding Sport Elements with a Reference-Based Coaching Instruction Generation Model

Wei-Hsin Yeh,Yu-An Su,Chih-Ning Chen,Yi-Hsueh Lin,Calvin Ku,Wen-Hsin Chiu,Min-Chun Hu,Lun-Wei Ku

Main category: cs.CL

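TL;DR: CoachMe is a reference-based coaching instruction generation model that analyzes the temporal and physical differences between a learner's motion and a reference motion to provide high-quality guidance for improving technique, with a focus on sports such as skating and boxing.

Details Motivation: Motion instruction is crucial for athletes refining their technique, but current multimodal models still struggle to generate precise, sport-specific instructions. CoachMe aims to address this problem.

Contribution: Proposes CoachMe, a reference-based model that analyzes motion differences and provides high-quality coaching instructions, clearly outperforming GPT-4o on skating and boxing tasks.

Method: Analyzes the temporal and physical differences between the learner's motion and a reference motion, and combines domain knowledge with a coach-like thinking process to generate corrective guidance.

Result: On skating and boxing, CoachMe exceeds GPT-4o in G-Eval by 31.6% and 58.3% respectively, and it elaborates on errors and how to correct them.

Insight: Domain knowledge and motion-difference analysis are key to generating high-quality coaching instructions, and the model can adapt to a specific sport with limited data.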

Abstract: Motion instruction is a crucial task that helps athletes refine their technique by analyzing movements and providing corrective guidance. Although recent advances in multimodal models have improved motion understanding, generating precise and sport-specific instruction remains challenging due to the highly domain-specific nature of sports and the need for informative guidance. We propose CoachMe, a reference-based model that analyzes the differences between a learner’s motion and a reference under temporal and physical aspects. This approach enables both domain-knowledge learning and the acquisition of a coach-like thinking process that identifies movement errors effectively and provides feedback to explain how to improve. In this paper, we illustrate how CoachMe adapts well to specific sports such as skating and boxing by learning from general movements and then leveraging limited data. Experiments show that CoachMe provides high-quality instructions instead of directions merely in the tone of a coach but without critical information. CoachMe outperforms GPT-4o by 31.6% in G-Eval on figure skating and by 58.3% on boxing. Analysis further confirms that it elaborates on errors and their corresponding improvement methods in the generated instructions. You can find CoachMe here: https://motionxperts.github.io/

[29] Room acoustics affect communicative success in hybrid meeting spaces: a pilot study

Robert Einig,Stefan Janscha,Jonas Schuster,Julian Koch,Martin Hagmueller,Barbara Schuppler

Main category: cs.CL

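TL;DR: This pilot study examines how the acoustic design of a seminar room at Graz University of Technology affects communication in hybrid meetings, finding that improved acoustics may raise communicative success, although the small sample means the results do not reach statistical significance.

Details Motivation: Since the COVID-19 pandemic, demand for hybrid meeting spaces has surged, but acoustic design is often overlooked, which can lead to misunderstandings, reduced intelligibility, or fatigue. The study tests whether acoustic interventions benefit communication in hybrid meetings.

Contribution: Provides a first empirical study of how acoustic design affects communication in hybrid meeting spaces, offering preliminary data and direction for larger future studies.

Method: Records two groups of people in the same seminar room before and after an acoustic intervention and compares the two conditions.

Result: Although the small sample means the results do not reach statistical significance, the findings indicate that the acoustic intervention improved communicative success in hybrid meetings.

Insight: Acoustic design is an important factor in the communicative success of hybrid meeting spaces; future studies should use larger samples to confirm the effect.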

Abstract: Since the COVID-19 pandemic in 2020, universities and companies have increasingly integrated hybrid features into their meeting spaces, or even created dedicated rooms for this purpose. While the importance of a fast and stable internet connection is often prioritized, the acoustic design of seminar rooms is frequently overlooked. Poor acoustics, particularly excessive reverberation, can lead to issues such as misunderstandings, reduced speech intelligibility or cognitive and vocal fatigue. This pilot study investigates whether room acoustic interventions in a seminar room at Graz University of Technology support better communication in hybrid meetings. For this purpose, we recorded two groups of persons twice, once before and once after improving the acoustics of the room. Our findings, despite not reaching statistical significance due to the small sample size, indicate clearly that our spatial interventions improve communicative success in hybrid meetings. To make the paper accessible also for readers from the speech communication community, we explain the room acoustics background relevant for the interpretation of our results.

[30] An Agentic Toolkit for Adaptive Information Extraction from Regulatory Documents

Gaye Colakoglu,Gürkan Solmaz,Jonathan Fürst

Main category: cs.CL

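TL;DR: The paper presents an adaptive information extraction toolkit for Declaration of Performance (DoP) documents mandated by EU regulation, addressing the challenges of document diversity and automated extraction.

Details Motivation: DoP documents vary widely in layout, language, schema, and format; existing static or LLM-only information extraction pipelines tend to hallucinate and adapt poorly to this diversity.

Contribution: A domain-specific, stateful agentic system built on a planner-executor-responder architecture that infers user intent, detects document modality, and orchestrates tools dynamically.

Method: Uses a planner-executor-responder agent architecture with dynamic tool orchestration and state management to avoid tool misuse and execution loops.

Result: Evaluation on a DoP dataset shows improved robustness across formats and languages, offering a scalable data extraction solution for regulated workflows.

Insight: An agentic architecture with dynamic tool orchestration can handle document diversity effectively and improve the reliability and adaptability of information extraction.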

Abstract: Declaration of Performance (DoP) documents, mandated by EU regulation, certify the performance of construction products. While some of their content is standardized, DoPs vary widely in layout, language, schema, and format, posing challenges for automated key-value pair extraction (KVP) and question answering (QA). Existing static or LLM-only IE pipelines often hallucinate and fail to adapt to this structural diversity. Our domain-specific, stateful agentic system addresses these challenges through a planner-executor-responder architecture. The system infers user intent, detects document modality, and orchestrates tools dynamically for robust, traceable reasoning while avoiding tool misuse or execution loops. Evaluation on a curated DoP dataset demonstrates improved robustness across formats and languages, offering a scalable solution for structured data extraction in regulated workflows.

[31] PledgeTracker: A System for Monitoring the Fulfilment of Pledges

Yulong Chen,Michael Sejr Schlichtkrull,Zhenyun Deng,David Corney,Nasim Asl,Joshua Salisbury,Andrew Dudfield,Andreas Vlachos

Main category: cs.CL

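TL;DR: PledgeTracker is a system for monitoring the fulfilment of political pledges; it handles the dynamic, multi-document, and time-sensitive nature of the task by constructing structured event timelines, and it substantially reduces human verification effort.

Details Motivation: Existing methods reduce pledge tracking to document classification, overlooking its dynamic, temporal, and multi-document nature, which leads to poor results.

Contribution: Proposes the PledgeTracker system, which reformulates pledge verification as structured event timeline construction, with three core modules: evidence retrieval, timeline construction, and fulfilment filtering.

Method: The system comprises a multi-step evidence retrieval module, a timeline construction module, and a fulfilment filtering module, which together capture how evidence evolves and output a structured timeline.

Result: A real-world evaluation in collaboration with professional fact-checkers shows that the system retrieves relevant evidence effectively and substantially reduces human verification effort.

Insight: Structured event timelines suit the dynamic, multi-document task of political pledge verification better than classification, and they improve interpretability and efficiency.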

Abstract: Political pledges reflect candidates’ policy commitments, but tracking their fulfilment requires reasoning over incremental evidence distributed across multiple, dynamically updated sources. Existing methods simplify this task into a document classification task, overlooking its dynamic, temporal and multi-document nature. To address this issue, we introduce PledgeTracker, a system that reformulates pledge verification into structured event timeline construction. PledgeTracker consists of three core components: (1) a multi-step evidence retrieval module; (2) a timeline construction module; and (3) a fulfilment filtering module, allowing the capture of the evolving nature of pledge fulfilment and producing interpretable and structured timelines. We evaluate PledgeTracker in collaboration with professional fact-checkers in real-world workflows, demonstrating its effectiveness in retrieving relevant evidence and reducing human verification effort.

[32] SCDTour: Embedding Axis Ordering and Merging for Interpretable Semantic Change Detection

Taichi Aida,Danushka Bollegala

Main category: cs.CL

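TL;DR: SCDTour orders and merges interpretable embedding axes to balance performance and interpretability in Semantic Change Detection (SCD), while also improving results on the SCD task.

Details Motivation: In Semantic Change Detection (SCD), making embeddings more interpretable usually lowers SCD performance, and vice versa. This trade-off limits the practical usefulness of the models.

Contribution: SCDTour is a new method that orders and merges interpretable axes, preserving SCD performance while improving the interpretability of the embeddings.

Method: SCDTour considers both the semantic similarity between axes in the embedding space and the degree to which each axis contributes to semantic change; ordering and merging the axes yields a more refined set of word senses.

Result: Experiments show that SCDTour preserves SCD performance while substantially improving interpretability; even the merged, lower-dimensional embeddings match or exceed the performance of the original full-dimensional embeddings.

Insight: A carefully designed axis ordering and merging strategy can substantially improve embedding interpretability without sacrificing performance, offering a new way to interpret semantic change.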

Abstract: In Semantic Change Detection (SCD), it is a common problem to obtain embeddings that are both interpretable and high-performing. However, improving interpretability often leads to a loss in the SCD performance, and vice versa. To address this problem, we propose SCDTour, a method that orders and merges interpretable axes to alleviate the performance degradation of SCD. SCDTour considers both (a) semantic similarity between axes in the embedding space, as well as (b) the degree to which each axis contributes to semantic change. Experimental results show that SCDTour preserves performance in semantic change detection while maintaining high interpretability. Moreover, agglomerating the sorted axes produces a more refined set of word senses, which achieves comparable or improved performance against the original full-dimensional embeddings in the SCD task. These findings demonstrate that SCDTour effectively balances interpretability and SCD performance, enabling meaningful interpretation of semantic shifts through a small number of refined axes. Source code is available at https://github.com/LivNLP/svp-tour .

[33] MOOM: Maintenance, Organization and Optimization of Memory in Ultra-Long Role-Playing Dialogues

Weishu Chen,Jinyi Tang,Zhouhui Hou,Shihao Han,Mingjie Zhan,Zhiyuan Huang,Delong Liu,Jiawei Guo,Zhicheng Zhao,Fei Su

Main category: cs.CL

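TL;DR: MOOM is a dual-branch memory plugin for maintaining, organizing, and optimizing memory in ultra-long role-playing dialogues; it models plot development and character portrayal, uses a forgetting mechanism to bound memory capacity, and is shown to outperform prior methods on the ZH-4O dataset.

Details Motivation: Existing methods often exhibit uncontrolled memory growth in ultra-long dialogues; a solution is needed that maintains and optimizes memory in a structured way.

Contribution: Proposes MOOM, the first dual-branch memory plugin grounded in literary theory, combining plot conflicts and character portrayal; releases ZH-4O, a Chinese ultra-long dialogue dataset.

Method: MOOM's two branches handle plot conflicts across multiple time scales and the user's character profile respectively, and a forgetting mechanism based on competition-inhibition theory bounds memory capacity.

Result: Experiments show that MOOM outperforms existing methods and requires fewer large language model invocations while keeping memory capacity controllable.

Insight: Combining literary theory with a forgetting mechanism offers a new approach to memory management in dialogue systems, and the ZH-4O dataset fills a gap in Chinese ultra-long dialogue resources.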

Abstract: Memory extraction is crucial for maintaining coherent ultra-long dialogues in human-robot role-playing scenarios. However, existing methods often exhibit uncontrolled memory growth. To address this, we propose MOOM, the first dual-branch memory plugin that leverages literary theory by modeling plot development and character portrayal as core storytelling elements. Specifically, one branch summarizes plot conflicts across multiple time scales, while the other extracts the user’s character profile. MOOM further integrates a forgetting mechanism, inspired by the “competition-inhibition” memory theory, to constrain memory capacity and mitigate uncontrolled growth. Furthermore, we present ZH-4O, a Chinese ultra-long dialogue dataset specifically designed for role-playing, featuring dialogues that average 600 turns and include manually annotated memory information. Experimental results demonstrate that MOOM outperforms all state-of-the-art memory extraction methods, requiring fewer large language model invocations while maintaining a controllable memory capacity.

[34] Spec-LLaVA: Accelerating Vision-Language Models with Dynamic Tree-Based Speculative Decoding

Mingxiao Huo,Jiayi Zhang,Hewei Wang,Jinfeng Xu,Zheyu Chen,Huilin Tai,Yijun Chen

Main category: cs.CL

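TL;DR: Spec-LLaVA accelerates vision-language models with dynamic tree-based speculative decoding, pairing a lightweight draft model with a large target model to speed up inference by up to 3.28x while preserving generation quality.

Details Motivation: Existing vision-language models (VLMs) are slow at inference, which limits their deployment in real-time applications. A lossless acceleration method is needed to improve efficiency.

Contribution: 1. Proposes the Spec-LLaVA framework, which accelerates VLMs with dynamic tree-based speculative decoding; 2. pairs a lightweight draft model with a large target model for lossless acceleration; 3. suits resource-constrained and on-device deployment.

Method: 1. A lightweight draft model speculates future tokens; 2. the large target model verifies them in parallel, with a dynamic tree algorithm expanding and pruning branches; 3. branching adapts to the draft model's confidence.

Result: On the MS COCO dataset, Spec-LLaVA achieves up to 3.28x faster decoding on LLaVA-1.5 (7B, 13B) with no loss in generation quality.

Insight: Dynamic tree-structured speculative decoding offers a practical path to real-time multimodal applications, especially in resource-constrained settings.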

Abstract: Vision-Language Models (VLMs) enable powerful multimodal reasoning but suffer from slow autoregressive inference, limiting their deployment in real-time applications. We introduce Spec-LLaVA, a system that applies speculative decoding to accelerate VLMs without sacrificing output quality. Spec-LLaVA pairs a lightweight draft VLM with a large target model: the draft speculates future tokens, which the target verifies in parallel, allowing multiple tokens to be generated per step. To maximize efficiency, we design a dynamic tree-based verification algorithm that adaptively expands and prunes speculative branches using draft model confidence. On MS COCO out-of-domain images, Spec-LLaVA achieves up to 3.28$\times$ faster decoding on LLaVA-1.5 (7B, 13B) with no loss in generation quality. This work presents a lossless acceleration framework for VLMs using dynamic tree-structured speculative decoding, opening a path toward practical real-time multimodal assistants. Importantly, the lightweight draft model design makes the framework amenable to resource-constrained or on-device deployment settings.
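The draft-then-verify loop can be illustrated with a deliberately simplified sketch: a single linear chain of draft tokens under greedy decoding, rather than the dynamic tree the paper describes, and with toy integer "models" standing in for the draft and target VLMs. Every name here is hypothetical. Under greedy decoding the output matches target-only decoding, which is the sense in which this scheme is lossless.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, draft_len=4):
    """Greedy speculative decoding with a linear draft (no tree)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # The draft model proposes draft_len tokens autoregressively.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))
        # The target checks every proposed position; a real system does
        # this in one batched forward pass, counted here as one pass.
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # correct the first mismatch, drop the rest
                break
            accepted.append(tok)
        else:
            # All draft tokens accepted: the same pass yields one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


def greedy_decode(next_token, prompt, max_new_tokens):
    """Plain one-token-per-pass greedy decoding, used as the baseline."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def target_next(seq):
    """Toy target model: count upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy draft model: agrees with the target except right after a 5."""
    return 0 if seq[-1] == 5 else (seq[-1] + 1) % 10


speculative, passes = speculative_decode(target_next, draft_next, [0], 8)
baseline = greedy_decode(target_next, [0], 8)
```

The speedup comes from the target model needing fewer passes than generated tokens; a tree of drafts, as in Spec-LLaVA, raises the chance that some branch is accepted per pass.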

[35] ToolRM: Outcome Reward Models for Tool-Calling Large Language Models

Mayank Agarwal,Ibrahim Abdelaziz,Kinjal Basu,Merve Unuvar,Luis A. Lastras,Yara Rizk,Pavan Kapanipathi

Main category: cs.CL

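TL;DR: This paper introduces FC-RewardBench, a new benchmark for evaluating reward models in tool-calling scenarios, and develops a training framework for outcome-based reward models that clearly outperform general-purpose baselines.

Details Motivation: As large language models (LLMs) increasingly interact with external tools, reward modeling for tool use has become a critical yet underexplored area. Existing reward models are trained mainly on natural language outputs and struggle to evaluate tool-based reasoning and execution.

Contribution: The main contributions are: 1) FC-RewardBench, the first benchmark designed specifically to evaluate reward models in tool-calling scenarios; 2) a training framework for outcome-based reward models that substantially improves performance on tool-use tasks; 3) trained models ranging from 1.7B to 14B parameters that generalize across tasks.

Method: 1) Quantify the limitations of existing models with FC-RewardBench; 2) synthesize data from open-weight LLMs and train outcome-based reward models on it; 3) improve model performance through reward-guided filtering and data-efficient fine-tuning.

Result: The proposed reward models consistently outperform general-purpose baselines across seven out-of-domain benchmarks, with up to 25% average improvement in downstream task performance, and they enable data-efficient fine-tuning through reward-guided filtering.

Insight: The study exposes the limitations of existing reward models in tool-calling scenarios and underlines the importance of domain-specific modeling. It also shows that synthetic data generated from open-weight LLMs is an effective route to training strong reward models.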

Abstract: As large language models (LLMs) increasingly interact with external tools, reward modeling for tool use has become a critical yet underexplored area. Existing reward models, trained primarily on natural language outputs, struggle to evaluate tool-based reasoning and execution. To quantify this gap, we introduce FC-RewardBench, the first benchmark designed to systematically assess reward models’ performance in tool-calling scenarios. Our analysis shows that current reward models often miss key signals of effective tool use, highlighting the need for domain-specific modeling. To address this, we propose a training framework for outcome-based reward models using data synthesized from permissively licensed, open-weight LLMs. We train models ranging from 1.7B to 14B parameters and evaluate them across seven out-of-domain benchmarks. These models consistently outperform general-purpose baselines, achieving up to 25% average improvement in downstream task performance and enabling data-efficient fine-tuning through reward-guided filtering.

[36] SENSE models: an open source solution for multilingual and multimodal semantic-based tasks

Salima Mdhaffar,Haroun Elleuch,Chaimae Chellaf,Ha Nguyen,Yannick Estève

Main category: cs.CL

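TL;DR: This paper introduces SENSE (Shared Embedding for N-lingual Speech and tExt), an open-source solution based on the SAMU-XLSR framework, improved with a stronger teacher text model and a better initial speech encoder. SENSE performs strongly on several multilingual and multimodal semantic tasks and offers new insights into semantically aligned speech encoders.

Details Motivation: Existing approaches to multilingual and multimodal semantic tasks have limitations, so a more effective open-source solution is needed to improve semantic alignment.

Contribution: Proposes SENSE, an open-source multilingual and multimodal semantic alignment model that integrates an improved teacher model and speech encoder and is released through the SpeechBrain toolkit.

Method: Uses a teacher-student framework that aligns a self-supervised speech encoder with a language-agnostic text encoder at the utterance level, with a stronger teacher model and a better initial speech encoder.

Result: SENSE achieves highly competitive performance on multilingual and multimodal semantic tasks.

Insight: The study sheds light on how semantics are captured in aligned speech encoders, pointing to directions for future research.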

Abstract: This paper introduces SENSE (Shared Embedding for N-lingual Speech and tExt), an open-source solution inspired by the SAMU-XLSR framework and conceptually similar to Meta AI’s SONAR models. These approaches rely on a teacher-student framework to align a self-supervised speech encoder with the language-agnostic continuous representations of a text encoder at the utterance level. We describe how the original SAMU-XLSR method has been updated by selecting a stronger teacher text model and a better initial speech encoder. The source code for training and using SENSE models has been integrated into the SpeechBrain toolkit, and the first SENSE model we trained has been publicly released. We report experimental results on multilingual and multimodal semantic tasks, where our SENSE model achieves highly competitive performance. Finally, this study offers new insights into how semantics are captured in such semantically aligned speech encoders.
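
The utterance-level teacher-student alignment mentioned in this abstract can be illustrated with a small sketch. It assumes mean pooling over speech frames and a cosine-distance objective against a frozen text embedding; both choices, and the function names, are assumptions made for illustration rather than details confirmed by the abstract.

```python
import math
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Collapse frame-level speech features into one utterance-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def alignment_loss(speech_frames: Sequence[Sequence[float]],
                   text_embedding: Sequence[float]) -> float:
    """Student (speech) embedding is pulled toward the frozen teacher (text) embedding."""
    return cosine_distance(mean_pool(speech_frames), text_embedding)
```

In training, only the speech encoder would receive gradients from this loss, so spoken utterances in any covered language end up near the text embedding of their transcription.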

[37] Is ‘Hope’ a person or an idea? A pilot benchmark for NER: comparing traditional NLP tools and large language models on ambiguous entities

Payam Latifi

Main category: cs.CL

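TL;DR: This paper uses a small-scale benchmark to compare traditional NLP tools and large language models (LLMs) on named entity recognition (NER). LLMs do better on context-sensitive entities, while traditional tools are more consistent on structured tags.

Details Motivation: The study compares traditional NLP tools with LLMs on NER, particularly on ambiguous entities, to inform model selection.

Contribution: Builds a small, carefully annotated NER benchmark dataset and evaluates several systems on it.

Method: Evaluates six systems (three traditional tools and three LLMs) against a 119-token gold-standard dataset, using F1-score as the metric.

Result: LLMs such as Gemini outperform traditional tools on context-sensitive entities such as person names, while traditional tools such as Stanza are more consistent on structured tags such as LOCATION and DATE.

Insight: Although LLMs have an advantage in contextual understanding, traditional tools remain competitive on specific tasks, and performance varies among LLMs.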

Abstract: This pilot study presents a small-scale but carefully annotated benchmark of Named Entity Recognition (NER) performance across six systems: three non-LLM NLP tools (NLTK, spaCy, Stanza) and three general-purpose large language models (LLMs: Gemini-1.5-flash, DeepSeek-V3, Qwen-3-4B). The dataset contains 119 tokens covering five entity types (PERSON, LOCATION, ORGANIZATION, DATE, TIME). We evaluated each system’s output against the manually annotated gold standard dataset using F1-score. The results show that LLMs generally outperform conventional tools in recognizing context-sensitive entities like person names, with Gemini achieving the highest average F1-score. However, traditional systems like Stanza demonstrate greater consistency in structured tags such as LOCATION and DATE. We also observed variability among LLMs, particularly in handling temporal expressions and multi-word organizations. Our findings highlight that while LLMs offer improved contextual understanding, traditional tools remain competitive in specific tasks, informing model selection.
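
For readers who want to reproduce the kind of scoring this abstract describes, here is a minimal sketch of a per-entity-type F1 computation. It assumes token-level matching against the gold labels; the abstract does not say whether scoring was token-level or span-level, so that choice and the function name are assumptions.

```python
from typing import Sequence


def f1_for_label(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """Token-level F1 for one entity type (assumed scoring scheme)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative tokens only: "Hope" tagged as a person in the gold labels
# but missed by a hypothetical system.
gold_tags = ["PERSON", "O", "O", "LOCATION"]
pred_tags = ["O", "O", "O", "LOCATION"]
person_f1 = f1_for_label(gold_tags, pred_tags, "PERSON")
location_f1 = f1_for_label(gold_tags, pred_tags, "LOCATION")
```

Averaging the per-type scores over the five entity types would give a macro-average comparable in spirit to the "average F1-score" the abstract mentions.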

[38] In-domain SSL pre-training and streaming ASR

Jarod Duret,Salima Mdhaffar,Gaëlle Laperrière,Ryan Whetten,Audrey Galametz,Catherine Kobus,Marion-Cécile Martin,Jo Oleiwan,Yannick Estève

Main category: cs.CL

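TL;DR: The paper studies how in-domain self-supervised (SSL) pre-training on Air Traffic Control (ATC) data improves offline and streaming automatic speech recognition (ASR). It proposes a low-latency streaming approach combining chunked attention and dynamic convolutions and substantially reduces word error rate on ATC tasks.

Details Motivation: General-purpose speech encoders may underperform in specialized domains such as ATC, so domain-adapted pre-trained models and low-latency streaming methods are important for ASR in safety-critical applications.

Contribution: 1) In-domain self-supervised pre-training of BEST-RQ models on ATC data; 2) a low-latency streaming ASR architecture that combines chunked attention and dynamic convolutions; 3) evidence that domain-adapted pre-training and the streaming approach bring substantial gains on ATC tasks.

Method: 1) Pre-train BEST-RQ models on 4.5k hours of unlabeled ATC data; 2) fine-tune on a smaller labeled ATC set; 3) use chunked attention and dynamic convolutions for low-latency streaming.

Result: The domain-adapted BEST-RQ models substantially outperform general-purpose speech encoders (such as w2v-BERT 2.0 and HuBERT) on ATC tasks, and the proposed streaming approach further reduces word error rate under tighter latency constraints.

Insight: Domain-specific pre-training combined with efficient streaming is a practical path to better ASR in safety-critical domains.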

Abstract: In this study, we investigate the benefits of domain-specific self-supervised pre-training for both offline and streaming ASR in Air Traffic Control (ATC) environments. We train BEST-RQ models on 4.5k hours of unlabeled ATC data, then fine-tune on a smaller supervised ATC set. To enable real-time processing, we propose using chunked attention and dynamic convolutions, ensuring low-latency inference. We compare these in-domain SSL models against state-of-the-art, general-purpose speech encoders such as w2v-BERT 2.0 and HuBERT. Results show that domain-adapted pre-training substantially improves performance on standard ATC benchmarks, significantly reducing word error rates when compared to models trained on broad speech corpora. In addition, the proposed streaming approach improves word error rate under tighter latency constraints, making it particularly suitable for safety-critical aviation applications. These findings highlight that specializing SSL representations for ATC data is a practical path toward more accurate and efficient ASR systems in real-world operational settings.
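
Chunked attention, which this abstract names as one ingredient of the streaming setup, can be pictured as an attention mask. The sketch below builds such a mask under the assumption that each frame may attend to its own chunk plus a fixed number of preceding chunks; the chunk size, the amount of left context, and the function name are illustrative assumptions, and dynamic convolutions are not covered.

```python
from typing import List


def chunked_attention_mask(num_frames: int, chunk_size: int,
                           left_chunks: int = 1) -> List[List[bool]]:
    """mask[q][k] is True when query frame q may attend to key frame k."""
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        row = []
        for k in range(num_frames):
            k_chunk = k // chunk_size
            # Visible: the query's own chunk and up to `left_chunks` earlier chunks.
            row.append(q_chunk - left_chunks <= k_chunk <= q_chunk)
        mask.append(row)
    return mask


# Six frames in chunks of two, each chunk seeing one chunk of history.
example_mask = chunked_attention_mask(6, 2, left_chunks=1)
```

Because no frame sees beyond the end of its own chunk, the encoder can emit output as soon as a chunk is complete, so latency is bounded by the chunk length rather than the utterance length.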

[39] GTA: Supervised-Guided Reinforcement Learning for Text Classification with Large Language Models

Min Zeng,Jinfei Sun,Xueyou Luo,Caiquan Liu,Shiqi Zhang,Li Xie,Xiaoxin Chen

Main category: cs.CL

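TL;DR: The GTA framework combines supervised fine-tuning (SFT) and reinforcement learning (RL) through a "Guess-Think-Answer" mechanism to improve both the efficiency and the performance of text classification.

Details Motivation: Pure RL methods suffer from inefficient exploration and slow convergence, while pure SFT methods have a lower performance ceiling and a weaker theoretical foundation. GTA aims to combine the strengths of both.

Contribution: Proposes the GTA framework, which integrates the training efficiency of SFT with the capability gains of RL and resolves conflicts between the two signals through loss masking and gradient constraints.

Method: GTA proceeds in three steps: guess (optimized with cross-entropy loss), think (reflection, shaped by RL rewards), and answer, combining the two training signals.

Result: On four text classification benchmarks, GTA substantially accelerates convergence and outperforms both standalone SFT and standalone RL.

Insight: Mixing supervised and reinforcement learning signals can improve training efficiency and model capability at the same time, and gradient constraints are an effective way to resolve conflicts between them.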

Abstract: In natural language processing tasks, pure reinforcement learning (RL) fine-tuning methods often suffer from inefficient exploration and slow convergence, while supervised fine-tuning (SFT) methods, although efficient in training, have a limited performance ceiling and a less solid theoretical foundation than RL. To address this efficiency-capability trade-off, we propose the Guess-Think-Answer (GTA) framework, which combines the efficiency of SFT with the capability gains of RL in a unified training paradigm. GTA works by having the model first produce a provisional guess (optimized via cross-entropy loss), then reflect on this guess before generating the final answer, with RL rewards shaping both the final output and the format of the entire GTA structure. This hybrid approach achieves both faster convergence than pure RL and a higher performance ceiling than pure SFT. To mitigate gradient conflicts between the two training signals, we employ loss masking and gradient constraints. Empirical results on four text classification benchmarks demonstrate that GTA substantially accelerates convergence while outperforming both standalone SFT and RL baselines.

[40] CBP-Tuning: Efficient Local Customization for Black-box Large Language Models

Jiaxuan Zhao,Naibin Gu,Yuchen Feng,Xiyu Liu,Peng Fu,Zheng Lin,Weiping Wang

Main category: cs.CL

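TL;DR: CBP-Tuning is a framework for efficient local customization of black-box large language models. Its two-stage design preserves bidirectional privacy, and users need only a single customized vector to adapt to a task.

Details Motivation: The high cost of customizing large language models limits their ability to meet individual user needs, while the cloud-service model introduces privacy risks. An efficient, privacy-preserving local customization method is therefore needed.

Contribution: Proposes the CBP-Tuning framework, which supports local customization without access to model weights or upload of private data, using a two-stage design (server-side prompt generation and user-side gradient-free optimization) for efficient adaptation.

Method: 1. On the server side, train a prompt generator that captures domain-specific, task-agnostic capabilities; 2. On the user side, customize soft prompts through gradient-free optimization, requiring only one vector per task.

Result: In commonsense reasoning, medical, and financial domains, CBP-Tuning outperforms baseline methods, demonstrating its advantages in task-agnostic processing and privacy preservation.

Insight: By separating domain-level capabilities from task-level customization, CBP-Tuning balances privacy preservation and performance and offers a new approach to efficient customization of black-box models.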

Abstract: The high costs of customizing large language models (LLMs) fundamentally limit their adaptability to user-specific needs. Consequently, LLMs are increasingly offered as cloud-based services, a paradigm that introduces critical limitations: providers struggle to support personalized customization at scale, while users face privacy risks when exposing sensitive data. To address this dual challenge, we propose Customized Black-box Prompt Tuning (CBP-Tuning), a novel framework that facilitates efficient local customization while preserving bidirectional privacy. Specifically, we design a two-stage framework: (1) a prompt generator trained on the server-side to capture domain-specific and task-agnostic capabilities, and (2) user-side gradient-free optimization that tailors soft prompts for individual tasks. This approach eliminates the need for users to access model weights or upload private data, requiring only a single customized vector per task while achieving effective adaptation. Furthermore, the evaluation of CBP-Tuning in the commonsense reasoning, medical and financial domain settings demonstrates superior performance compared to baselines, showcasing its advantages in task-agnostic processing and privacy preservation.
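
The user-side stage described in this abstract optimizes a single vector using only loss values returned by a black box. The sketch below shows that idea with a simple (1+1) random-search strategy. The abstract does not name the optimizer, so this choice, the function name `gradient_free_tune`, and the toy quadratic loss are assumptions standing in for "score the prompt vector by querying the remote model".

```python
import random
from typing import Callable, List, Tuple


def gradient_free_tune(loss_fn: Callable[[List[float]], float], dim: int,
                       steps: int = 200, sigma: float = 0.3,
                       seed: int = 0) -> Tuple[List[float], float]:
    """Minimise a black-box loss over one vector without any gradients."""
    rng = random.Random(seed)
    best = [0.0] * dim
    best_loss = loss_fn(best)
    for _ in range(steps):
        # Perturb the current best vector with Gaussian noise.
        candidate = [x + rng.gauss(0.0, sigma) for x in best]
        candidate_loss = loss_fn(candidate)
        # Keep the candidate only if the black box reports a lower loss.
        if candidate_loss < best_loss:
            best, best_loss = candidate, candidate_loss
    return best, best_loss


def toy_loss(vector: List[float]) -> float:
    """Hypothetical stand-in for the task loss reported by the remote model."""
    target = [1.0, -2.0, 0.5]
    return sum((a - b) ** 2 for a, b in zip(vector, target))
```

Only loss values cross the boundary between user and provider in this loop: the user never sees model weights, and the provider never sees the user's labelled data, which is the privacy property the abstract emphasizes.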

[41] XplaiNLP at CheckThat! 2025: Multilingual Subjectivity Detection with Finetuned Transformers and Prompt-Based Inference with Large Language Models

Ariana Sahitaj,Jiaao Li,Pia Wenzel Neves,Fedor Splitt,Premtim Sahitaj,Charlott Jakob,Veronika Solopova,Vera Schmitt

Main category: cs.CL

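TL;DR: For the CheckThat! 2025 task, the XplaiNLP team studied two approaches to multilingual subjectivity detection: fine-tuned transformer models and zero-shot prompting of large language models. The Annotation approach ranked first in the Italian task, and XLM-RoBERTa ranked third in the Romanian task.

Details Motivation: To study the challenges of multilingual subjectivity detection and compare fine-tuned transformer models with zero-shot large language models across languages.

Contribution: Presents two approaches, fine-tuned transformer models and zero-shot prompting of LLMs, and evaluates them across tasks in several languages.

Method: 1. Fine-tune EuroBERT, XLM-RoBERTa, and German-BERT; 2. Prompt o3-mini and gpt-4.1-mini in a zero-shot setting, using the Annotation, DoubleDown, and Perspective approaches.

Result: An F1 score of 0.8104 in the Italian task, ranking first; XLM-RoBERTa scores 0.7917 in the Romanian task. Performance in some settings falls below the baseline, reflecting the difficulty of generalizing to low-resource languages.

Insight: Fine-tuned models perform well in specific languages, while zero-shot settings still need improvement for low-resource languages.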

Abstract: This notebook reports the XplaiNLP submission to the CheckThat! 2025 shared task on multilingual subjectivity detection. We evaluate two approaches: (1) supervised fine-tuning of transformer encoders (EuroBERT, XLM-RoBERTa, and German-BERT) on monolingual and machine-translated training data; and (2) zero-shot prompting using two LLMs: o3-mini for Annotation (rule-based labelling) and gpt-4.1-mini for DoubleDown (contrastive rewriting) and Perspective (comparative reasoning). The Annotation approach achieves 1st place in the Italian monolingual subtask with an F1 score of 0.8104, outperforming the baseline of 0.6941. In the Romanian zero-shot setting, the fine-tuned XLM-RoBERTa model obtains an F1 score of 0.7917, ranking 3rd and exceeding the baseline of 0.6461. The same model also performs reliably in the multilingual task and improves over the baseline in Greek. For German, a German-BERT model fine-tuned on translated training data from typologically related languages yields competitive performance over the baseline. In contrast, performance in the Ukrainian and Polish zero-shot settings falls slightly below the respective baselines, reflecting the challenge of generalization in low-resource cross-lingual scenarios.

cs.CV [Back]

[42] A Real-Time Diminished Reality Approach to Privacy in MR Collaboration

Christian Fane

Main category: cs.CV

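TL;DR: This work presents a diminished reality (DR) system based on real-time inpainting to protect privacy in shared-space mixed reality (MR) meetings. The primary headset user can selectively remove sensitive items from their environment so that other participants cannot see them.

Details Motivation: Privacy is an important concern in shared-space MR collaboration. Conventional solutions may require a fixed viewpoint or a prior 3D scan of the environment, which limits flexibility and practicality.

Contribution: 1. A portable and robust real-time DR system that needs neither a fixed viewpoint nor a prior 3D scan; 2. Precise object removal through semantic segmentation and object selection; 3. High-quality video inpainting with a modified Decoupled Spatial-Temporal Transformer (DSTT) model.

Method: 1. Detect objects with YOLOv11; 2. Determine the removal target through semantic segmentation and precise selection; 3. Inpaint the video in real time with the modified DSTT model; 4. Implement the system on a ZED 2i depth camera.

Result: The system sustains more than 20 fps at 720p resolution, demonstrating that real-time DR is feasible for privacy-preserving MR applications.

Insight: Inpainting-based real-time DR can balance privacy requirements against system performance in MR collaboration, and semantic segmentation and spatial-temporal transformers are important to inpainting quality.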

Abstract: Diminished reality (DR) refers to the digital removal of real-world objects by compositing background content in their place. This thesis presents a real-time, inpainting-based DR system designed to enable privacy control in shared-space mixed reality (MR) meetings. The system allows a primary headset user to selectively remove personal or sensitive items from their environment, ensuring that those objects are no longer visible to other participants. Removal is achieved through semantic segmentation and precise object selection, followed by real-time inpainting from the viewpoint of a secondary observer, implemented using a mobile ZED 2i depth camera. The solution is designed to be portable and robust, requiring neither a fixed secondary viewpoint nor prior 3D scanning of the environment. The system utilises YOLOv11 for object detection and a modified Decoupled Spatial-Temporal Transformer (DSTT) model for high-quality video inpainting. At 720p resolution, the pipeline sustains frame rates exceeding 20 fps, demonstrating the feasibility of real-time diminished reality for practical privacy-preserving MR applications.

[43] SurgLaVi: Large-Scale Hierarchical Dataset for Surgical Vision-Language Representation Learning

Alejandra Perez,Chinedu Nwoye,Ramtin Raji Kermani,Omid Mohareri,Muhammad Abdullah Jamal

Main category: cs.CV

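TL;DR: SurgLaVi is a large-scale, diverse surgical vision-language dataset with nearly 240k clip-caption pairs from more than 200 procedures, organized hierarchically. An automated pipeline generates high-quality annotations, and dual-modality filtering removes noisy samples. SurgLaVi-β, its public version, is far larger than existing datasets. The SurgCLIP model built on it performs strongly across several tasks.

Details Motivation: Existing datasets for surgical vision-language pre-training (VLP) are small, lack diversity, and lack hierarchical structure, which limits model performance. A larger, higher-quality dataset is needed to advance VLP for surgery.

Contribution: 1. SurgLaVi, the largest and most diverse surgical vision-language dataset to date, with hierarchical annotations. 2. An automated pipeline for generating high-quality hierarchical annotations. 3. The publicly released dataset SurgLaVi-β. 4. The SurgCLIP model, which performs strongly across multiple tasks.

Method: 1. An automated annotation pipeline produces hierarchical clip-caption pairs (phase, step, and task level). 2. Dual-modality filtering removes noisy samples. 3. SurgCLIP uses CLIP-style dual encoders trained with video-text contrastive learning.

Result: SurgCLIP surpasses existing methods on phase, step, action, and tool recognition, in some cases by large margins, confirming that large-scale, high-quality datasets translate into better performance.

Insight: Dataset quality (scale and semantic richness) and hierarchical structure are critical to vision-language model performance, especially in surgery. Releasing a large public dataset should support the development of foundation models.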

Abstract: Vision-language pre-training (VLP) offers unique advantages for surgery by aligning language with surgical videos, enabling workflow understanding and transfer across tasks without relying on expert-labeled datasets. However, progress in surgical VLP remains constrained by the limited scale, procedural diversity, semantic quality, and hierarchical structure of existing datasets. In this work, we present SurgLaVi, the largest and most diverse surgical vision-language dataset to date, comprising nearly 240k clip-caption pairs from more than 200 procedures, organized into hierarchical levels (phase, step, and task). At the core of SurgLaVi lies a fully automated pipeline that systematically generates fine-grained transcriptions of surgical videos and segments them into coherent procedural units. To ensure high-quality annotations, it applies dual-modality filtering to remove irrelevant and noisy samples. Within this framework, the resulting captions are enriched with contextual detail, producing annotations that are both semantically rich and easy to interpret. To ensure accessibility, we release SurgLaVi-β, an open-source derivative of 113k clip-caption pairs constructed entirely from public data, which is over four times larger than existing surgical VLP datasets. To demonstrate the value of SurgLaVi datasets, we introduce SurgCLIP, a CLIP-style video-text contrastive framework with dual encoders, as a representative base model. SurgCLIP achieves consistent improvements across phase, step, action, and tool recognition, surpassing prior state-of-the-art methods, often by large margins. These results validate that large-scale, semantically rich, and hierarchically structured datasets directly translate into stronger and more generalizable representations, establishing SurgLaVi as a key resource for developing surgical foundation models.

[44] Building a General SimCLR Self-Supervised Foundation Model Across Neurological Diseases to Advance 3D Brain MRI Diagnoses

Emily Kaczmarek,Justin Szeto,Brennan Nichyporuk,Tal Arbel

Main category: cs.CV

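TL;DR: This paper presents a SimCLR-based self-supervised learning (SSL) foundation model for 3D brain MRI analysis. The model is pre-trained on datasets covering a range of neurological diseases and performs strongly on several downstream tasks.

Details Motivation: 3D brain MRI is widely used in clinical diagnosis, but existing deep learning models are usually task-specific and generalize poorly. Self-supervised learning (SSL) can address this limitation, yet current foundation models for 3D brain MRI remain limited in resolution, scope, or accessibility.

Contribution: Proposes a general, high-resolution SimCLR SSL foundation model pre-trained on 11 public datasets. The model outperforms the compared methods across tasks and remains effective when labeled data is scarce.

Method: Self-supervised learning based on the SimCLR framework, with pre-training data comprising 44,958 scans from 18,759 patients across diverse neurological diseases. The model is compared with Masked Autoencoders (MAE) and supervised baselines.

Result: On four downstream tasks, in both in-distribution and out-of-distribution settings, the fine-tuned SimCLR model performs best, and it retains superior performance when predicting Alzheimer's disease with only 20% of the labeled training samples.

Insight: An SSL foundation model pre-trained on large, diverse datasets can substantially improve 3D brain MRI tasks, even when labeled data is scarce, making it a useful tool for clinical diagnosis.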

Abstract: 3D structural Magnetic Resonance Imaging (MRI) brain scans are commonly acquired in clinical settings to monitor a wide range of neurological conditions, including neurodegenerative disorders and stroke. While deep learning models have shown promising results analyzing 3D MRI across a number of brain imaging tasks, most are highly tailored for specific tasks with limited labeled data, and are not able to generalize across tasks and/or populations. The development of self-supervised learning (SSL) has enabled the creation of large medical foundation models that leverage diverse, unlabeled datasets ranging from healthy to diseased data, showing significant success in 2D medical imaging applications. However, even the very few foundation models for 3D brain MRI that have been developed remain limited in resolution, scope, or accessibility. In this work, we present a general, high-resolution SimCLR-based SSL foundation model for 3D brain structural MRI, pre-trained on 18,759 patients (44,958 scans) from 11 publicly available datasets spanning diverse neurological diseases. We compare our model to Masked Autoencoders (MAE), as well as two supervised baselines, on four diverse downstream prediction tasks in both in-distribution and out-of-distribution settings. Our fine-tuned SimCLR model outperforms all other models across all tasks. Notably, our model still achieves superior performance when fine-tuned using only 20% of labeled training samples for predicting Alzheimer’s disease. We use publicly available code and data, and release our trained model at https://github.com/emilykaczmarek/3D-Neuro-SimCLR, contributing a broadly applicable and accessible foundation model for clinical brain MRI analysis.
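
SimCLR pre-training, which this abstract builds on, optimizes the NT-Xent contrastive loss: two augmented views of the same scan should be more similar to each other than to any other scan in the batch. The following is a minimal pure-Python sketch of that loss for small embedding lists. The pairing convention (views of sample k stored at indices 2k and 2k+1) and the temperature value are assumptions made for illustration; the paper's augmentations and encoder are not shown.

```python
import math
from typing import List, Sequence


def _normalize(vector: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def nt_xent(embeddings: Sequence[Sequence[float]], temperature: float = 0.5) -> float:
    """NT-Xent loss; embeddings[2k] and embeddings[2k + 1] are two views of sample k."""
    z = [_normalize(v) for v in embeddings]
    n = len(z)
    total = 0.0
    for i in range(n):
        positive = i ^ 1  # index of the other view of the same sample
        # Temperature-scaled cosine similarities to every embedding in the batch.
        sims = [sum(a * b for a, b in zip(z[i], z[k])) / temperature for k in range(n)]
        # The denominator runs over all other embeddings, positive included.
        denominator = sum(math.exp(sims[k]) for k in range(n) if k != i)
        total += -math.log(math.exp(sims[positive]) / denominator)
    return total / n


# Two toy batches: views that agree within each pair, and views that do not.
aligned_views = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
misaligned_views = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

The loss is low when paired views agree and high when a view is closer to another sample than to its own partner, which is what pushes the encoder toward augmentation-invariant features.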

[45] A Comparison and Evaluation of Fine-tuned Convolutional Neural Networks to Large Language Models for Image Classification and Segmentation of Brain Tumors on MRI

Felicia Liu,Jay J. Yoo,Farzad Khalvati

Main category: cs.CV

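TL;DR: This paper compares fine-tuned convolutional neural networks (CNNs) with large language models (LLMs) on classification and segmentation of brain tumors in MRI and finds that CNNs outperform LLMs on both tasks.

Details Motivation: Large language models (LLMs) perform well on text-based healthcare tasks, but their use in image-based tasks has not been fully explored. This paper examines how effective LLMs are for medical imaging tasks.

Contribution: The main contribution is a systematic evaluation of LLMs on brain tumor classification and segmentation, with a comparison against traditional CNNs.

Method: Using the BraTS 2020 dataset, a general-purpose vision-language LLM (LLaMA 3.2 Instruct) is evaluated before and after fine-tuning and benchmarked against 3D CNNs.

Result: The CNN reaches 80% accuracy on classification, against 76% for the LLM. On segmentation the CNN is stable, while the LLM lacks spatial understanding and gains little from fine-tuning.

Insight: LLMs perform poorly on medical imaging tasks and may need more rigorous fine-tuning or alternative training strategies to improve.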

Abstract: Large Language Models (LLMs) have shown strong performance in text-based healthcare tasks. However, their utility in image-based applications remains unexplored. We investigate the effectiveness of LLMs for medical imaging tasks, specifically glioma classification and segmentation, and compare their performance to that of traditional convolutional neural networks (CNNs). Using the BraTS 2020 dataset of multi-modal brain MRIs, we evaluated a general-purpose vision-language LLM (LLaMA 3.2 Instruct) both before and after fine-tuning, and benchmarked its performance against custom 3D CNNs. For glioma classification (Low-Grade vs. High-Grade), the CNN achieved 80% accuracy and balanced precision and recall. The general LLM reached 76% accuracy but suffered from a specificity of only 18%, often misclassifying Low-Grade tumors. Fine-tuning improved specificity to 55%, but overall performance declined (e.g., accuracy dropped to 72%). For segmentation, three methods (center point, bounding box, and polygon extraction) were implemented. CNNs accurately localized gliomas, though small tumors were sometimes missed. In contrast, LLMs consistently clustered predictions near the image center, with no distinction of glioma size, location, or placement. Fine-tuning improved output formatting but failed to meaningfully enhance spatial accuracy. The bounding polygon method yielded random, unstructured outputs. Overall, CNNs outperformed LLMs in both tasks. LLMs showed limited spatial understanding and minimal improvement from fine-tuning, indicating that, in their current form, they are not well-suited for image-based tasks. More rigorous fine-tuning or alternative training strategies may be needed for LLMs to achieve better performance, robustness, and utility in the medical space.

[46] Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation

Hao Zhang,Chun-Han Yao,Simon Donné,Narendra Ahuja,Varun Jampani

Main category: cs.CV

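TL;DR: SP4D is a framework for generating paired RGB and kinematic part videos. A dual-branch diffusion model jointly synthesizes RGB frames and part segmentation maps, producing parts that are consistent across views and over time.

Details Motivation: Conventional part segmentation methods rely on appearance-based semantic cues, whereas SP4D aims to generate part segmentations that align with object articulation and stay consistent across views and time.

Contribution: 1. Proposes the SP4D framework, which jointly generates RGB and kinematic part videos; 2. Introduces a spatial color encoding scheme that simplifies the architecture and supports varying part counts; 3. Designs the BiDiFuse module and a contrastive loss to strengthen cross-branch consistency; 4. Builds the KinematicParts20K dataset for training and evaluation.

Method: 1. A dual-branch diffusion model jointly generates RGB frames and part segmentation maps; 2. Spatial color encoding maps part masks to continuous RGB-like images; 3. The BiDiFuse module and a contrastive loss improve multi-view and temporal consistency.

Result: SP4D generalizes to diverse scenarios (real-world videos, generated objects, rare poses), and the generated 2D part maps can be lifted to 3D skeletal structures and skinning weights.

Insight: 1. Kinematic part generation supports animation and motion-related tasks; 2. Color encoding allows part segmentation to be integrated efficiently with RGB generation.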

Abstract: We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods that rely on appearance-based semantic cues, SP4D learns to produce kinematic parts - structural components aligned with object articulation and consistent across views and time. SP4D adopts a dual-branch diffusion model that jointly synthesizes RGB frames and corresponding part segmentation maps. To simplify the architecture and flexibly enable different part counts, we introduce a spatial color encoding scheme that maps part masks to continuous RGB-like images. This encoding allows the segmentation branch to share the latent VAE from the RGB branch, while enabling part segmentation to be recovered via straightforward post-processing. A Bidirectional Diffusion Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a contrastive part consistency loss to promote spatial and temporal alignment of part predictions. We demonstrate that the generated 2D part maps can be lifted to 3D to derive skeletal structures and harmonic skinning weights with few manual adjustments. To train and evaluate SP4D, we construct KinematicParts20K, a curated dataset of over 20K rigged objects selected and processed from Objaverse XL (Deitke et al., 2023), each paired with multi-view RGB and part video sequences. Experiments show that SP4D generalizes strongly to diverse scenarios, including real-world videos, novel generated objects, and rare articulated poses, producing kinematic-aware outputs suitable for downstream animation and motion-related tasks.
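
The idea of mapping part masks to continuous RGB-like images and recovering them by post-processing can be illustrated with a toy sketch. It assumes that each part's color is derived from its normalized 2D centroid and that decoding picks the nearest palette color. Both are illustrative assumptions: the abstract does not specify the encoding, and a real decoder would not have the palette and would need to cluster colors instead.

```python
from typing import Dict, List, Tuple

Color = Tuple[float, float, float]


def encode_parts(labels: List[List[int]]) -> Tuple[List[List[Color]], Dict[int, Color]]:
    """Turn an integer part map into an RGB-like image (assumed centroid colouring)."""
    height, width = len(labels), len(labels[0])
    sums: Dict[int, Tuple[float, float, int]] = {}
    for y, row in enumerate(labels):
        for x, part in enumerate(row):
            sy, sx, count = sums.get(part, (0.0, 0.0, 0))
            sums[part] = (sy + y, sx + x, count + 1)
    # Colour = (normalised centroid x, normalised centroid y, constant third channel).
    palette = {
        part: (sx / count / max(width - 1, 1), sy / count / max(height - 1, 1), 0.5)
        for part, (sy, sx, count) in sums.items()
    }
    image = [[palette[part] for part in row] for row in labels]
    return image, palette


def decode_parts(image: List[List[Color]], palette: Dict[int, Color]) -> List[List[int]]:
    """Recover part ids by snapping each pixel to the nearest palette colour."""
    def nearest(pixel: Color) -> int:
        return min(palette, key=lambda part: sum((a - b) ** 2
                                                  for a, b in zip(pixel, palette[part])))
    return [[nearest(pixel) for pixel in row] for row in image]
```

A continuous color image has the same shape whatever the number of parts, which fits the abstract's point about flexibly supporting different part counts. This toy version would fail for parts that share a centroid, one reason a real scheme needs a more careful color assignment.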

[47] SegSLR: Promptable Video Segmentation for Isolated Sign Language Recognition

Sven Schreiber,Noha Sarhan,Simone Frintrop,Christian Wilms

Main category: cs.CV

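TL;DR: SegSLR combines RGB and pose information through promptable zero-shot video segmentation for isolated sign language recognition (ISLR), improving performance by focusing on the hands and body.

Details Motivation: Existing ISLR methods rely mainly on RGB or pose information, and combining the two often loses crucial details such as hand shape and orientation because coarse representations such as bounding boxes are imprecise.

Contribution: Proposes the SegSLR system, which combines RGB and pose information through promptable video segmentation, preserving key shape information and focusing processing on the relevant body parts.

Method: Pose information gives a rough localization of the hands and body; zero-shot video segmentation extracts these parts; RGB processing then concentrates on these key regions.

Result: SegSLR outperforms existing methods on the ChaLearn249 IsoGD dataset, and ablation studies confirm the benefit of focusing on the hands and body.

Insight: When combining modalities, precise segmentation of key regions can markedly improve ISLR performance and avoids the information loss caused by coarse representations.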

Abstract: Isolated Sign Language Recognition (ISLR) approaches primarily rely on RGB data or signer pose information. However, combining these modalities often results in the loss of crucial details, such as hand shape and orientation, due to imprecise representations like bounding boxes. Therefore, we propose the ISLR system SegSLR, which combines RGB and pose information through promptable zero-shot video segmentation. Given the rough localization of the hands and the signer’s body from pose information, we segment the respective parts through the video to maintain all relevant shape information. Subsequently, the segmentations focus the processing of the RGB data on the most relevant body parts for ISLR. This effectively combines RGB and pose information. Our evaluation on the complex ChaLearn249 IsoGD dataset shows that SegSLR outperforms state-of-the-art methods. Furthermore, ablation studies indicate that SegSLR strongly benefits from focusing on the signer’s body and hands, justifying our design choices.

[48] SCOPE: Speech-guided COllaborative PErception Framework for Surgical Scene Segmentation

Jecia Z. Y. Mao,Francis X Creighton,Russell H Taylor,Manish Sahu

Main category: cs.CV

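TL;DR: The SCOPE framework integrates a large language model (LLM) with open-set vision foundation models (VFMs) and uses speech guidance to segment and label surgical scenes on the fly, without relying on domain-specific data or manual prompts.

Details Motivation: Existing surgical scene segmentation methods depend on domain-specific labeled data and predefined labels, so they adapt poorly to new scenarios or unfamiliar categories. Combining speech interaction with open-set models can improve responsiveness and flexibility.

Contribution: 1. Proposes the SCOPE framework, which integrates an LLM with VFMs for on-the-fly segmentation and labeling of surgical scenes; 2. Introduces speech feedback for natural human-machine collaboration; 3. Demonstrates its potential experimentally in realistic surgical settings.

Method: 1. Use a VFM to generate initial segmentation candidates; 2. Refine the segmentation through clinician speech feedback interpreted by the LLM; 3. Use surgical instruments as interactive pointers to label other scene elements.

Result: Experiments on a subset of Cataract1k and an in-house dataset show that SCOPE can segment and track the surgical scene on the fly, and a live mock experiment demonstrates its flexibility.

Insight: Combining speech guidance with open-set models offers a solution for dynamic surgical environments that does not require labeled data and illustrates the potential of human-machine collaboration in medicine.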

Abstract: Accurate segmentation and tracking of relevant elements of the surgical scene is crucial to enable context-aware intraoperative assistance and decision making. Current solutions remain tethered to domain-specific, supervised models that rely on labeled data and require domain-specific data to adapt to new surgical scenarios and to go beyond predefined label categories. Recent advances in prompt-driven vision foundation models (VFMs) have enabled open-set, zero-shot segmentation across heterogeneous medical images. However, the dependence of these models on manual visual or textual cues restricts their deployment in intraoperative surgical settings. We introduce a speech-guided collaborative perception (SCOPE) framework that integrates the reasoning capabilities of a large language model (LLM) with the perception capabilities of open-set VFMs to support on-the-fly segmentation, labeling and tracking of surgical instruments and anatomy in intraoperative video streams. A key component of this framework is a collaborative perception agent, which generates top candidates of VFM-generated segmentation and incorporates intuitive speech feedback from clinicians to guide the segmentation of surgical instruments in a natural human-machine collaboration paradigm. Afterwards, instruments themselves serve as interactive pointers to label additional elements of the surgical scene. We evaluated our proposed framework on a subset of the publicly available Cataract1k dataset and an in-house ex-vivo skull-base dataset to demonstrate its potential to generate on-the-fly segmentation and tracking of the surgical scene. Furthermore, we demonstrate its dynamic capabilities through a live mock ex-vivo experiment. This human-AI collaboration paradigm showcases the potential of developing adaptable, hands-free, surgeon-centric tools for dynamic operating-room environments.

[49] Every Camera Effect, Every Time, All at Once: 4D Gaussian Ray Tracing for Physics-based Camera Effect Data Generation

Yi-Ruei Liu,You-Zhe Xie,Yu-Hsiang Hsu,I-Sheng Fang,Yu-Lun Liu,Jun-Cheng Chen

Main category: cs.CV

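TL;DR: This paper proposes 4D Gaussian Ray Tracing (4D-GRT), a two-stage method for generating training data with physically accurate camera effects. By combining 4D Gaussian Splatting with physically-based ray tracing, it reconstructs dynamic scenes from multi-view videos and renders controllable camera effects efficiently.

Details Motivation: Computer vision systems usually assume an ideal pinhole camera and cannot cope with real-world camera effects such as fisheye distortion and rolling shutter, mainly because training data with camera effects is lacking. Existing data generation methods are costly, suffer from sim-to-real gaps, or fail to model camera effects accurately.

Contribution: 1. Proposes 4D Gaussian Ray Tracing (4D-GRT), which combines 4D Gaussian Splatting and ray tracing to quickly generate training data with accurate camera effects; 2. Builds a benchmark of eight dynamic scenes covering four camera effects for evaluating generated videos.

Method: Two stages: 1. Reconstruct dynamic scenes from multi-view videos with 4D Gaussian Splatting; 2. Apply physically-based ray tracing to add controllable camera effects to the reconstructed scenes.

Result: 4D-GRT achieves the fastest rendering speed, with rendering quality comparable to or better than existing baselines.

Insight: The method offers a low-cost, efficient way to narrow the sim-to-real gap and provides higher-quality training data for computer vision systems deployed in real-world settings.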

Abstract: Common computer vision systems typically assume ideal pinhole cameras but fail when facing real-world camera effects such as fisheye distortion and rolling shutter, mainly due to the lack of learning from training data with camera effects. Existing data generation approaches suffer from high costs or sim-to-real gaps, or fail to accurately model camera effects. To address this bottleneck, we propose 4D Gaussian Ray Tracing (4D-GRT), a novel two-stage pipeline that combines 4D Gaussian Splatting with physically-based ray tracing for camera effect simulation. Given multi-view videos, 4D-GRT first reconstructs dynamic scenes, then applies ray tracing to generate videos with controllable, physically accurate camera effects. 4D-GRT achieves the fastest rendering speed while achieving better or comparable rendering quality compared to existing baselines. Additionally, we construct eight synthetic dynamic scenes in indoor environments across four camera effects as a benchmark to evaluate generated videos with camera effects.

[50] EditDuet: A Multi-Agent System for Video Non-Linear Editing

Marcelo Sandoval-Castaneda,Bryan Russell,Josef Sivic,Gregory Shakhnarovich,Fabian Caba Heilbron

Main category: cs.CV

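TL;DR: EditDuet is a multi-agent system in which an Editor agent and a Critic agent cooperate to perform non-linear video editing from natural language instructions, substantially outperforming existing approaches.

Details Motivation: Existing video editing tools focus mainly on retrieval or user interfaces and leave the actual editing to the user. This paper aims to automate that core task to improve efficiency and editing quality.

Contribution: Proposes the multi-agent system EditDuet, in which Editor and Critic agents collaborate on language-driven video editing, and introduces an LLM-based evaluation method.

Method: Two agents are designed: the Editor edits the video according to natural language instructions, and the Critic gives feedback or confirms the result. A learning-based approach improves communication between the agents.

Result: In a user study, the system substantially outperforms existing approaches in coverage, time constraint satisfaction, and human preference.

Insight: Multi-agent collaboration with natural language feedback offers a new approach to automated video editing, and LLM-based evaluation can complement human judgment.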

Abstract: Automated tools for video editing and assembly have applications ranging from filmmaking and advertisement to content creation for social media. Previous video editing work has mainly focused on either retrieval or user interfaces, leaving actual editing to the user. In contrast, we propose to automate the core task of video editing, formulating it as a sequential decision-making process. Ours is a multi-agent approach. We design an Editor agent and a Critic agent. The Editor takes as input a collection of video clips together with natural language instructions and uses tools commonly found in video editing software to produce an edited sequence. On the other hand, the Critic gives natural language feedback to the Editor based on the produced sequence or renders it if it is satisfactory. We introduce a learning-based approach for enabling effective communication across specialized agents to address the language-driven video editing task. Finally, we explore an LLM-as-a-judge metric for evaluating the quality of a video editing system and compare it with general human preference. We evaluate our system’s output video sequences qualitatively and quantitatively through a user study and find that our system vastly outperforms existing approaches in terms of coverage, time constraint satisfaction, and human preference.
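
The Editor-Critic interaction described in this abstract is essentially a feedback loop: the Editor proposes a sequence, the Critic either returns feedback or approves it. A minimal sketch of that control flow follows. The function signatures, the `max_rounds` cap, and the toy duration-based agents are hypothetical; in the paper both roles are language-model agents using editing tools, which is not reproduced here.

```python
from typing import Callable, Dict, List, Optional, Tuple

Clip = Tuple[str, int]  # (clip name, duration in seconds)
Editor = Callable[[List[Clip], Dict, List[Clip], Optional[str]], List[Clip]]
Critic = Callable[[List[Clip], Dict], Optional[str]]


def edit_loop(editor: Editor, critic: Critic, clips: List[Clip],
              instruction: Dict, max_rounds: int = 5) -> List[Clip]:
    """Alternate Editor proposals and Critic feedback until approval or the round cap."""
    sequence: List[Clip] = []
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        sequence = editor(clips, instruction, sequence, feedback)
        feedback = critic(sequence, instruction)
        if feedback is None:  # None means the Critic accepts the sequence
            break
    return sequence


def toy_editor(clips, instruction, sequence, feedback):
    """Hypothetical Editor: start with every clip, trim the last one when told to."""
    if feedback is None:
        return list(clips)
    if feedback == "too long":
        return sequence[:-1]
    return sequence


def toy_critic(sequence, instruction):
    """Hypothetical Critic: checks only a total-duration constraint."""
    total = sum(duration for _, duration in sequence)
    return "too long" if total > instruction["max_seconds"] else None
```

Calling `edit_loop(toy_editor, toy_critic, clips, {"max_seconds": 10})` runs the loop on a clip list; the round cap guards against an Editor and Critic that never converge.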

[51] Enhancement Without Contrast: Stability-Aware Multicenter Machine Learning for Glioma MRI Imaging

Sajad Amiri,Shahram Taeb,Sara Gharibi,Setareh Dehghanfard,Somayeh Sadat Mehrnia,Mehrdad Oveisi,Ilker Hacihaliloglu,Arman Rahmim,Mohammad R. Salmanpour

Main category: cs.CV

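TL;DR: The paper proposes a stability-aware machine learning framework that predicts glioma contrast enhancement from non-contrast MRI, reducing reliance on gadolinium-based contrast agents. Validated on four multicenter datasets, the approach shows high accuracy and stability.

Details Motivation: Gadolinium-based contrast agents are important in glioma imaging but raise safety, cost, and accessibility concerns. Predicting contrast enhancement from non-contrast MRI with machine learning is a safer alternative. However, scanner and cohort variability hinder robust model selection.

Contribution: Proposes a stability-aware machine learning framework for selecting reproducible models across multicenter datasets to predict contrast enhancement in glioma MRI.

Method: 108 features are extracted from non-contrast T1-weighted images (T1WI) and combined with 48 dimensionality reduction methods and 25 classifiers, yielding 1,200 model pipelines. Performance is assessed with rotational validation (train on three datasets, test on the fourth).

Result: Cross-validation accuracy ranges from 0.91 to 0.96, and external test accuracy from 0.87 to 0.98 (mean 0.93). F1, precision, and recall are stable (0.87 to 0.96), while ROC-AUC varies more widely (0.50 to 0.82). The pipeline combining MI with ETr performs best.

Insight: The framework shows that stability-aware model selection can improve cross-center generalization while reducing reliance on gadolinium-based contrast agents, and it offers a template for reproducible machine learning in neuro-oncology and beyond.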

Abstract: Gadolinium-based contrast agents (GBCAs) are central to glioma imaging but raise safety, cost, and accessibility concerns. Predicting contrast enhancement from non-contrast MRI using machine learning (ML) offers a safer alternative, as enhancement reflects tumor aggressiveness and informs treatment planning. Yet scanner and cohort variability hinder robust model selection. We propose a stability-aware framework to identify reproducible ML pipelines for multicenter prediction of glioma MRI contrast enhancement. We analyzed 1,446 glioma cases from four TCIA datasets (UCSF-PDGM, UPENN-GB, BRATS-Africa, BRATS-TCGA-LGG). Non-contrast T1WI served as input, with enhancement derived from paired post-contrast T1WI. Using PyRadiomics under IBSI standards, 108 features were extracted and combined with 48 dimensionality reduction methods and 25 classifiers, yielding 1,200 pipelines. Rotational validation was trained on three datasets and tested on the fourth. Cross-validation prediction accuracies ranged from 0.91 to 0.96, with external testing achieving 0.87 (UCSF-PDGM), 0.98 (UPENN-GB), and 0.95 (BRATS-Africa), with an average of 0.93. F1, precision, and recall were stable (0.87 to 0.96), while ROC-AUC varied more widely (0.50 to 0.82), reflecting cohort heterogeneity. The MI linked with ETr pipeline consistently ranked highest, balancing accuracy and stability. This framework demonstrates that stability-aware model selection enables reliable prediction of contrast enhancement from non-contrast glioma MRI, reducing reliance on GBCAs and improving generalizability across centers. It provides a scalable template for reproducible ML in neuro-oncology and beyond.
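
The pipeline count and the rotational validation described above can be illustrated with a short sketch. The reducer and classifier names are hypothetical placeholders (the abstract gives only the counts), and the sketch assumes every dataset is held out once in turn; the abstract reports external accuracies for three of the four.

```python
from itertools import product

# Hypothetical placeholder names; the abstract states only the counts
# (48 dimensionality reduction methods, 25 classifiers).
reducers = [f"reducer_{i}" for i in range(48)]
classifiers = [f"classifier_{j}" for j in range(25)]

# Every (reducer, classifier) pair is one candidate pipeline: 48 * 25 = 1200.
pipelines = list(product(reducers, classifiers))

datasets = ["UCSF-PDGM", "UPENN-GB", "BRATS-Africa", "BRATS-TCGA-LGG"]


def rotational_splits(names):
    """Yield (train_names, test_name), holding out each dataset once."""
    for held_out in names:
        train = [n for n in names if n != held_out]
        yield train, held_out


splits = list(rotational_splits(datasets))
```

Each pipeline would then be fitted on the three training cohorts of a split and scored on the held-out cohort, so that ranking can take into account how much a pipeline's score varies across rotations.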

[52] Group Evidence Matters: Tiling-based Semantic Gating for Dense Object Detection

Yilun Xiao

Main category: cs.CV

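TL;DR: The paper proposes a detector-agnostic post-processing framework that validates group evidence from overlapping tiles, substantially raising recall for dense small-object detection while explicitly adopting a recall-first strategy.

Details Motivation: In UAV imagery, dense small objects are poorly detected because of long-range viewpoints, occlusion, and clutter; a method that improves recall without retraining is needed.

Contribution: 1. A detector-agnostic post-processing framework; 2. validation of group evidence through overlapping tiling and spatial and semantic gates; 3. a marked recall gain (+0.093).

Method: 1. Overlapping tiling recovers low-confidence candidates; 2. a Spatial Gate (DBSCAN clustering of box centroids) and a Semantic Gate (DBSCAN on ResNet-18 embeddings) validate group evidence; 3. validated groups receive confidence reweighting, followed by class-aware NMS fusion.

Result: On the VisDrone dataset, recall rises from 0.685 to 0.778, precision changes from 0.801 to 0.595, F1 is 0.669, and post-processing latency is 0.095 s per image.

Insight: Overlapping tiling exposes missed objects, spatial clustering stabilizes geometry, semantic clustering enforces appearance coherence, and reweighting provides calibrated integration with the baseline. The recall-first strategy suits applications such as far-field counting and monitoring.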

Abstract: Dense small objects in UAV imagery are often missed due to long-range viewpoints, occlusion, and clutter. This paper presents a detector-agnostic post-processing framework that converts overlap-induced redundancy into group evidence. Overlapping tiling first recovers low-confidence candidates. A Spatial Gate (DBSCAN on box centroids) and a Semantic Gate (DBSCAN on ResNet-18 embeddings) then validate group evidence. Validated groups receive controlled confidence reweighting before class-aware NMS fusion. Experiments on VisDrone show a recall increase from 0.685 to 0.778 (+0.093) and a precision adjustment from 0.801 to 0.595, yielding F1=0.669. Post-processing latency averages 0.095 s per image. These results indicate recall-first, precision-trade-off behavior that benefits recall-sensitive applications such as far-field counting and monitoring. Ablation confirms that tiling exposes missed objects, spatial clustering stabilizes geometry, semantic clustering enforces appearance coherence, and reweighting provides calibrated integration with the baseline. The framework requires no retraining and integrates with modern detectors. Future work will reduce semantic gating cost and extend the approach with temporal cues.
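
The overlapping tiling step can be sketched as follows. This is only an illustration: the tile size, the overlap, and the function names are assumptions, since the abstract does not give the tiling parameters.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of tiles of size `tile` covering [0, length)."""
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the tile size")
    if tile >= length:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the border
    return starts


def overlapping_tiles(width, height, tile=640, overlap=128):
    """Return (x0, y0, x1, y1) boxes; tile and overlap values are assumed."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]


tiles = overlapping_tiles(1000, 640)
```

Because neighbouring tiles share a band of pixels, an object in that band is detected more than once; the framework treats this redundancy as group evidence rather than discarding it immediately.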

[53] InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts

Weipeng Zhong,Peizhou Cao,Yichen Jin,Li Luo,Wenzhe Cai,Jingli Lin,Hanqing Wang,Zhaoyang Lyu,Tai Wang,Bo Dai,Xudong Xu,Jiangmiao Pang

Main category: cs.CV

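TL;DR: The paper presents InternScenes, a large-scale simulatable indoor scene dataset that integrates several scene sources (real-world scans, procedurally generated scenes, and designer-created scenes) to address the limits of existing datasets in scale, diversity, and layout realism. It contains about 40,000 scenes and 1.96M 3D objects, covers 15 scene types and 288 object classes, and has complex layouts (41.5 objects per region on average). The data processing pipeline ensures simulatability, interactivity, and collision-free scenes. Experiments show its value for scene layout generation and point-goal navigation.

Details Motivation: Existing 3D scene datasets are limited in scale, diversity, or layout realism, for example lacking small items or suffering from severe object collisions, which holds back Embodied AI.

Contribution: Introduces the InternScenes dataset, which integrates several scene sources, covers diverse scenes at large scale, preserves complex layouts and realistic detail, and provides simulatable, interactive 3D scene data.

Method: Real-world scans, procedurally generated scenes, and designer-created scenes are combined, and a data processing pipeline ensures simulatability (through physical simulation), adds interactivity (interactive objects), and resolves collisions.

Result: The dataset contains about 40,000 scenes and 1.96M 3D objects, supports scene layout generation and point-goal navigation, and reveals new challenges posed by complex scenes.

Insight: Complex, realistic layouts place higher demands on model training, but InternScenes makes scaled-up training possible for tasks such as layout generation and navigation.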

Abstract: The advancement of Embodied AI heavily relies on large-scale, simulatable 3D scene datasets characterized by scene diversity and realistic layouts. However, existing datasets typically suffer from limitations in data scale or diversity, sanitized layouts lacking small items, and severe object collisions. To address these shortcomings, we introduce InternScenes, a novel large-scale simulatable indoor scene dataset comprising approximately 40,000 diverse scenes by integrating three disparate scene sources, real-world scans, procedurally generated scenes, and designer-created scenes, including 1.96M 3D objects and covering 15 common scene types and 288 object classes. We particularly preserve massive small items in the scenes, resulting in realistic and complex layouts with an average of 41.5 objects per region. Our comprehensive data processing pipeline ensures simulatability by creating real-to-sim replicas for real-world scans, enhances interactivity by incorporating interactive objects into these scenes, and resolves object collisions by physical simulations. We demonstrate the value of InternScenes with two benchmark applications: scene layout generation and point-goal navigation. Both show the new challenges posed by the complex and realistic layouts. More importantly, InternScenes paves the way for scaling up the model training for both tasks, making the generation and navigation in such complex scenes possible. We commit to open-sourcing the data, models, and benchmarks to benefit the whole community.

[54] Well-Conditioned Polynomial Representations for Mathematical Handwriting Recognition

Robert M. Corless,Deepak Singh Kalhan,Stephen M. Watt

Main category: cs.CV

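TL;DR: The paper examines the condition number and computational efficiency of Legendre and Chebyshev polynomial bases for mathematical handwriting recognition, analyzing how basis choice and polynomial degree affect modeling accuracy and computational cost.

Details Motivation: Mathematical handwriting recognition needs an efficient geometric representation, and earlier work mainly used Legendre or Legendre-Sobolev basis polynomials. This paper explores how different polynomial bases (such as Chebyshev) and their condition numbers affect modeling accuracy and computational cost.

Contribution: Compares Legendre, Legendre-Sobolev, Chebyshev, and Chebyshev-Sobolev basis polynomials for mathematical handwriting recognition and analyzes the trade-off between conditioning and computational efficiency.

Method: Analyzes the condition number for polynomial evaluation in these bases and the norms given by their inner products, to study how basis choice and polynomial degree affect modeling accuracy.

Result: Choosing an appropriate basis and polynomial degree can lower computational cost while retaining high accuracy.

Insight: The condition number is an important criterion for choosing the polynomial basis and degree; the Legendre basis may be preferable in some cases, while the Chebyshev basis shows promise for computational efficiency.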

Abstract: Previous work has made use of a parameterized plane curve polynomial representation for mathematical handwriting, with the polynomials represented in a Legendre or Legendre-Sobolev graded basis. This provides a compact geometric representation for the digital ink. Preliminary results have also been shown for Chebyshev and Chebyshev-Sobolev bases. This article explores the trade-offs between basis choice and polynomial degree to achieve accurate modeling with a low computational cost. To do this, we consider the condition number for polynomial evaluation in these bases and bound how the various inner products give norms for the variations between symbols.
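
To make "polynomial evaluation in these bases" concrete, here is a minimal sketch of evaluating a series in the Chebyshev basis with Clenshaw's recurrence. This is a standard textbook technique shown for illustration; the abstract does not say which evaluation scheme the authors analyze, and the function name is an assumption.

```python
def chebyshev_eval(coeffs, x):
    """Evaluate sum_k coeffs[k] * T_k(x) with Clenshaw's recurrence.

    T_k is the Chebyshev polynomial of the first kind; x is expected
    to lie in [-1, 1], the interval on which the basis is orthogonal.
    """
    if not coeffs:
        return 0.0
    b1 = b2 = 0.0
    # b_k = c_k + 2*x*b_{k+1} - b_{k+2}, run from the highest degree down to 1.
    for c in reversed(coeffs[1:]):
        b1, b2 = c + 2.0 * x * b1 - b2, b1
    return coeffs[0] + x * b1 - b2
```

A pen stroke represented as a parametric plane curve would use one coefficient list for x(t) and another for y(t), each evaluated this way at the same parameter value t.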

[55] OpenUrban3D: Annotation-Free Open-Vocabulary Semantic Segmentation of Large-Scale Urban Point Clouds

Chongyu Wang,Kunlei Jing,Jihua Zhu,Di Wang

Main category: cs.CV

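TL;DR: OpenUrban3D is an annotation-free open-vocabulary semantic segmentation framework for large-scale urban point clouds that overcomes the lack of multi-view imagery and the poor generalization of conventional 3D segmentation methods.

Details Motivation: Open-vocabulary semantic segmentation matters for large-scale urban point clouds (for example in digital twins and smart cities), yet the area is under-explored because high-quality multi-view imagery is often missing and existing 3D segmentation methods generalize poorly.

Contribution: Proposes OpenUrban3D, the first 3D open-vocabulary semantic segmentation framework that needs no aligned multi-view images, pre-trained point cloud segmentation networks, or manual annotations, achieving zero-shot segmentation through multi-view, multi-granularity rendering and vision-language feature extraction.

Method: Multi-view, multi-granularity rendering, mask-level vision-language feature extraction, and sample-balanced fusion produce robust semantic features directly from raw point clouds; these are then distilled into a 3D backbone model for segmentation.

Result: On large-scale urban benchmarks (SensatUrban and SUM), OpenUrban3D significantly outperforms existing methods in segmentation accuracy and cross-scene generalization.

Insight: OpenUrban3D shows that segmentation need not depend on conventional annotations or multi-view data, offering a scalable solution for 3D urban scene understanding.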

Abstract: Open-vocabulary semantic segmentation enables models to recognize and segment objects from arbitrary natural language descriptions, offering the flexibility to handle novel, fine-grained, or functionally defined categories beyond fixed label sets. While this capability is crucial for large-scale urban point clouds that support applications such as digital twins, smart city management, and urban analytics, it remains largely unexplored in this domain. The main obstacles are the frequent absence of high-quality, well-aligned multi-view imagery in large-scale urban point cloud datasets and the poor generalization of existing three-dimensional (3D) segmentation pipelines across diverse urban environments with substantial variation in geometry, scale, and appearance. To address these challenges, we present OpenUrban3D, the first 3D open-vocabulary semantic segmentation framework for large-scale urban scenes that operates without aligned multi-view images, pre-trained point cloud segmentation networks, or manual annotations. Our approach generates robust semantic features directly from raw point clouds through multi-view, multi-granularity rendering, mask-level vision-language feature extraction, and sample-balanced fusion, followed by distillation into a 3D backbone model. This design enables zero-shot segmentation for arbitrary text queries while capturing both semantic richness and geometric priors. Extensive experiments on large-scale urban benchmarks, including SensatUrban and SUM, show that OpenUrban3D achieves significant improvements in both segmentation accuracy and cross-scene generalization over existing methods, demonstrating its potential as a flexible and scalable solution for 3D urban scene understanding.

[56] AutoOEP – A Multi-modal Framework for Online Exam Proctoring

Aryan Kashyap Naveen,Bhuvanesh Singla,Raajan Wankhade,Shreesha M,Ramu S,Ram Mohana Reddy Guddeti

Main category: cs.CV

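TL;DR: The paper proposes AutoOEP, a multi-modal online exam proctoring framework that combines computer vision and machine learning. Two cameras capture examinee behavior, ArcFace and YOLOv11 analyze facial and hand activity respectively, and an LSTM network computes a real-time cheating probability. Experiments show strong performance in suspicious-activity and prohibited-item detection with low resource use.

Details Motivation: The rapid growth of online education calls for efficient, scalable proctoring systems to safeguard academic integrity. Human proctoring does not scale, and existing automated solutions can be intrusive or limited in what they detect.

Contribution: Proposes the AutoOEP framework, which combines multi-modal analysis (face recognition, pose estimation, object detection, and more) with an LSTM network for real-time analysis of cheating behavior.

Method: The system uses a dual-camera setup and integrates ArcFace face recognition, YOLOv11 object detection, and an LSTM network for temporal behavior analysis.

Result: On a custom dataset, AutoOEP reaches 90.7% accuracy in classifying suspicious activities and an mAP@.5 of 0.57 for prohibited-item detection, and it processes video streams at about 2.4 FPS without a GPU.

Insight: Multi-modal fusion and temporal analysis are key to better automated proctoring, and the system demonstrates feasibility in resource-constrained settings.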

Abstract: The burgeoning of online education has created an urgent need for robust and scalable systems to ensure academic integrity during remote examinations. Traditional human proctoring is often not feasible at scale, while existing automated solutions can be intrusive or fail to detect a wide range of cheating behaviors. This paper introduces AutoOEP (Automated Online Exam Proctoring), a comprehensive, multi-modal framework that leverages computer vision and machine learning to provide effective, automated proctoring. The system utilizes a dual-camera setup to capture both a frontal view of the examinee and a side view of the workspace, minimizing blind spots. Our approach integrates several parallel analyses: the Face Module performs continuous identity verification using ArcFace, along with head pose estimation, gaze tracking, and mouth movement analysis to detect suspicious cues. Concurrently, the Hand Module employs a fine-tuned YOLOv11 model for detecting prohibited items (e.g., mobile phones, notes) and tracks hand proximity to these objects. Features from these modules are aggregated and fed into a Long Short-Term Memory (LSTM) network that analyzes temporal patterns to calculate a real-time cheating probability score. We evaluate AutoOEP on a custom-collected dataset simulating diverse exam conditions. Our system achieves an accuracy of 90.7% in classifying suspicious activities. The object detection component obtains a mean Average Precision (mAP@.5) of 0.57 for prohibited items, and the entire framework processes video streams at approximately 2.4 frames per second without a GPU. The results demonstrate that AutoOEP is an effective and resource-efficient solution for automated proctoring, significantly reducing the need for human intervention and enhancing the integrity of online assessments.

[57] Total Variation Subgradient Guided Image Fusion for Dual-Camera CASSI System

Weiqiang Zhao,Tianzhu Liu,Yuzhe Gui,Yanfeng Gu

Main category: cs.CV

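TL;DR: The paper proposes a total variation (TV) subgradient-based image fusion method for dual-camera CASSI systems. A dynamic regularization strategy guided by RGB/panchromatic reference images addresses the limited performance of traditional methods and the lack of interpretability of deep learning.

Details Motivation: Traditional CASSI systems face ill-posed reconstruction problems at high compression ratios. Traditional model-based methods rely on handcrafted priors and perform poorly, while deep learning methods lack physical interpretability. The authors aim to provide a mathematically rigorous and interpretable reconstruction framework by integrating TV subgradient theory.

Contribution: 1. A dual-camera CASSI reconstruction framework that integrates TV subgradient theory;
2. a dynamic regularization strategy that uses reference images to supply gradient constraints;
3. an end-to-end mathematical model that reduces the computational complexity of the inverse problem.

Method: 1. Based on TV subgradient theory, a similarity function with strict convex optimization guarantees is constructed;
2. a dynamic regularization strategy is introduced, with a mechanism for generating and updating RGB/panchromatic reference images;
3. spatial priors from the auxiliary camera provide subgradient guidance.

Result: Experiments show that the method effectively preserves spatial-spectral structural consistency and performs robustly across diverse reconstruction scenarios.

Insight: The work provides a mathematically interpretable theoretical framework for computational spectral imaging that combines the strengths of traditional models and deep learning, with broad application potential.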

Abstract: Spectral imaging technology has long faced fundamental challenges in balancing spectral, spatial, and temporal resolutions. While compressive sensing-based Coded Aperture Snapshot Spectral Imaging (CASSI) mitigates this trade-off through optical encoding, high compression ratios result in ill-posed reconstruction problems. Traditional model-based methods exhibit limited performance due to reliance on handcrafted inherent image priors, while deep learning approaches are constrained by their black-box nature, which compromises physical interpretability. To address these limitations, we propose a dual-camera CASSI reconstruction framework that integrates total variation (TV) subgradient theory. By establishing an end-to-end SD-CASSI mathematical model, we reduce the computational complexity of solving the inverse problem and provide a mathematically well-founded framework for analyzing multi-camera systems. A dynamic regularization strategy is introduced, incorporating normalized gradient constraints from RGB/panchromatic-derived reference images, which constructs a TV subgradient similarity function with strict convex optimization guarantees. Leveraging spatial priors from auxiliary cameras, an adaptive reference generation and updating mechanism is designed to provide subgradient guidance. Experimental results demonstrate that the proposed method effectively preserves spatial-spectral structural consistency. The theoretical framework establishes an interpretable mathematical foundation for computational spectral imaging, demonstrating robust performance across diverse reconstruction scenarios. The source code is available at https://github.com/bestwishes43/ADMM-TVDS.

[58] Lightweight Metadata-Aware Mixture-of-Experts Masked Autoencoder for Earth Observation

Mohanad Albughdadi

Main category: cs.CV

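TL;DR: The paper proposes a lightweight, metadata-aware Mixture-of-Experts Masked Autoencoder (MoE-MAE) with only 2.5M parameters that performs well on Earth Observation tasks, showing that metadata-aware pretraining benefits small models.

Details Motivation: Large foundation models for Earth Observation are computationally expensive, which limits their accessibility and reuse in downstream tasks. The author explores lightweight architectures as a more practical route to general-purpose EO models.

Contribution: 1. A lightweight metadata-aware MoE-MAE model (2.5M parameters); 2. a combination of sparse expert routing with geo-temporal conditioning; 3. competitive performance even on a task that lacks explicit metadata.

Method: 1. A Mixture-of-Experts (MoE) architecture with sparse routing; 2. geographic (latitude/longitude) and temporal (seasonal/daily cyclic) encodings; 3. pretraining on BigEarthNet-Landsat, with evaluation by linear probes on the frozen encoder.

Result: On the BigEarthNet-Landsat and EuroSAT-Landsat datasets, the model is competitive with much larger models, confirming the benefits of metadata-aware pretraining for label efficiency and transfer.

Insight: A lightweight yet metadata-aware design can effectively improve model performance and offers an efficient, scalable path toward future EO foundation models.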

Abstract: Recent advances in Earth Observation have focused on large-scale foundation models. However, these models are computationally expensive, limiting their accessibility and reuse for downstream tasks. In this work, we investigate compact architectures as a practical pathway toward smaller general-purpose EO models. We propose a Metadata-aware Mixture-of-Experts Masked Autoencoder (MoE-MAE) with only 2.5M parameters. The model combines sparse expert routing with geo-temporal conditioning, incorporating imagery alongside latitude/longitude and seasonal/daily cyclic encodings. We pretrain the MoE-MAE on the BigEarthNet-Landsat dataset and evaluate embeddings from its frozen encoder using linear probes. Despite its small size, the model competes with much larger architectures, demonstrating that metadata-aware pretraining improves transfer and label efficiency. To further assess generalization, we evaluate on the EuroSAT-Landsat dataset, which lacks explicit metadata, and still observe competitive performance compared to models with hundreds of millions of parameters. These results suggest that compact, metadata-aware MoE-MAEs are an efficient and scalable step toward future EO foundation models.
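
One common way to build seasonal/daily cyclic encodings is to map each periodic quantity to a (sin, cos) pair. The sketch below is an assumption about what such an encoding might look like, not the paper's exact scheme; the function names, the periods, and the linear scaling of latitude are all illustrative choices.

```python
import math


def cyclic(value, period):
    """Map a periodic quantity to a (sin, cos) pair on the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def metadata_features(lat_deg, lon_deg, day_of_year, hour):
    """Assumed geo-temporal feature vector; not taken from the paper."""
    lat = lat_deg / 90.0  # latitude does not wrap around, so just scale it
    lon_sin, lon_cos = cyclic(lon_deg, 360.0)
    season_sin, season_cos = cyclic(day_of_year, 365.25)
    hour_sin, hour_cos = cyclic(hour, 24.0)
    return [lat, lon_sin, lon_cos, season_sin, season_cos, hour_sin, hour_cos]


features = metadata_features(45.0, -120.0, 172, 13)
```

The point of the (sin, cos) pair is continuity at the wrap-around: 23:30 and 00:30 end up close together in feature space, which a raw hour value would not give.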

[59] Simulating Sinogram-Domain Motion and Correcting Image-Domain Artifacts Using Deep Learning in HR-pQCT Bone Imaging

Farhan Sadik,Christopher L. Newman,Stuart J. Warden,Rachel K. Surowiec

Main category: cs.CV

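TL;DR: The paper proposes a deep learning method for correcting motion artifacts in HR-pQCT images. An optimized sinogram-domain motion simulation generates paired datasets, and an ESWGAN-GP model performs the correction.

Details Motivation: Rigid-motion artifacts in HR-pQCT images, such as cortical bone streaking and trabecular smearing, hinder accurate assessment of bone microstructure, and standardized degradation models and motion correction methods are currently lacking.

Contribution: 1) An optimized sinogram-domain motion simulation that generates paired motion-corrupted and ground-truth images; 2) the ESWGAN-GP model, which combines edge enhancement and self-attention to correct motion artifacts.

Method: 1) Simulate motion artifacts in the sinogram domain to build paired datasets; 2) use the ESWGAN-GP model with edge-enhancing skip connections and self-attention; 3) apply a VGG-based perceptual loss to reconstruct fine microstructural features.

Result: On the simulated dataset the model reaches an SNR of 26.78, SSIM of 0.81, and VIF of 0.76; on the real-world dataset performance improves further (SNR=29.31, SSIM=0.87, VIF=0.81).

Insight: The paper offers an initial deep learning solution for HR-pQCT motion artifact correction. Although simulated motion may not fully reflect real motion, the work is an important step toward wider adoption of the modality.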

Abstract: Rigid-motion artifacts, such as cortical bone streaking and trabecular smearing, hinder in vivo assessment of bone microstructures in high-resolution peripheral quantitative computed tomography (HR-pQCT). Despite various motion grading techniques, no motion correction methods exist due to the lack of standardized degradation models. We optimize a conventional sinogram-based method to simulate motion artifacts in HR-pQCT images, creating paired datasets of motion-corrupted images and their corresponding ground truth, which enables seamless integration into supervised learning frameworks for motion correction. As such, we propose an Edge-enhanced Self-attention Wasserstein Generative Adversarial Network with Gradient Penalty (ESWGAN-GP) to address motion artifacts in both simulated (source) and real-world (target) datasets. The model incorporates edge-enhancing skip connections to preserve trabecular edges and self-attention mechanisms to capture long-range dependencies, facilitating motion correction. A visual geometry group (VGG)-based perceptual loss is used to reconstruct fine micro-structural features. The ESWGAN-GP achieves a mean signal-to-noise ratio (SNR) of 26.78, structural similarity index measure (SSIM) of 0.81, and visual information fidelity (VIF) of 0.76 for the source dataset, while showing improved performance on the target dataset with an SNR of 29.31, SSIM of 0.87, and VIF of 0.81. The proposed methods address a simplified representation of real-world motion that may not fully capture the complexity of in vivo motion artifacts. Nevertheless, because motion artifacts present one of the foremost challenges to more widespread adoption of this modality, these methods represent an important initial step toward implementing deep learning-based motion correction in HR-pQCT.

[60] Gaze Authentication: Factors Influencing Authentication Performance

Dillon Lohr,Michael J Proulx,Mehedi Hasan Raju,Oleg V Komogortsev

Main category: cs.CV

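TL;DR: The paper studies the key factors that affect state-of-the-art gaze authentication. Large-scale experiments show that using a consistent calibration target depth, fusing calibrated and non-calibrated gaze, and improving signal quality all improve authentication performance.

Details Motivation: To identify the factors that drive gaze authentication performance, in order to improve its accuracy and reliability.

Contribution: Establishes that calibration target depth, fusion of calibrated and non-calibrated gaze, and improved signal quality benefit gaze authentication, and finds that a simple moving average filter can slightly reduce performance.

Method: A state-of-the-art neural network architecture is run on a large-scale dataset to analyze the effects of gaze signal quality, calibration, and filtering.

Result: A consistent calibration target depth, fusion of calibrated and non-calibrated gaze, and better signal quality all improve authentication, while a three-sample moving average filter slightly reduces it.

Insight: Optimizing gaze authentication requires considering calibration strategy and signal quality together; simple filtering may not suit every scenario.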

Abstract: This paper examines the key factors that influence the performance of state-of-the-art gaze-based authentication. Experiments were conducted on a large-scale, in-house dataset comprising 8,849 subjects collected with Meta Quest Pro equivalent hardware running a video oculography-driven gaze estimation pipeline at 72Hz. The state-of-the-art neural network architecture was employed to study the influence of the following factors on authentication performance: eye tracking signal quality, various aspects of eye tracking calibration, and simple filtering on estimated raw gaze. We found that using the same calibration target depth for eye tracking calibration, fusing calibrated and non-calibrated gaze, and improving eye tracking signal quality all enhance authentication performance. We also found that a simple three-sample moving average filter slightly reduces authentication performance in general. While these findings hold true for the most part, some exceptions were noted.
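
For reference, a three-sample moving average filter can be written in a few lines. The abstract does not say whether the authors' filter is causal or centered, or how they treat the first samples; the causal form and the shortened start-up window below are assumptions.

```python
def moving_average_3(samples):
    """Causal three-sample moving average.

    Each output averages the current sample and up to two previous ones,
    so the first two outputs use a shorter window (an assumed edge rule).
    """
    out = []
    for i in range(len(samples)):
        window = samples[max(0, i - 2): i + 1]
        out.append(sum(window) / len(window))
    return out


smoothed = moving_average_3([0, 3, 6, 9])  # -> [0.0, 1.5, 3.0, 6.0]
```

At the 72 Hz sampling rate mentioned in the abstract, three samples span roughly 42 ms, so the filter trades noise reduction against blurring of fast eye movements; the paper reports that, overall, it slightly reduced authentication performance.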

[61] TrueSkin: Towards Fair and Accurate Skin Tone Recognition and Generation

Haoming Lu

Main category: cs.CV

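TL;DR: The paper introduces the TrueSkin dataset, which aims to make skin tone recognition and generation fairer and more accurate. An analysis of bias in existing methods finds that large multimodal models (LMMs) and image generation models handle skin tone poorly, and that training on TrueSkin improves performance markedly.

Details Motivation: Skin tone recognition and generation matter for model fairness, healthcare, and generative AI, but shortcomings in existing methods and datasets lead to poor performance and fairness problems.

Contribution: 1. Introduces the TrueSkin dataset of 7299 images systematically categorized into 6 classes; 2. reveals biases of LMMs and generative models on skin tone tasks; 3. shows that TrueSkin improves both recognition and generation models.

Method: 1. Build the TrueSkin dataset, covering diverse lighting conditions, camera angles, and capture settings; 2. benchmark existing recognition and generation approaches; 3. train and fine-tune models on TrueSkin.

Result: A recognition model trained on TrueSkin is more than 20% more accurate than LMMs and conventional approaches; fine-tuned generative models show significantly better skin tone fidelity.

Insight: Accurate skin tone tasks need a systematic dataset; TrueSkin serves both as a benchmark and as a training resource for improving model fairness and performance.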

Abstract: Skin tone recognition and generation play important roles in model fairness, healthcare, and generative AI, yet they remain challenging due to the lack of comprehensive datasets and robust methodologies. Compared to other human image analysis tasks, state-of-the-art large multimodal models (LMMs) and image generation models struggle to recognize and synthesize skin tones accurately. To address this, we introduce TrueSkin, a dataset with 7299 images systematically categorized into 6 classes, collected under diverse lighting conditions, camera angles, and capture settings. Using TrueSkin, we benchmark existing recognition and generation approaches, revealing substantial biases: LMMs tend to misclassify intermediate skin tones as lighter ones, whereas generative models struggle to accurately produce specified skin tones when influenced by inherent biases from unrelated attributes in the prompts, such as hairstyle or environmental context. We further demonstrate that training a recognition model on TrueSkin improves classification accuracy by more than 20% compared to LMMs and conventional approaches, and fine-tuning with TrueSkin significantly improves skin tone fidelity in image generation models. Our findings highlight the need for comprehensive datasets like TrueSkin, which not only serves as a benchmark for evaluating existing models but also provides a valuable training resource to enhance fairness and accuracy in skin tone recognition and generation tasks.

[62] Policy-Driven Transfer Learning in Resource-Limited Animal Monitoring

Nisha Pillai,Aditi Virupakshaiah,Harrison W. Smith,Amanda J. Ashworth,Prasanna Gowda,Phillip R. Owens,Adam R. Rivers,Bindu Nanduri,Mahalingam Ramkumar

Main category: cs.CV

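TL;DR: The paper proposes a reinforcement learning-based transfer learning framework that uses a UCB algorithm to select the best pre-trained model for animal monitoring automatically, achieving a higher detection rate with less computation time.

Details Motivation: In animal health and population management, automated detection and tracking systems rely on computer vision and UAV technology, but scarce labeled data hampers the development of effective deep learning models. Transfer learning addresses this with pre-trained models, yet choosing the best model remains difficult.

Contribution: Proposes a reinforcement learning-based framework for automatically selecting the best pre-trained model, using a UCB algorithm to guide the selection process.

Method: Within a reinforcement learning framework, a UCB algorithm evaluates and ranks candidate pre-trained models by performance and selects the one best suited to the animal detection task.

Result: Experiments show a higher detection rate than traditional methods together with significantly less computation time.

Insight: Reinforcement learning with a UCB algorithm can efficiently solve model selection for transfer learning in resource-limited settings, providing a new tool for animal monitoring and related fields.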

Abstract: Animal health monitoring and population management are critical aspects of wildlife conservation and livestock management that increasingly rely on automated detection and tracking systems. While Unmanned Aerial Vehicle (UAV) based systems combined with computer vision offer promising solutions for non-invasive animal monitoring across challenging terrains, limited availability of labeled training data remains an obstacle in developing effective deep learning (DL) models for these applications. Transfer learning has emerged as a potential solution, allowing models trained on large datasets to be adapted for resource-limited scenarios such as those with limited data. However, the vast landscape of pre-trained neural network architectures makes it challenging to select optimal models, particularly for researchers new to the field. In this paper, we propose a reinforcement learning (RL)-based transfer learning framework that employs an upper confidence bound (UCB) algorithm to automatically select the most suitable pre-trained model for animal detection tasks. Our approach systematically evaluates and ranks candidate models based on their performance, streamlining the model selection process. Experimental results demonstrate that our framework achieves a higher detection rate while requiring significantly less computational time compared to traditional methods.
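
The upper confidence bound idea can be illustrated with the classic UCB1 rule. In this sketch each arm stands for one candidate pre-trained model and `evaluate(arm)` is a hypothetical callback returning a reward in [0, 1], such as a validation score after a short fine-tuning run; the abstract does not specify the reward, the UCB variant, or the exploration constant, so all of these are assumptions.

```python
import math


def ucb_select(evaluate, n_arms, rounds, c=2.0):
    """UCB1 over `n_arms` candidates; returns (best_arm, pull_counts)."""
    if rounds < n_arms:
        raise ValueError("need at least one round per arm")
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1  # try every candidate once before using the bound
        else:
            # Mean reward plus an exploration bonus that shrinks with pulls.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(c * math.log(t) / counts[a]),
            )
        totals[arm] += evaluate(arm)
        counts[arm] += 1
    means = [totals[a] / counts[a] for a in range(n_arms)]
    return max(range(n_arms), key=means.__getitem__), counts


# Toy usage with fixed, made-up rewards for three hypothetical models.
toy_rewards = [0.2, 0.8, 0.5]
best, pulls = ucb_select(lambda a: toy_rewards[a], n_arms=3, rounds=200)
```

The bonus term is what saves computation: weak candidates are evaluated only often enough to rule them out, and most of the budget goes to the models that look promising.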

[63] Action Hints: Semantic Typicality and Context Uniqueness for Generalizable Skeleton-based Video Anomaly Detection

Canhui Tang,Sanping Zhou,Haoyue Shi,Le Wang

Main category: cs.CV

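TL;DR: The paper proposes a skeleton-based zero-shot video anomaly detection framework that learns action typicality and context uniqueness, improving cross-domain generalization without any target-domain training data.

Details Motivation: Existing skeleton-based methods generalize poorly in zero-shot video anomaly detection, mainly because they rely only on low-level skeleton representations and a domain-limited normality boundary.

Contribution: 1) A language-guided semantic typicality modeling module that uses LLM knowledge to distinguish normal from abnormal behavior; 2) a test-time context uniqueness analysis module that derives scene-adaptive boundaries.

Method: Action typicality (projection into a semantic space with LLM knowledge distillation) is combined with context uniqueness (spatio-temporal difference analysis) to achieve zero-shot anomaly detection without target-domain training.

Result: The method achieves state-of-the-art results among skeleton-based methods on four large-scale VAD datasets (ShanghaiTech, UBnormal, NWPU, UCF-Crime), covering more than 100 unseen surveillance scenes.

Insight: Semantic information and context analysis can effectively address the cross-domain generalization problem of skeleton data without target-domain training data.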

Abstract: Zero-Shot Video Anomaly Detection (ZS-VAD) requires temporally localizing anomalies without target domain training data, which is a crucial task due to various practical concerns, e.g., data privacy or new surveillance deployments. The skeleton-based approach has inherent generalization advantages for ZS-VAD, as it eliminates domain disparities in both background and human appearance. However, existing methods only learn low-level skeleton representations and rely on a domain-limited normality boundary, which cannot generalize well to new scenes with different normal and abnormal behavior patterns. In this paper, we propose a novel zero-shot video anomaly detection framework, unlocking the potential of skeleton data via action typicality and uniqueness learning. Firstly, we introduce a language-guided semantic typicality modeling module that projects skeleton snippets into action semantic space and distills an LLM’s knowledge of typical normal and abnormal behaviors during training. Secondly, we propose a test-time context uniqueness analysis module to finely analyze the spatio-temporal differences between skeleton snippets and then derive scene-adaptive boundaries. Without using any training samples from the target domain, our method achieves state-of-the-art results against skeleton-based methods on four large-scale VAD datasets: ShanghaiTech, UBnormal, NWPU, and UCF-Crime, featuring over 100 unseen surveillance scenes.

[64] Organoid Tracker: A SAM2-Powered Platform for Zero-shot Cyst Analysis in Human Kidney Organoid Videos

Xiaoyu Huang,Lauren M Maxson,Trang Nguyen,Cheng Jack Song,Yuankai Huo

Main category: cs.CV

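TL;DR: The paper introduces Organoid Tracker, a platform built on the SAM2 model that performs zero-shot segmentation to analyze human kidney organoid videos and provides automated quantitative cyst analysis.

Details Motivation: Current manual analysis of kidney organoid videos is limited and cannot make full use of pixel-level and spatio-temporal information; automated tools are needed to improve efficiency and precision.

Contribution: Develops the open-source Organoid Tracker platform, which uses SAM2 for zero-shot segmentation and supports automated spatio-temporal analysis for users without programming experience.

Method: The platform has a modular plugin architecture and, combined with SAM2, performs zero-shot segmentation and automated quantitative analysis (for example cyst formation rate and morphological changes).

Result: The platform produces detailed reports that quantify key cyst metrics, substantially improving research efficiency, and applies to kidney development, PKD modeling, and drug discovery.

Insight: Combining a vision foundation model (SAM2) with a modular platform design yields an efficient, general-purpose automated analysis tool for biomedical research.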

Abstract: Recent advances in organoid models have revolutionized the study of human kidney disease mechanisms and drug discovery by enabling scalable, cost-effective research without the need for animal sacrifice. Here, we present a kidney organoid platform optimized for efficient screening in polycystic kidney disease (PKD). While these systems generate rich spatial-temporal microscopy video datasets, current manual approaches to analysis remain limited to coarse classifications (e.g., hit vs. non-hit), often missing valuable pixel-level and longitudinal information. To help overcome this bottleneck, we developed Organoid Tracker, a graphical user interface (GUI) platform designed with a modular plugin architecture, which empowers researchers to extract detailed, quantitative metrics without programming expertise. Built on the cutting-edge vision foundation model Segment Anything Model 2 (SAM2), Organoid Tracker enables zero-shot segmentation and automated analysis of spatial-temporal microscopy videos. It quantifies key metrics such as cyst formation rate, growth velocity, and morphological changes, while generating comprehensive reports. By providing an extensible, open-source framework, Organoid Tracker offers a powerful solution for improving and accelerating research in kidney development, PKD modeling, and therapeutic discovery. The platform is publicly available as open-source software at https://github.com/hrlblab/OrganoidTracker.

[65] The System Description of CPS Team for Track on Driving with Language of CVPR 2024 Autonomous Grand Challenge

Jinghan Peng,Jingwen Wang,Xing Yu,Dehui Du

Main category: cs.CV

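TL;DR: The report describes a system for the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. It builds on LLaVA models fine-tuned with LoRA and DoRA, adds depth information to improve performance, and uses Chain-of-Thought reasoning, reaching a top score of 0.7799 on the validation set and ranking 1st.

Details Motivation: Combining autonomous driving with natural language processing is an emerging research direction, and controlling driving systems through language instructions has broad application prospects. The report aims to improve system performance with vision language models.

Contribution: 1. Optimization of LLaVA models with the LoRA and DoRA methods; 2. integration of depth information from open-source depth estimation models; 3. Chain-of-Thought reasoning to improve question-answering accuracy.

Method: 1. Train on the DriveLM-nuScenes dataset; 2. fine-tune LLaVA models with LoRA and DoRA; 3. integrate depth information; 4. apply Chain-of-Thought reasoning at inference.

Result: A top score of 0.7799 on the validation set, ranking 1st.

Insight: 1. Combining depth information with a language model can markedly improve performance; 2. Chain-of-Thought reasoning is effective on multiple-choice and yes/no questions; 3. LoRA and DoRA show promise for model fine-tuning.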

Abstract: This report outlines our approach using vision language model systems for the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. We have exclusively utilized the DriveLM-nuScenes dataset for training our models. Our systems are built on the LLaVA models, which we enhanced through fine-tuning with the LoRA and DoRA methods. Additionally, we have integrated depth information from open-source depth estimation models to enrich the training and inference processes. For inference, particularly with multiple-choice and yes/no questions, we adopted a Chain-of-Thought reasoning approach to improve the accuracy of the results. This comprehensive methodology enabled us to achieve a top score of 0.7799 on the validation set leaderboard, ranking 1st on the leaderboard.

[66] Mars Traversability Prediction: A Multi-modal Self-supervised Approach for Costmap Generation

Zongwu Xie,Kaijie Yun,Yang Liu,Yiming Ji,Han Li

Main category: cs.CV

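TL;DR: The paper proposes a multi-modal self-supervised framework for predicting traversability costmaps for planetary rovers. It fuses camera and LiDAR data, trains on IMU-derived labels, and demonstrates robustness and a geometry-dominated learned cost.

Details Motivation: Planetary rovers need accurate terrain traversability prediction. Traditional methods rely on manual annotation or simulation, which is costly and generalizes poorly. This study aims for efficient, low-cost traversability prediction through multi-modal fusion and self-supervised learning.

Contribution: The main contributions are: (1) a high-fidelity, reproducible simulation environment; (2) a self-supervised IMU-based labeling pipeline; (3) a strong multi-modal BEV costmap prediction model.

Method: The method includes: (1) DINOv3 as the image encoder; (2) FiLM-based sensor fusion; (3) an optimization loss combining Huber and smoothness terms. Robustness is tested with experiments such as occluding inputs and adding noise.

Result: MAE/MSE change only slightly in the experiments (for example, MAE rises from 0.0775 to 0.0915 when LiDAR is sparsified), indicating that the model depends strongly on geometric information and is highly robust.

Insight: IMU labels mainly reflect terrain geometry rather than semantics, and current data diversity is limited. Future work can address domain generalization and dataset expansion.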

Abstract: We present a robust multi-modal framework for predicting traversability costmaps for planetary rovers. Our model fuses camera and LiDAR data to produce a bird’s-eye-view (BEV) terrain costmap, trained self-supervised using IMU-derived labels. Key updates include a DINOv3-based image encoder, FiLM-based sensor fusion, and an optimization loss combining Huber and smoothness terms. Experimental ablations (removing image color, occluding inputs, adding noise) show only minor changes in MAE/MSE (e.g. MAE increases from ~0.0775 to 0.0915 when LiDAR is sparsified), indicating that geometry dominates the learned cost and the model is highly robust. We attribute the small performance differences to the IMU labeling primarily reflecting terrain geometry rather than semantics and to limited data diversity. Unlike prior work claiming large gains, we emphasize our contributions: (1) a high-fidelity, reproducible simulation environment; (2) a self-supervised IMU-based labeling pipeline; and (3) a strong multi-modal BEV costmap prediction model. We discuss limitations and future work such as domain generalization and dataset expansion.
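
As a rough illustration of the loss described in this abstract, the following Python sketch combines a Huber data term with a smoothness term over a BEV grid. The function names, the neighbour-difference form of the smoothness term, and the values of `delta` and `smooth_weight` are assumptions made for illustration; the abstract does not specify them.

```python
def huber(residual, delta=1.0):
    """Huber penalty: quadratic near zero, linear in the tails."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def costmap_loss(pred, target, delta=1.0, smooth_weight=0.1):
    """Mean Huber data term plus a weighted smoothness term.

    pred and target are equally sized 2D lists of floats (BEV cost grids).
    The smoothness term is the mean squared difference between horizontally
    and vertically adjacent predicted cells (an assumed form).
    """
    rows, cols = len(pred), len(pred[0])
    data = sum(
        huber(pred[r][c] - target[r][c], delta)
        for r in range(rows)
        for c in range(cols)
    ) / (rows * cols)

    diffs = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                diffs.append((pred[r][c + 1] - pred[r][c]) ** 2)
            if r + 1 < rows:
                diffs.append((pred[r + 1][c] - pred[r][c]) ** 2)
    smooth = sum(diffs) / len(diffs) if diffs else 0.0
    return data + smooth_weight * smooth
```

For scale, the MAE change quoted above (from about 0.0775 to 0.0915) is a relative increase of roughly 18%.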

[67] End-to-End Visual Autonomous Parking via Control-Aided Attention

Chao Chen,Shunyu Yao,Yuanwu He,Tao Feng,Ruojing Song,Yuliang Guo,Xinyu Huang,Chenxu Wu,Ren Liu,Chen Feng

Main category: cs.CV

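TL;DR: Proposes CAA-Policy, an end-to-end imitation learning system that uses a Control-Aided Attention (CAA) mechanism for visual autonomous parking, markedly improving the robustness and generalization of the policy.

Details Motivation: Existing end-to-end learning approaches lack effective synergy between perception and control, which leads to unstable attention and undermines the reliability of policy decisions.

Contribution: 1. Proposes the Control-Aided Attention (CAA) mechanism, which uses control signals to guide attention learning; 2. Trains the attention module in a self-supervised manner for the first time; 3. Adds an auxiliary task and a motion prediction module to improve stability.

Method: Built on a Transformer architecture, the CAA mechanism backpropagates gradients from the control outputs to guide attention learning, combined with short-horizon waypoint prediction and a motion prediction module.

Result: In the CARLA simulator, CAA-Policy surpasses both the end-to-end learning baseline and the modular baseline in accuracy, robustness, and interpretability.

Insight: Attention should be learned from gradients of the control outputs rather than from the training loss, so that it captures the visual features to which action outputs are sensitive, which improves the policy's practical performance.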

Abstract: Precise parking requires an end-to-end system where perception adaptively provides policy-relevant details, especially in critical areas where fine control decisions are essential. End-to-end learning offers a unified framework by directly mapping sensor inputs to control actions, but existing approaches lack effective synergy between perception and control. We find that transformer-based self-attention, when used alone, tends to produce unstable and temporally inconsistent spatial attention, which undermines the reliability of downstream policy decisions over time. Instead, we propose CAA-Policy, an end-to-end imitation learning system that allows control signals to guide the learning of visual attention via a novel Control-Aided Attention (CAA) mechanism. For the first time, we train such an attention module in a self-supervised manner, using backpropagated gradients from the control outputs instead of from the training loss. This strategy encourages the attention to focus on visual features that induce high variance in action outputs, rather than merely minimizing the training loss, a shift we demonstrate leads to a more robust and generalizable policy. To further enhance stability, CAA-Policy integrates short-horizon waypoint prediction as an auxiliary task, and introduces a separately trained motion prediction module to robustly track the target spot over time. Extensive experiments in the CARLA simulator show that CAA-Policy consistently surpasses both the end-to-end learning baseline and the modular BEV segmentation + hybrid A* pipeline, achieving superior accuracy, robustness, and interpretability. Code is released at https://github.com/Joechencc/CAAPolicy.

[68] PanoLora: Bridging Perspective and Panoramic Video Generation with LoRA Adaptation

Zeyu Dong,Yuyang Yin,Yuqi Li,Eric Li,Hao-Xiang Guo,Yikai Wang

Main category: cs.CV

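TL;DR: PanoLora proposes a LoRA-based approach that treats panoramic video generation as an adaptation problem from perspective views to panoramic views, efficiently fine-tuning a pretrained video diffusion model to generate high-quality panoramic videos.

Details Motivation: Generating high-quality 360° panoramic video is hard because of the differences between traditional perspective views and panoramic projections. Existing methods often need complex architectures or large-scale training, which is inefficient and gives suboptimal results. Inspired by the success of LoRA in style transfer, the authors model panoramic video generation as a view adaptation problem.

Contribution: 1. Applies LoRA to panoramic video generation for the first time; 2. Shows theoretically that LoRA can effectively model the transformation between projections when its rank exceeds the degrees of freedom in the task; 3. Fine-tunes a pretrained model with only about 1,000 videos to achieve high-quality panoramic generation.

Method: 1. Models panoramic video generation as an adaptation problem from perspective views to panoramic views; 2. Fine-tunes a pretrained video diffusion model with LoRA; 3. Uses theoretical analysis to establish that the LoRA rank must exceed the degrees of freedom in the task.

Result: Experiments show that PanoLora outperforms existing methods in visual quality, left-right consistency, and motion diversity while maintaining proper projection geometry.

Insight: LoRA is useful not only for tasks such as style transfer but also for complex projection transformation problems. The theoretical analysis of the task's degrees of freedom gives a basis for choosing the LoRA rank.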

Abstract: Generating high-quality 360° panoramic videos remains a significant challenge due to the fundamental differences between panoramic and traditional perspective-view projections. While perspective videos rely on a single viewpoint with a limited field of view, panoramic content requires rendering the full surrounding environment, making it difficult for standard video generation models to adapt. Existing solutions often introduce complex architectures or large-scale training, leading to inefficiency and suboptimal results. Motivated by the success of Low-Rank Adaptation (LoRA) in style transfer tasks, we propose treating panoramic video generation as an adaptation problem from perspective views. Through theoretical analysis, we demonstrate that LoRA can effectively model the transformation between these projections when its rank exceeds the degrees of freedom in the task. Our approach efficiently fine-tunes a pretrained video diffusion model using only approximately 1,000 videos while achieving high-quality panoramic generation. Experimental results demonstrate that our method maintains proper projection geometry and surpasses previous state-of-the-art approaches in visual quality, left-right consistency, and motion diversity.
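
To make the role of the rank concrete, here is a minimal Python sketch of the standard LoRA update, in which a frozen weight W is adapted as W + (alpha / r) · B A. It is not PanoLora's code: the helper names and the toy dimensions are assumptions, and the matrices are plain lists so the example runs without dependencies.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A for a LoRA-adapted layer.

    w: frozen d_out x d_in weight; b: d_out x r; a: r x d_in; r is the rank.
    """
    r = len(a)
    delta = matmul(b, a)  # the update has rank at most r
    scale = alpha / r
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def trainable_params(d_out, d_in, r):
    """Parameters trained by LoRA (A and B) versus full fine-tuning of W."""
    return r * (d_in + d_out), d_out * d_in
```

Because the update B A has rank at most r, it cannot represent a change that needs more independent directions than r, which is the intuition behind choosing a rank above the task's degrees of freedom. For a hypothetical 1024 × 1024 layer with r = 8, `trainable_params` gives 16,384 trained values against 1,048,576 for full fine-tuning.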

[69] 3DAeroRelief: The first 3D Benchmark UAV Dataset for Post-Disaster Assessment

Nhut Le,Ehsan Karimi,Maryam Rahnemoonfar

Main category: cs.CV

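TL;DR: This paper introduces 3DAeroRelief, the first 3D UAV benchmark dataset for post-disaster assessment. It fills the gap left by existing 3D datasets, which focus on urban or indoor scenes, and experiments show its potential for 3D semantic segmentation.

Details Motivation: Existing natural disaster analysis relies mainly on 2D imagery, which lacks depth and suffers from occlusions. 3D semantic segmentation offers richer scene understanding, but there is currently no 3D benchmark dataset for post-disaster scenes.

Contribution: 1. Presents 3DAeroRelief, the first 3D UAV dataset for post-disaster assessment; 2. The dataset is collected with low-cost UAVs over real hurricane-damaged regions and contains dense 3D point clouds with fine-grained semantic annotations; 3. Highlights the challenges and opportunities the dataset poses for 3D scene understanding.

Method: 1. Collects data over disaster-affected areas with UAVs; 2. Reconstructs dense 3D point clouds via Structure-from-Motion and Multi-View Stereo; 3. Labels 2D images manually and projects the labels into 3D space; 4. Evaluates several state-of-the-art 3D segmentation models.

Result: The dataset is a valuable resource for post-disaster 3D scene understanding. Experiments show the challenges existing models face on the new dataset, advancing the use of 3D vision systems in real post-disaster scenarios.

Insight: Low-cost UAVs combined with 3D reconstruction offer flexible, safe data collection for post-disaster assessment and may further advance 3D vision in emergency response.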

Abstract: Timely assessment of structural damage is critical for disaster response and recovery. However, most prior work in natural disaster analysis relies on 2D imagery, which lacks depth, suffers from occlusions, and provides limited spatial context. 3D semantic segmentation offers a richer alternative, but existing 3D benchmarks focus mainly on urban or indoor scenes, with little attention to disaster-affected areas. To address this gap, we present 3DAeroRelief–the first 3D benchmark dataset specifically designed for post-disaster assessment. Collected using low-cost unmanned aerial vehicles (UAVs) over hurricane-damaged regions, the dataset features dense 3D point clouds reconstructed via Structure-from-Motion and Multi-View Stereo techniques. Semantic annotations were produced through manual 2D labeling and projected into 3D space. Unlike existing datasets, 3DAeroRelief captures 3D large-scale outdoor environments with fine-grained structural damage in real-world disaster contexts. UAVs enable affordable, flexible, and safe data collection in hazardous areas, making them particularly well-suited for emergency scenarios. To demonstrate the utility of 3DAeroRelief, we evaluate several state-of-the-art 3D segmentation models on the dataset to highlight both the challenges and opportunities of 3D scene understanding in disaster response. Our dataset serves as a valuable resource for advancing robust 3D vision systems in real-world applications for post-disaster scenarios.

[70] Filling the Gaps: A Multitask Hybrid Multiscale Generative Framework for Missing Modality in Remote Sensing Semantic Segmentation

Nhi Kieu,Kien Nguyen,Arnold Wiliem,Clinton Fookes,Sridha Sridharan

Main category: cs.CV

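TL;DR: To address missing modalities in multimodal remote sensing semantic segmentation, the paper proposes GEMMNet, a multitask hybrid multiscale generative framework that combines HyFEx, HyFMA, and CoLoss and significantly improves the performance of generative models.

Details Motivation: In real-world scenarios, multimodal signals are often missing because of sensor failures or adverse weather, which degrades model performance. Existing generative methods handle the heterogeneity of multimodal data poorly, so a new approach is needed.

Contribution: Proposes the GEMMNet framework, comprising the HyFEx feature extractor, the HyFMA multiscale fusion module, and the CoLoss complementary loss, which mitigates the performance drop and bias caused by missing modalities.

Method: 1) HyFEx extracts modality-specific features; 2) HyFMA performs multiscale modality fusion; 3) CoLoss reduces bias through multitask consistency.

Result: On the Vaihingen and Potsdam datasets, GEMMNet outperforms AE, cGAN, and the non-generative methods mmformer and shaspec.

Insight: Multiscale fusion and the complementary loss design are key, helping improve the semantic understanding and robustness of generative models in complex scenes.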

Abstract: Multimodal learning has shown a significant performance boost over ordinary unimodal models across various domains. However, in real-world scenarios, multimodal signals are often missing because of sensor failures and adverse weather conditions, which drastically degrades models’ operation and performance. Generative models such as AutoEncoder (AE) and Generative Adversarial Network (GAN) are intuitive solutions aiming to reconstruct the missing modality from the available ones. Yet, their efficacy in remote sensing semantic segmentation remains underexplored. In this paper, we first examine the limitations of existing generative approaches in handling the heterogeneity of multimodal remote sensing data. They inadequately capture semantic context in complex scenes with large intra-class and small inter-class variation. In addition, traditional generative models are susceptible to heavy dependence on the dominant modality, introducing bias that affects model robustness under missing modality conditions. To tackle these limitations, we propose a novel Generative-Enhanced MultiModal learning Network (GEMMNet) with three key components: (1) Hybrid Feature Extractor (HyFEx) to effectively learn modality-specific representations, (2) Hybrid Fusion with Multiscale Awareness (HyFMA) to capture modality-synergistic semantic context across scales and (3) Complementary Loss (CoLoss) scheme to alleviate the inherent bias by encouraging consistency across modalities and tasks. Our method, GEMMNet, outperforms both the generative baselines, AE and cGAN (conditional GAN), and the state-of-the-art non-generative approaches, mmformer and shaspec, on two challenging semantic segmentation remote sensing datasets (Vaihingen and Potsdam). Source code is made available.

[71] WildSmoke: Ready-to-Use Dynamic 3D Smoke Assets from a Single Video in the Wild

Yuqiu Liu,Jialin Song,Manolis Savva,Wuyang Chen

Main category: cs.CV

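TL;DR: The paper proposes a pipeline that extracts and reconstructs dynamic 3D smoke assets from a single in-the-wild video and supports interactive smoke design and editing.

Details Motivation: Current fluid dynamics reconstruction relies mainly on lab environments, while smoke reconstruction from real-world videos remains underexplored. The paper aims to fill this gap.

Contribution: 1. Designs a pipeline for reconstructing high-quality 3D smoke from in-the-wild videos; 2. Enables interactive smoke editing and simulation; 3. Surpasses existing methods in quality (+2.22 PSNR).

Method: 1. Extracts smoke via background removal; 2. Initializes smoke particles and camera poses; 3. Infers multi-view videos; 4. Integrates simulation to enable editing.

Result: Achieves high-quality smoke reconstruction on wild videos (+2.22 average PSNR) and supports diverse, realistic editing of fluid dynamics.

Insight: The method moves beyond lab settings and provides a practical tool for reconstructing and editing dynamic 3D smoke from real-world videos.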

Abstract: We propose a pipeline to extract and reconstruct dynamic 3D smoke assets from a single in-the-wild video, and further integrate interactive simulation for smoke design and editing. Recent developments in 3D vision have significantly improved reconstructing and rendering fluid dynamics, supporting realistic and temporally consistent view synthesis. However, current fluid reconstructions rely heavily on carefully controlled clean lab environments, whereas real-world videos captured in the wild are largely underexplored. We pinpoint three key challenges of reconstructing smoke in real-world videos and design targeted techniques, including smoke extraction with background removal, initialization of smoke particles and camera poses, and inferring multi-view videos. Our method not only outperforms previous reconstruction and generation methods with high-quality smoke reconstructions (+2.22 average PSNR on wild videos), but also enables diverse and realistic editing of fluid dynamics by simulating our smoke assets. We provide our models, data, and 4D smoke assets at https://autumnyq.github.io/WildSmoke.

[72] SVR-GS: Spatially Variant Regularization for Probabilistic Masks in 3D Gaussian Splatting

Ashkan Taghipour,Vahid Naghshin,Benjamin Southwell,Farid Boussaid,Hamid Laga,Mohammed Bennamoun

Main category: cs.CV

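TL;DR: SVR-GS proposes a spatially variant regularizer that renders a per-pixel spatial mask from each Gaussian's effective contribution along the ray, significantly reducing the number of Gaussians while preserving image quality.

Details Motivation: Existing pruning methods for 3D Gaussian Splatting, such as MaskGS, regularize the global mean of the mask. This is misaligned with the local per-pixel reconstruction loss and leaves the number of Gaussians insufficiently optimized.

Contribution: Proposes SVR-GS, a spatially variant regularizer that renders per-pixel masks and applies sparsity pressure to low-importance Gaussians, significantly reducing the number of Gaussians and improving efficiency.

Method: Designs three spatial-mask aggregation strategies, implements them in CUDA, and uses a gradient analysis to motivate the final design.

Result: On Tanks&Temples and other datasets, SVR-GS reduces the number of Gaussians on average by 1.79× compared to MaskGS and 5.63× compared to 3DGS, with PSNR drops of only 0.50 dB and 0.40 dB, respectively.

Insight: Spatially variant regularization better matches the objective of the local reconstruction loss, yielding smaller, faster models suited to real-time applications.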

Abstract: 3D Gaussian Splatting (3DGS) enables fast, high-quality novel view synthesis but typically relies on densification followed by pruning to optimize the number of Gaussians. Existing mask-based pruning, such as MaskGS, regularizes the global mean of the mask, which is misaligned with the local per-pixel (per-ray) reconstruction loss that determines image quality along individual camera rays. This paper introduces SVR-GS, a spatially variant regularizer that renders a per-pixel spatial mask from each Gaussian’s effective contribution along the ray, thereby applying sparsity pressure where it matters: on low-importance Gaussians. We explore three spatial-mask aggregation strategies, implement them in CUDA, and conduct a gradient analysis to motivate our final design. Extensive experiments on Tanks&Temples, Deep Blending, and Mip-NeRF360 datasets demonstrate that, on average across the three datasets, the proposed SVR-GS reduces the number of Gaussians by 1.79× compared to MaskGS and 5.63× compared to 3DGS, while incurring only 0.50 dB and 0.40 dB PSNR drops, respectively. These gains translate into significantly smaller, faster, and more memory-efficient models, making them well-suited for real-time applications such as robotics, AR/VR, and mobile perception.
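
The contrast between a global-mean mask regularizer and a per-ray one can be sketched in a few lines of Python. The compositing weights below are the standard front-to-back alpha-blending weights used in Gaussian splatting; the contribution-weighted sum in `per_ray_mask` is only one assumed way to aggregate masks along a ray, since the abstract does not say which of the three strategies SVR-GS adopts.

```python
def ray_contributions(alphas):
    """Front-to-back compositing weights w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights = []
    transmittance = 1.0
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha
    return weights


def global_mask_penalty(masks):
    """Scene-wide regularizer: the mean mask value, blind to where Gaussians render."""
    return sum(masks) / len(masks)


def per_ray_mask(masks, alphas):
    """Mask value rendered for one pixel, weighted by each Gaussian's contribution.

    masks and alphas list the Gaussians hit by the ray, ordered front to back.
    """
    return sum(m * w for m, w in zip(masks, ray_contributions(alphas)))
```

With two Gaussians of opacity 0.5 on a ray, the weights are 0.5 and 0.25, so the occluded Gaussian receives half the weight of the front one in the rendered mask, whereas the global mean treats both identically.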

[73] Traffic-MLLM: A Spatio-Temporal MLLM with Retrieval-Augmented Generation for Causal Inference in Traffic

Waikit Xiu,Qiang Lu,Xiying Li,Chen Hu,Shengbo Sun

Main category: cs.CV

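TL;DR: The paper proposes Traffic-MLLM, a multimodal large language model for fine-grained analysis and causal reasoning over traffic video, which uses retrieval-augmented generation and chain-of-thought reasoning to improve performance in complex scenarios.

Details Motivation: Traffic video understanding is crucial in intelligent transportation systems, but existing methods fall short in modeling spatiotemporal causality and integrating domain knowledge.

Contribution: 1) Proposes Traffic-MLLM, built on the Qwen2.5-VL backbone; 2) Introduces a knowledge prompting module that fuses Retrieval-Augmented Generation (RAG) with Chain-of-Thought (CoT) reasoning; 3) Achieves state-of-the-art performance on the TrafficQA and DriveQA benchmarks.

Method: 1) Uses LoRA for lightweight fine-tuning; 2) Leverages high-quality traffic-specific multimodal datasets; 3) Designs a knowledge prompting module that fuses CoT and RAG.

Result: Performs strongly on the TrafficQA and DriveQA benchmarks and shows zero-shot reasoning and cross-scenario generalization.

Insight: Retrieval-augmented generation and chain-of-thought reasoning markedly improve the model's logical reasoning and its adaptation to domain knowledge, offering a new approach to traffic video understanding.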

Abstract: As intelligent transportation systems advance, traffic video understanding plays an increasingly pivotal role in comprehensive scene perception and causal analysis. Yet, existing approaches face notable challenges in accurately modeling spatiotemporal causality and integrating domain-specific knowledge, limiting their effectiveness in complex scenarios. To address these limitations, we propose Traffic-MLLM, a multimodal large language model tailored for fine-grained traffic analysis. Built on the Qwen2.5-VL backbone, our model leverages high-quality traffic-specific multimodal datasets and uses Low-Rank Adaptation (LoRA) for lightweight fine-tuning, significantly enhancing its capacity to model continuous spatiotemporal features in video sequences. Furthermore, we introduce an innovative knowledge prompting module fusing Chain-of-Thought (CoT) reasoning with Retrieval-Augmented Generation (RAG), enabling precise injection of detailed traffic regulations and domain knowledge into the inference process. This design markedly boosts the model’s logical reasoning and knowledge adaptation capabilities. Experimental results on TrafficQA and DriveQA benchmarks show Traffic-MLLM achieves state-of-the-art performance, validating its superior ability to process multimodal traffic data. It also exhibits remarkable zero-shot reasoning and cross-scenario generalization capabilities.

[74] Multispectral-NeRF: a multispectral modeling approach based on neural radiance fields

Hong Zhang,Fei Guo,Zihan Xie,Dizhao Yao

Main category: cs.CV

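TL;DR: The paper proposes Multispectral-NeRF, a NeRF-based multispectral modeling approach that expands hidden layer dimensionality, redesigns the residual functions, and adapts the data compression modules, addressing the inability of existing NeRF models to handle multi-band information.

Details Motivation: Traditional 3D reconstruction relies on RGB spectral information, while existing methods that integrate multispectral data are costly, inaccurate, and give poor geometric features. NeRF-based approaches can address these issues, but existing NeRF models support only three-band data and cannot exploit multi-band information.

Contribution: 1. Expands hidden layer dimensionality to support 6-band spectral inputs; 2. Redesigns the residual functions to optimize spectral discrepancy calculations between reconstructed and reference images; 3. Adapts the data compression modules to the higher bit depth of multispectral imagery.

Method: Proposes Multispectral-NeRF, an enhanced neural architecture that integrates multispectral information through hidden layer expansion, residual function optimization, and adapted data compression modules.

Result: Experiments confirm that Multispectral-NeRF successfully processes multi-band spectral features while accurately preserving the spectral characteristics of the original scenes.

Insight: Multispectral data gives 3D reconstruction richer information, and an extended NeRF-based model can use it efficiently to improve reconstruction accuracy and quality.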

Abstract: 3D reconstruction technology generates three-dimensional representations of real-world objects, scenes, or environments using sensor data such as 2D images, with extensive applications in robotics, autonomous vehicles, and virtual reality systems. Traditional 3D reconstruction techniques based on 2D images typically rely on RGB spectral information. With advances in sensor technology, additional spectral bands beyond RGB have been increasingly incorporated into 3D reconstruction workflows. Existing methods that integrate these expanded spectral data often suffer from high cost, low accuracy, and poor geometric features. Three-dimensional reconstruction based on NeRF can effectively address the various issues in current multispectral 3D reconstruction methods, producing high-precision and high-quality reconstruction results. However, NeRF and some improved models such as NeRFacto are currently trained on three-band data and cannot take multi-band information into account. To address this problem, we propose Multispectral-NeRF, an enhanced neural architecture derived from NeRF that can effectively integrate multispectral information. Our technical contributions comprise three modifications: expanding hidden layer dimensionality to accommodate 6-band spectral inputs; redesigning residual functions to optimize spectral discrepancy calculations between reconstructed and reference images; and adapting data compression modules to address the increased bit-depth requirements of multispectral imagery. Experimental results confirm that Multispectral-NeRF successfully processes multi-band spectral features while accurately preserving the original scenes’ spectral characteristics.
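
The abstract does not give the form of the redesigned residual function or the bit depth involved, so the Python sketch below only illustrates the general idea under stated assumptions: raw values are taken to be 16-bit integers, and the spectral discrepancy is taken to be a plain mean squared error over all six bands. Both function names are hypothetical.

```python
def normalize_pixel(bands, bit_depth=16):
    """Scale raw integer band values to [0, 1] for the given sensor bit depth."""
    max_value = (1 << bit_depth) - 1
    return [value / max_value for value in bands]


def spectral_mse(rendered, reference):
    """Mean squared error over every pixel and every spectral band.

    rendered and reference are lists of pixels; each pixel is a list of band
    values (six per pixel in the multispectral setting described above).
    """
    total, count = 0.0, 0
    for pred_pixel, ref_pixel in zip(rendered, reference):
        if len(pred_pixel) != len(ref_pixel):
            raise ValueError("pixels must have the same number of bands")
        for pred, ref in zip(pred_pixel, ref_pixel):
            total += (pred - ref) ** 2
            count += 1
    return total / count
```

Normalizing with an 8-bit maximum of 255 would push higher bit-depth values far above 1, which is one reason a pipeline built for RGB images needs adapting before it can take multispectral input.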

[75] SPHERE: Semantic-PHysical Engaged REpresentation for 3D Semantic Scene Completion

Zhiwen Yang,Yuxin Peng

Main category: cs.CV

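TL;DR: SPHERE proposes a joint semantic-physical representation that combines voxel and Gaussian representations for 3D semantic scene completion. With semantic-guided Gaussian initialization and a physical-aware harmonics enhancement module, it produces completions with realistic geometric details and semantic consistency.

Details Motivation: Existing 3D semantic scene completion methods face a tension between capturing physical regularities and semantic accuracy. Voxel-based and plane-based methods struggle to represent realistic geometric details, while neural reconstruction methods such as NeRF and 3DGS have better physical awareness but are computationally expensive and less semantically accurate. SPHERE aims to address these issues.

Contribution: 1) Proposes the SPHERE framework, which jointly uses voxel and Gaussian representations; 2) Designs the SGI module, which uses semantic guidance to initialize Gaussians efficiently; 3) Develops the PHE module, which uses spherical harmonics to enhance physical-aware details and semantic-geometry consistency.

Method: SPHERE combines voxel and Gaussian representations. The SGI module uses dual-branch 3D scene representations to locate focal voxels as anchors that guide Gaussian initialization, and the PHE module models physical-aware context with semantic spherical harmonics and performs focal distribution alignment.

Result: Performs strongly on the SemanticKITTI and SSCBench-KITTI-360 benchmarks, validating the effectiveness of the method.

Insight: Jointly representing semantics and physics can significantly improve scene completion quality, and spherical harmonics show promise for modeling physical-aware details.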

Abstract: Camera-based 3D Semantic Scene Completion (SSC) is a critical task in autonomous driving systems, assessing voxel-level geometry and semantics for holistic scene perception. While existing voxel-based and plane-based SSC methods have achieved considerable progress, they struggle to capture physical regularities for realistic geometric details. On the other hand, neural reconstruction methods like NeRF and 3DGS demonstrate superior physical awareness, but suffer from high computational cost and slow convergence when handling large-scale, complex autonomous driving scenes, leading to inferior semantic accuracy. To address these issues, we propose the Semantic-PHysical Engaged REpresentation (SPHERE) for camera-based SSC, which integrates voxel and Gaussian representations for joint exploitation of semantic and physical information. First, the Semantic-guided Gaussian Initialization (SGI) module leverages dual-branch 3D scene representations to locate focal voxels as anchors to guide efficient Gaussian initialization. Then, the Physical-aware Harmonics Enhancement (PHE) module incorporates semantic spherical harmonics to model physical-aware contextual details and promote semantic-geometry consistency through focal distribution alignment, generating SSC results with realistic details. Extensive experiments and analyses on the popular SemanticKITTI and SSCBench-KITTI-360 benchmarks validate the effectiveness of SPHERE. The code is available at https://github.com/PKU-ICST-MIPL/SPHERE_ACMMM2025.

[76] StegOT: Trade-offs in Steganography via Optimal Transport

Chengde Lin,Xuezhu Gong,Shuxue Ding,Mingzhe Yang,Xijun Lu,Chengjun Mo

Main category: cs.CV

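TL;DR: The paper proposes StegOT, an autoencoder-based steganography model built on optimal transport theory. Its multiple channel optimal transport (MCOT) module addresses mode collapse in steganography and achieves an information trade-off between the cover and secret images.

Details Motivation: Existing steganography models (e.g. GAN- and VAE-based ones) suffer from mode collapse, which causes an information imbalance between the cover and secret images in the stego image and affects subsequent extraction. To address this, the paper proposes an improved method that incorporates optimal transport theory.

Contribution: 1. Proposes the StegOT model, which incorporates optimal transport theory; 2. Designs the MCOT module, which transforms a multi-peak feature distribution into a single-peak one to achieve the information trade-off; 3. Shows experimentally that the model improves the quality of both stego and recovered images.

Method: Within an autoencoder framework, the MCOT module reshapes the feature distribution, using optimal transport theory to balance the information from the cover and secret images and avoid mode collapse.

Result: Experiments show that StegOT not only achieves the information trade-off but also significantly improves the quality of stego and recovered images.

Insight: Applying optimal transport theory to steganography can effectively address mode collapse, and the MCOT module design offers a new approach to optimizing feature distributions.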

Abstract: Image hiding is often referred to as steganography, which aims to hide a secret image in a cover image of the same resolution. Many steganography models are based on generative adversarial networks (GANs) and variational autoencoders (VAEs). However, most existing models suffer from mode collapse. Mode collapse will lead to an information imbalance between the cover and secret images in the stego image and further affect the subsequent extraction. To address these challenges, this paper proposes StegOT, an autoencoder-based steganography model incorporating optimal transport theory. We designed the multiple channel optimal transport (MCOT) module to transform the feature distribution, which exhibits multiple peaks, into a single peak to achieve the trade-off of information. Experiments demonstrate that we not only achieve a trade-off between the cover and secret images but also enhance the quality of both the stego and recovery images. The source code will be released on https://github.com/Rss1124/StegOT.

[77] The Impact of Skin Tone Label Granularity on the Performance and Fairness of AI Based Dermatology Image Classification Models

Partha Shah,Durva Sankhe,Maariyah Rashid,Zakaa Khaled,Esther Puyol-Antón,Tiarna Lee,Maram Alqarni,Sweta Rai,Andrew P. King

Main category: cs.CV

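TL;DR: The paper studies how the granularity of Fitzpatrick Skin Tone (FST) labels affects the performance and fairness of skin lesion classification models. It finds that coarser granularity can hurt performance and recommends moving to a fairer alternative scale.

Details Motivation: The FST scale has finer granularity in its lighter-skin categories, which may cause AI models to perform worse on darker skin and raises fairness concerns.

Contribution: Shows how FST label granularity affects model performance and calls for fair AI research to move away from the FST scale towards a fairer alternative.

Method: Trains multiple models on FST-specific data of differing granularity (e.g. FST 1/2, 3/4, 5/6 vs. 1/2/3/4) and compares their performance at classifying benign vs. malignant skin lesions.

Result: Models trained on FST-specific data from three groups generally outperform a general model trained on FST-balanced data. Reducing the granularity of FST information (e.g. merging 1/2 and 3/4) can have a detrimental effect on performance.

Insight: FST label granularity is a key factor in model performance. The choice of categories in the FST scale may embed human biases, so fairer representations of skin tone are needed.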

Abstract: Artificial intelligence (AI) models to automatically classify skin lesions from dermatology images have shown promising performance but also susceptibility to bias by skin tone. The most common way of representing skin tone information is the Fitzpatrick Skin Tone (FST) scale. The FST scale has been criticised for having greater granularity in its skin tone categories for lighter-skinned subjects. This paper conducts an investigation of the impact (on performance and bias) on AI classification models of granularity in the FST scale. By training multiple AI models to classify benign vs. malignant lesions using FST-specific data of differing granularity, we show that: (i) when training models using FST-specific data based on three groups (FST 1/2, 3/4 and 5/6), performance is generally better for models trained on FST-specific data compared to a general model trained on FST-balanced data; (ii) reducing the granularity of FST scale information (from 1/2 and 3/4 to 1/2/3/4) can have a detrimental effect on performance. Our results highlight the importance of the granularity of FST groups when training lesion classification models. Given the question marks over possible human biases in the choice of categories in the FST scale, this paper provides evidence for a move away from the FST scale in fair AI research and a transition to an alternative scale that better represents the diversity of human skin tones.
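
The two label granularities compared in this abstract can be written as simple lookup tables. The Python sketch below is illustrative only: the group names follow the abstract, while keeping FST 5/6 as its own group in the coarse setting, the `(image_id, fst_type)` sample format, and the function name are assumptions.

```python
# Three-group setting named in the abstract: FST 1/2, 3/4 and 5/6.
FINE = {1: "FST 1/2", 2: "FST 1/2", 3: "FST 3/4", 4: "FST 3/4",
        5: "FST 5/6", 6: "FST 5/6"}

# Reduced granularity: types 1 to 4 merged into one group.
# Leaving FST 5/6 unchanged here is an assumption of this sketch.
COARSE = {1: "FST 1/2/3/4", 2: "FST 1/2/3/4", 3: "FST 1/2/3/4",
          4: "FST 1/2/3/4", 5: "FST 5/6", 6: "FST 5/6"}


def split_by_group(samples, grouping):
    """Partition (image_id, fst_type) pairs into one training subset per group."""
    subsets = {}
    for image_id, fst_type in samples:
        subsets.setdefault(grouping[fst_type], []).append(image_id)
    return subsets
```

An FST-specific model would then be trained on each subset returned by `split_by_group`, and the comparison in the abstract amounts to swapping `FINE` for `COARSE` while holding the rest of the training setup fixed.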

[78] Scaling Up Forest Vision with Synthetic Data

Yihang She,Andrew Blake,David Coomes,Srinivasan Keshav

Main category: cs.CV

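TL;DR: The paper uses synthetic data for tree segmentation in forest laser scans. A pipeline combining game engines with physics-based simulation generates a large-scale synthetic dataset, substantially reducing the need for labelled real data.

Details Motivation: Existing public 3D forest datasets are not large enough to train robust tree segmentation systems, and collecting and annotating real data is expensive. Motivated by the success of synthetic data in other domains such as self-driving, the authors investigate synthetic data for tree segmentation.

Contribution: The main contributions are: 1) a new synthetic data generation pipeline; 2) a large-scale, diverse, annotated 3D forest dataset; 3) evidence that synthetic data reduces the need for labelled real data.

Method: The pipeline integrates game engines with physics-based LiDAR simulation. The model is pretrained on synthetic data and then fine-tuned on a small amount of real data.

Result: After fine-tuning on a single real forest plot of less than 0.1 hectare, the pretrained model is competitive with a model trained on the full real dataset. The critical factors are physics, diversity, and scale.

Insight: Synthetic data can substantially reduce dependence on labelled real data, and its effectiveness depends on physically realistic simulation, data diversity, and dataset scale. The approach points towards more robust 3D forest vision systems.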

Abstract: Accurate tree segmentation is a key step in extracting individual tree metrics from forest laser scans, and is essential to understanding ecosystem functions in carbon cycling and beyond. Over the past decade, tree segmentation algorithms have advanced rapidly due to developments in AI. However, existing public 3D forest datasets are not large enough to build robust tree segmentation systems. Motivated by the success of synthetic data in other domains such as self-driving, we investigate whether similar approaches can help with tree segmentation. In place of expensive field data collection and annotation, we use synthetic data during pretraining, and then require only minimal real forest plot annotation for fine-tuning. We have developed a new synthetic data generation pipeline to do this for forest vision tasks, integrating advances in game engines with physics-based LiDAR simulation. As a result, we have produced a comprehensive, diverse, annotated 3D forest dataset on an unprecedented scale. Extensive experiments with a state-of-the-art tree segmentation algorithm and a popular real dataset show that our synthetic data can substantially reduce the need for labelled real data. After fine-tuning on just a single real forest plot of less than 0.1 hectare, the pretrained model achieves segmentations that are competitive with a model trained on the full-scale real data. We have also identified critical factors for successful use of synthetic data: physics, diversity, and scale, paving the way for more robust 3D forest vision systems in the future. Our data generation pipeline and the resulting dataset are available at https://github.com/yihshe/CAMP3D.git.

[79] Beyond Sliders: Mastering the Art of Diffusion-based Image Manipulation

Yufei Tang,Daiheng Gao,Pingyu Wu,Wenbo Zhou,Bang Zhang,Weiming Zhang

Main category: cs.CV

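TL;DR: The paper proposes Beyond Sliders, a framework that combines GANs and diffusion models to improve manipulation of real-world images, refining image quality with fine-grained textual and visual guidance.

Details Motivation: Existing methods such as concept sliders perform poorly on non-AIGC images (e.g. real-world images) and need improvement to enable more realistic and flexible manipulation.

Contribution: Proposes the Beyond Sliders framework, which integrates GANs with diffusion models and adds fine-grained textual and visual guidance, markedly improving image quality and realism.

Method: Combines adversarially trained GANs with diffusion models and refines the image through fine-grained multimodal (textual and visual) guidance.

Result: Experiments confirm the robustness and versatility of the method, which performs well across diverse applications.

Insight: Fine-grained multimodal guidance is key to high-quality image manipulation, and combining diffusion models with GANs effectively improves realism.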

Abstract: In the realm of image generation, the quest for realism and customization has never been more pressing. While existing methods like concept sliders have made strides, they often falter on non-AIGC images, particularly images captured in real-world settings. To bridge this gap, we introduce Beyond Sliders, an innovative framework that integrates GANs and diffusion models to facilitate sophisticated image manipulation across diverse image categories. Improving upon concept sliders, our method refines the image through fine-grained textual and visual guidance in an adversarial manner, leading to a marked enhancement in image quality and realism. Extensive experimental validation confirms the robustness and versatility of Beyond Sliders across a spectrum of applications.

[80] Geometrically Constrained and Token-Based Probabilistic Spatial Transformers

Johann Schmidt,Sebastian Stober

Main category: cs.CV

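TL;DR: The paper proposes a probabilistic, component-wise spatial transformer that uses geometric constraints and uncertainty modeling to improve the robustness of fine-grained visual classification.

Details Motivation: Fine-grained visual classification (FGVC) is sensitive to geometric variability such as rotation, scale, and perspective distortion. Existing approaches such as equivariant architectures are computationally expensive and restrict the hypothesis space. The authors revisit Spatial Transformer Networks (STNs) and propose a more flexible solution.

Contribution: 1. Proposes a probabilistic, component-wise spatial transformer that models uncertainty in geometric transformations; 2. Introduces geometric constraints and a component-wise alignment loss; 3. Regresses parameters efficiently with a shared localization encoder.

Method: 1. Decomposes affine transformations into rotation, scaling, and shearing components; 2. Models each component with a Gaussian variational posterior; 3. Performs sampling-based canonicalization at inference; 4. Uses a component-wise alignment loss that leverages augmentation parameters to guide alignment.

Result: On moth classification benchmarks, the method consistently improves robustness compared to other STNs.

Insight: Combining probabilistic modeling with geometric constraints handles uncertainty in geometric variation effectively while keeping the approach computationally efficient and flexible.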

Abstract: Fine-grained visual classification (FGVC) remains highly sensitive to geometric variability, where objects appear under arbitrary orientations, scales, and perspective distortions. While equivariant architectures address this issue, they typically require substantial computational resources and restrict the hypothesis space. We revisit Spatial Transformer Networks (STNs) as a canonicalization tool for transformer-based vision pipelines, emphasizing their flexibility, backbone-agnostic nature, and lack of architectural constraints. We propose a probabilistic, component-wise extension that improves robustness. Specifically, we decompose affine transformations into rotation, scaling, and shearing, and regress each component under geometric constraints using a shared localization encoder. To capture uncertainty, we model each component with a Gaussian variational posterior and perform sampling-based canonicalization during inference. A novel component-wise alignment loss leverages augmentation parameters to guide spatial alignment. Experiments on challenging moth classification benchmarks demonstrate that our method consistently improves robustness compared to other STNs.
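
The component-wise decomposition can be sketched in plain Python. The code below composes the 2 × 2 linear part of an affine transform from a rotation angle, two scale factors, and a shear factor, then draws each component from its own Gaussian. The composition order R · Shear · Scale, the parameter names, and the sampling helper are assumptions for illustration, and the paper's geometric constraints are not modeled here.

```python
import math
import random


def compose_affine(theta, sx, sy, shear):
    """Build the 2x2 matrix A = R(theta) @ Shear(shear) @ Scale(sx, sy)."""
    c, s = math.cos(theta), math.sin(theta)
    # Shear @ Scale = [[sx, shear * sy], [0, sy]]
    m00, m01, m10, m11 = sx, shear * sy, 0.0, sy
    return [
        [c * m00 - s * m10, c * m01 - s * m11],
        [s * m00 + c * m10, s * m01 + c * m11],
    ]


def sample_affines(mean, std, n_samples, seed=0):
    """Draw each component from N(mean, std^2) and compose one matrix per draw.

    mean and std are dicts with keys "theta", "sx", "sy" and "shear".
    """
    rng = random.Random(seed)
    matrices = []
    for _ in range(n_samples):
        p = {name: rng.gauss(mean[name], std[name]) for name in mean}
        matrices.append(compose_affine(p["theta"], p["sx"], p["sy"], p["shear"]))
    return matrices
```

Because rotation and shear both have determinant 1, the determinant of the composed matrix equals `sx * sy`, so area change is controlled by the scale component alone. This kind of separation is what makes per-component constraints and per-component uncertainty straightforward to express.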

[81] MIS-LSTM: Multichannel Image-Sequence LSTM for Sleep Quality and Stress Prediction

Seongwan Park,Jieun Woo,Siheon Yang

Main category: cs.CV

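TL;DR: MIS-LSTM is a hybrid framework that combines CNN encoders with an LSTM sequence model to predict sleep quality and stress from multimodal lifelog data. It processes the data through multi-channel images and a dedicated 1D-CNN, adds UALRE for robustness, and outperforms the baselines and existing methods in experiments.

Details Motivation: Existing methods for predicting sleep quality and stress typically fail to combine multimodal data with long-range temporal dependencies. MIS-LSTM aims to fill this gap.

Contribution: 1. Proposes MIS-LSTM, a multimodal framework combining CNNs and an LSTM; 2. Introduces UALRE, an uncertainty-aware ensemble that improves prediction robustness; 3. Validates experimentally the benefit of multi-channel images, 4-hour block granularity, and discrete encoding.

Method: 1. Partitions continuous sensor streams into blocks rendered as multi-channel images, and encodes sparse events with a 1D-CNN; 2. Fuses the modalities with a Convolutional Block Attention Module; 3. Uses an LSTM to capture long-range temporal dependencies, with the UALRE ensemble overriding low-confidence majority votes with high-confidence individual predictions.

Result: On the ETRI Lifelog Challenge dataset, MIS-LSTM reaches Macro-F1 0.615, rising to 0.647 with the UALRE ensemble and outperforming the baseline models.

Insight: Multi-channel images outperform stacked-vertical images, a 4-hour block granularity works best, and modality-specific discrete encoding is important for performance.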

Abstract: This paper presents MIS-LSTM, a hybrid framework that joins CNN encoders with an LSTM sequence model for sleep quality and stress prediction at the day level from multimodal lifelog data. Continuous sensor streams are first partitioned into N-hour blocks and rendered as multi-channel images, while sparse discrete events are encoded with a dedicated 1D-CNN. A Convolutional Block Attention Module fuses the two modalities into refined block embeddings, which an LSTM then aggregates to capture long-range temporal dependencies. To further boost robustness, we introduce UALRE, an uncertainty-aware ensemble that overrides low-confidence majority votes with high-confidence individual predictions. Experiments on the 2025 ETRI Lifelog Challenge dataset show that our base MIS-LSTM achieves Macro-F1 0.615; with the UALRE ensemble, the score improves to 0.647, outperforming strong LSTM, 1D-CNN, and CNN baselines. Ablations confirm (i) the superiority of multi-channel over stacked-vertical imaging, (ii) the benefit of a 4-hour block granularity, and (iii) the efficacy of modality-specific discrete encoding.
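
Two steps in this abstract lend themselves to a short Python sketch: cutting a day of readings into N-hour blocks, and letting a confident ensemble member override a low-confidence majority vote. This is one plausible reading of the UALRE rule rather than the authors' implementation; the function names, the use of the majority's mean confidence, and the `low` and `high` thresholds are all assumptions.

```python
from collections import Counter


def split_into_blocks(day_values, block_hours, samples_per_hour):
    """Cut one day of evenly sampled readings into consecutive N-hour blocks."""
    size = block_hours * samples_per_hour
    return [day_values[i:i + size] for i in range(0, len(day_values), size)]


def uncertainty_aware_vote(predictions, low=0.6, high=0.9):
    """Majority vote that a single confident model can override.

    predictions: list of (label, confidence) pairs, one per ensemble member.
    low and high are illustrative thresholds, not values from the paper.
    """
    counts = Counter(label for label, _ in predictions)
    majority = counts.most_common(1)[0][0]
    majority_conf = [conf for label, conf in predictions if label == majority]
    mean_conf = sum(majority_conf) / len(majority_conf)
    if mean_conf < low:
        best_label, best_conf = max(predictions, key=lambda p: p[1])
        if best_conf >= high:
            return best_label
    return majority
```

With hourly samples, `split_into_blocks(values, 4, 1)` yields six blocks per day, matching the 4-hour granularity the ablations favour. In the vote, a hesitant majority such as `[(1, 0.55), (1, 0.5)]` is overridden by a single member predicting `(0, 0.95)`, while a confident majority is left untouched.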

[82] Contextualized Multimodal Lifelong Person Re-Identification in Hybrid Clothing States

Robert Long,Rongxin Jiang,Mingrui Yan

Main category: cs.CV

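TL;DR: This paper proposes CMLReID, a framework for lifelong person re-identification in hybrid clothing states. Using Context-Aware Semantic Prompt (CASP) and Adaptive Knowledge Fusion and Projection (AKFP), it addresses the problems of clothing changes and continual learning.

Details Motivation: Real-world surveillance systems must handle both clothing changes (CCReID) and continual learning (LReID), but existing methods usually target a single setting or treat the problems separately. The paper introduces the hybrid LReID-Hybrid task, which aims for a continual-learning model that supports both same-cloth and clothing-change settings.

Contribution: 1) Proposes the CMLReID framework, which combines Context-Aware Semantic Prompt (CASP) with Adaptive Knowledge Fusion and Projection (AKFP); 2) Addresses mismatched features and forgetting under clothing changes and continual learning; 3) Validates the method on multiple datasets.

Method: 1) CASP generates adaptive prompts and incorporates context to align multi-grained visual cues with the semantic text space; 2) AKFP produces robust same-cloth and clothing-change prototypes through a dual-path learner and a Clothing-State-Aware Projection Loss.

Result: Experiments show that CMLReID outperforms existing methods on multiple datasets, with strong robustness and generalization.

Insight: Combining CLIP's multimodal capability with context-aware mechanisms can effectively address clothing changes and continual learning in person re-identification.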

Abstract: Person Re-Identification (ReID) faces several challenges in real-world surveillance systems due to clothing changes (CCReID) and the need for continual learning (LReID). Existing methods either develop models for a single application, mostly the same-cloth (SC) setting, or treat CCReID as a separate sub-problem. In this work, we introduce the LReID-Hybrid task, with the goal of developing a model that handles both SC and CC while learning in a continual setting. Mismatched representations and forgetting from one task to the next are significant issues; we address them with CMLReID, a CLIP-based framework composed of two novel components: (1) Context-Aware Semantic Prompt (CASP), which generates adaptive prompts and incorporates context to align rich multi-grained visual cues with the semantic text space; and (2) Adaptive Knowledge Fusion and Projection (AKFP), which produces robust SC/CC prototypes through a dual-path learner that aligns features with our Clothing-State-Aware Projection Loss. Experiments on a wide range of datasets show that CMLReID outperforms all state-of-the-art methods, with strong robustness and generalization despite clothing variations and a sophisticated sequential learning process.

[83] Cross-Domain Attribute Alignment with CLIP: A Rehearsal-Free Approach for Class-Incremental Unsupervised Domain Adaptation

Kerun Mi,Guoliang Kang,Guangyu Li,Lin Zhao,Tao Zhou,Chen Gong

Main category: cs.CV

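TL;DR: This paper proposes a rehearsal-free, CLIP-based method for class-incremental unsupervised domain adaptation (CI-UDA). By mining and preserving domain-invariant, class-agnostic “attribute” knowledge, it avoids both the stored past samples and the asymmetric alignment of conventional methods.

Details Motivation: CI-UDA adapts a model from a labeled source domain to an unlabeled target domain in which the sets of target classes at different time steps are disjoint. Conventional methods store past samples to mitigate catastrophic forgetting and align only the shared classes, which increases memory use and causes knowledge forgetting. The paper aims to address these problems.

Contribution: 1. Proposes a rehearsal-free method that removes the need to store past samples; 2. Introduces a CLIP-based, class-agnostic “attribute” representation for cross-domain alignment; 3. Designs an alignment mechanism based on visual attention consistency and prediction consistency.

Method: 1. Uses CLIP to extract class-agnostic attributes; 2. Represents each attribute as a “key-value” pair, where the key is a visual prototype and the value is a textual prompt; 3. Maintains two attribute dictionaries (source and target domain); 4. Aligns attributes across domains through visual attention consistency and prediction consistency.

Result: On three CI-UDA benchmarks, the method outperforms existing methods and effectively alleviates catastrophic forgetting.

Insight: Mining domain-invariant, class-agnostic attribute knowledge can address domain shift and catastrophic forgetting together, without relying on past samples or complex alignment mechanisms.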

Abstract: Class-Incremental Unsupervised Domain Adaptation (CI-UDA) aims to adapt a model from a labeled source domain to an unlabeled target domain, where the sets of potential target classes appearing at different time steps are disjoint and are subsets of the source classes. The key to solving this problem lies in avoiding catastrophic forgetting of knowledge about previous target classes while continuously mitigating the domain shift. Most previous works cumbersomely combine two technical components. On one hand, they need to store and utilize rehearsal target samples from previous time steps to avoid catastrophic forgetting; on the other hand, they perform alignment only between classes shared across domains at each time step. Consequently, the memory will continuously increase and the asymmetric alignment may inevitably result in knowledge forgetting. In this paper, we propose to mine and preserve domain-invariant and class-agnostic knowledge to facilitate the CI-UDA task. Specifically, using CLIP, we extract class-agnostic properties, which we call “attributes”. In our framework, we learn a “key-value” pair to represent an attribute, where the key corresponds to the visual prototype and the value is the textual prompt. We maintain two attribute dictionaries, each corresponding to a different domain. Then we perform attribute alignment across domains to mitigate the domain shift, by encouraging visual attention consistency and prediction consistency. Through attribute modeling and cross-domain alignment, we effectively reduce catastrophic knowledge forgetting while mitigating the domain shift, in a rehearsal-free way. Experiments on three CI-UDA benchmarks demonstrate that our method outperforms previous state-of-the-art methods and effectively alleviates catastrophic forgetting. Code is available at https://github.com/RyunMi/VisTA.

[84] Synthetic Dataset Evaluation Based on Generalized Cross Validation

Zhihang Song,Dingyi Yao,Ruibo Ming,Lihui Peng,Danya Yao,Yi Zhang

Main category: cs.CV

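TL;DR: The paper proposes a framework for evaluating synthetic datasets based on generalized cross-validation and domain transfer learning, introducing two key metrics that quantify the simulation quality and transfer quality of synthetic data. Experiments validate the framework.

Details Motivation: With the rapid development of synthetic data generation, evaluating synthetic data quality has become a key research direction, but a generally accepted evaluation framework is still lacking.

Contribution: 1. Proposes an evaluation framework combining generalized cross-validation with domain transfer learning; 2. Introduces two key metrics quantifying simulation quality and transfer quality; 3. Validates the framework’s applicability experimentally.

Method: 1. Trains task-specific models (e.g., YOLOv5s) on synthetic and real data; 2. Builds and normalizes a cross-performance matrix; 3. Quantifies domain transferability through the Generalized Cross-Validation (GCV) Matrix.

Result: Experiments on the Virtual KITTI dataset validate the framework and show its strengths in assessing synthetic data fidelity.

Insight: The framework offers a scalable, quantifiable solution that overcomes the limitations of traditional approaches and guides the optimization of synthetic data.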

Abstract: With the rapid advancement of synthetic dataset generation techniques, evaluating the quality of synthetic data has become a critical research focus. Robust evaluation not only drives innovations in data generation methods but also guides researchers in optimizing the utilization of these synthetic resources. However, current evaluation studies for synthetic datasets remain limited, lacking a universally accepted standard framework. To address this, this paper proposes a novel evaluation framework integrating generalized cross-validation experiments and domain transfer learning principles, enabling generalizable and comparable assessments of synthetic dataset quality. The framework involves training task-specific models (e.g., YOLOv5s) on both synthetic datasets and multiple real-world benchmarks (e.g., KITTI, BDD100K), forming a cross-performance matrix. Following normalization, a Generalized Cross-Validation (GCV) Matrix is constructed to quantify domain transferability. The framework introduces two key metrics. One measures the simulation quality by quantifying the similarity between synthetic data and real-world datasets, while another evaluates the transfer quality by assessing the diversity and coverage of synthetic data across various real-world scenarios. Experimental validation on Virtual KITTI demonstrates the effectiveness of our proposed framework and metrics in assessing synthetic data fidelity. This scalable and quantifiable evaluation solution overcomes traditional limitations, providing a principled approach to guide synthetic dataset optimization in artificial intelligence research.
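
To make the cross-performance matrix concrete, here is a rough Python sketch. The abstract does not give the normalization or the definitions of the two metrics, so the choices below are assumptions: each column is divided by the in-domain score on the diagonal, and `mean_transfer` is a hypothetical summary rather than either of the paper's metrics. The scores are made-up illustrative numbers, not results from the paper.

```python
def gcv_matrix(perf):
    """perf[i][j]: score of a model trained on dataset i, tested on dataset j."""
    n = len(perf)
    # Assumed normalization: divide each column by the in-domain score
    # perf[j][j], so every diagonal entry becomes 1.0.
    return [[perf[i][j] / perf[j][j] for j in range(n)] for i in range(n)]


def mean_transfer(gcv, source, targets):
    """Average normalized score of the `source` model on the `targets` sets."""
    return sum(gcv[source][t] for t in targets) / len(targets)


# Illustrative scores only: index 0 is a synthetic set, 1 and 2 are real sets.
perf = [
    [0.80, 0.40, 0.20],
    [0.60, 0.50, 0.25],
    [0.40, 0.30, 0.50],
]
gcv = gcv_matrix(perf)
synthetic_to_real = mean_transfer(gcv, source=0, targets=[1, 2])
```

Normalizing by the diagonal expresses each transfer score as a fraction of what a model trained in-domain achieves, which makes rows comparable across test sets of different difficulty.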

[85] ROSGS: Relightable Outdoor Scenes With Gaussian Splatting

Lianjun Liao,Chunhui Zhang,Tong Wu,Henglei Lv,Bailin Deng,Lin Gao

Main category: cs.CV

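TL;DR: ROSGS is a two-stage pipeline based on the Gaussian Splatting representation for efficiently reconstructing relightable outdoor scenes. It combines monocular normal priors with a hybrid lighting model to achieve accurate and efficient relighting.

Details Motivation: Outdoor images typically contain unbounded scenes and varying lighting conditions; existing NeRF- or 3DGS-based methods suffer from high computational overhead and the limits of low-frequency lighting representations.

Contribution: 1. Proposes the two-stage ROSGS pipeline, built on 2DGS and the Gaussian Splatting representation; 2. Introduces a hybrid lighting model that handles the directional, high-frequency sunlight and the low-frequency skylight separately; 3. Achieves efficient and accurate relighting of outdoor scenes.

Method: 1. The first stage reconstructs geometry with 2DGS using monocular normal priors; 2. The second stage decomposes texture and lighting with a hybrid lighting model that combines a spherical Gaussian function with Spherical Harmonic coefficients.

Result: ROSGS achieves state-of-the-art performance in both quantitative and qualitative comparisons, with higher relighting accuracy and rendering efficiency.

Insight: Combining geometric priors with a hybrid lighting model can efficiently solve lighting decomposition for outdoor scenes while avoiding the heavy computational cost of neural networks.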

Abstract: Image data captured outdoors often exhibit unbounded scenes and unconstrained, varying lighting conditions, making it challenging to decompose them into geometry, reflectance, and illumination. Recent works have focused on achieving this decomposition using Neural Radiance Fields (NeRF) or the 3D Gaussian Splatting (3DGS) representation but remain hindered by two key limitations: the high computational overhead associated with neural networks of NeRF and the use of low-frequency lighting representations, which often result in inefficient rendering and suboptimal relighting accuracy. We propose ROSGS, a two-stage pipeline designed to efficiently reconstruct relightable outdoor scenes using the Gaussian Splatting representation. By leveraging monocular normal priors, ROSGS first reconstructs the scene’s geometry with the compact 2D Gaussian Splatting (2DGS) representation, providing an efficient and accurate geometric foundation. Building upon this reconstructed geometry, ROSGS then decomposes the scene’s texture and lighting through a hybrid lighting model. This model effectively represents typical outdoor lighting by employing a spherical Gaussian function to capture the directional, high-frequency components of sunlight, while learning a radiance transfer function via Spherical Harmonic coefficients to model the remaining low-frequency skylight comprehensively. Both quantitative metrics and qualitative comparisons demonstrate that ROSGS achieves state-of-the-art performance in relighting outdoor scenes and highlight its ability to deliver superior relighting accuracy and rendering efficiency.
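
A spherical Gaussian lobe is usually written as G(v) = a · exp(λ(μ·v − 1)). The Python sketch below evaluates that standard form to show why it suits a narrow, directional source such as the sun. Whether ROSGS uses exactly this parametrization is an assumption, and the function and parameter names are illustrative.

```python
import math


def spherical_gaussian(direction, lobe_axis, sharpness, amplitude):
    """Evaluate a * exp(sharpness * (dot(lobe_axis, direction) - 1)).

    direction and lobe_axis are assumed to be unit 3-vectors.
    """
    dot = sum(d * m for d, m in zip(direction, lobe_axis))
    return amplitude * math.exp(sharpness * (dot - 1.0))


sun_axis = (0.0, 0.0, 1.0)
on_axis = spherical_gaussian((0.0, 0.0, 1.0), sun_axis, sharpness=50.0, amplitude=3.0)
off_axis = spherical_gaussian((1.0, 0.0, 0.0), sun_axis, sharpness=50.0, amplitude=3.0)
```

The lobe peaks at the amplitude when the query direction matches the axis and decays rapidly away from it; a larger sharpness gives a tighter lobe, which is the high-frequency behavior that low-order spherical harmonics represent poorly.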

[86] Mitigating Hallucinations in Large Vision-Language Models by Self-Injecting Hallucinations

Yifan Lu,Ziqi Zhang,Chunfeng Yuan,Jun Gao,Congxuan Zhang,Xiaojuan Qi,Bing Li,Weiming Hu

Main category: cs.CV

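TL;DR: The paper proposes APASI, a new method that mitigates hallucinations in large vision-language models by self-injecting hallucinations. It requires no external dependencies and performs well on multiple benchmarks.

Details Motivation: Existing large vision-language models (LVLMs) suffer from serious hallucination problems, where generated responses are inconsistent with the visual input. Existing mitigation methods rely on external annotations or auxiliary models, which increases cost and limits sustained improvement. The authors therefore propose a self-injection method with no external dependencies.

Contribution: The main contribution is APASI, which generates preference pairs by self-injecting hallucinations and combines curriculum learning with an iterative alignment training strategy. It mitigates hallucinations effectively, with performance comparable or even superior to methods that depend on external resources.

Method: APASI uses the target LVLM to self-inject hallucinations into a generated response, producing a response pair with different preference levels. This simulates real hallucination patterns and, combined with iterative alignment training and curriculum learning, progressively improves the model.

Result: Experiments show that APASI effectively mitigates hallucinations on six benchmarks, with performance comparable or even superior to methods that depend on external resources, demonstrating its effectiveness and generalization.

Insight: The key insight is that self-injected hallucinations yield high-quality preference pairs without external resources, and that curriculum learning with iterative training enables continued improvement of the model.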

Abstract: Large Vision-Language Models (LVLMs) suffer from serious hallucination problems, where the model-generated responses are inconsistent with the visual inputs. Existing hallucination mitigation methods are mainly based on preference alignment and require external human annotations or auxiliary models for preference data collection, which increase costs and limit sustainable improvement. To tackle these challenges, we propose Autonomous Preference Alignment via Self-Injection (APASI), a novel and generalizable method that mitigates hallucinations without external dependencies. APASI leverages the target LVLM to self-inject hallucinations into a generated response, creating a pair of responses with varying preference levels. During the self-injection process, the dis-preferred response is generated based on three key observations of hallucinations, ensuring it simulates real hallucination patterns. This fidelity offers an accurate learning signal for hallucination mitigation. Moreover, APASI incorporates an iterative alignment training strategy combined with curriculum learning to periodically update the preference data with increasing challenge, enabling stable and continuous enhancement of the LVLM. Extensive experiments across six benchmarks show that APASI not only effectively mitigates hallucinations for three baseline models but also achieves comparable or even superior performance to alignment-based methods with external dependency, thereby demonstrating its effectiveness and generalization capability. The code is available at https://github.com/davidluciolu/APASI.

[87] Leveraging Geometric Priors for Unaligned Scene Change Detection

Ziling Liu,Ziwei Chen,Mingqi Gao,Jinyu Yang,Feng Zheng

Main category: cs.CV

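TL;DR: The paper proposes a training-free framework that leverages geometric priors to address the core challenges of unaligned scene change detection: reliable identification of visual overlaps, robust correspondence establishment, and explicit occlusion detection. By combining priors from a geometric foundation model with the strong representations of a visual foundation model, it achieves better performance under viewpoint misalignment.

Details Motivation: In unaligned scene change detection (SCD), viewpoint changes cause 2D visual matching to fail. Existing methods rely only on 2D visual cues and lack geometric reasoning, which limits the generalization of multi-view knowledge and makes it hard to handle overlaps and occlusions reliably.

Contribution: 1. First to leverage geometric priors from a Geometric Foundation Model to address the core problems of unaligned SCD; 2. Proposes a training-free framework integrating geometric priors with visual foundation model representations; 3. Validates robustness and superior performance on multiple datasets.

Method: Extracts geometric priors with a geometric foundation model and combines them with visual foundation model representations in a training-free framework that is applied directly to change detection in unaligned scenes, with no additional training.

Result: Experiments on the PSCD, ChangeSim, and PASLCD datasets show superior and robust performance.

Insight: Introducing geometric priors explicitly addresses the failure of visual matching caused by viewpoint changes in unaligned SCD, and suggests a direction for similar tasks.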

Abstract: Unaligned Scene Change Detection aims to detect scene changes between image pairs captured at different times without assuming viewpoint alignment. To handle viewpoint variations, current methods rely solely on 2D visual cues to establish cross-image correspondence to assist change detection. However, large viewpoint changes can alter visual observations, causing appearance-based matching to drift or fail. Additionally, supervision limited to 2D change masks from small-scale SCD datasets restricts the learning of generalizable multi-view knowledge, making it difficult to reliably identify visual overlaps and handle occlusions. This lack of explicit geometric reasoning represents a critical yet overlooked limitation. In this work, we are the first to leverage geometric priors from a Geometric Foundation Model to address the core challenges of unaligned SCD, including reliable identification of visual overlaps, robust correspondence establishment, and explicit occlusion detection. Building on these priors, we propose a training-free framework that integrates them with the powerful representations of a visual foundation model to enable reliable change detection under viewpoint misalignment. Through extensive evaluation on the PSCD, ChangeSim, and PASLCD datasets, we demonstrate that our approach achieves superior and robust performance. Our code will be released at https://github.com/ZilingLiu/GeoSCD.

[88] Toward Next-generation Medical Vision Backbones: Modeling Finer-grained Long-range Visual Dependency

Mingyuan Meng

Main category: cs.CV

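TL;DR: This doctoral research explores the importance of fine-grained long-range visual dependency modeling in medical image computing, proposes new transformer and MLP models, and finds that MLPs outperform transformers and CNNs on high-resolution medical images.

Details Motivation: Medical image computing needs to capture both global long-range context and subtle local visual features, but conventional CNNs are limited by locality, and transformers are computationally expensive and struggle with high-resolution features. MLPs are efficient but have not been widely studied for medical images.

Contribution: 1) Applies transformers in new ways to pixel-wise and image-wise medical vision tasks; 2) Develops MLP-based visual models for fine-grained long-range dependency modeling in medical images; 3) Shows the advantage of MLPs on high-resolution medical features.

Method: 1) Uses transformers for medical imaging tasks; 2) Proposes MLP-based models focused on fine-grained long-range dependency modeling in high-resolution medical images; 3) Validates the methods experimentally.

Result: Experiments show that long-range dependency modeling is critical for medical image computing and that MLPs outperform transformers and CNNs at modeling fine-grained dependencies among high-resolution medical features, consistently improving performance across a range of medical vision tasks.

Insight: MLPs offer both efficiency and performance on medical images and can serve as a paradigm for next-generation medical vision backbones, particularly for high-resolution tasks that need rich anatomical/pathological detail.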

Abstract: Medical Image Computing (MIC) is a broad research topic covering both pixel-wise (e.g., segmentation, registration) and image-wise (e.g., classification, regression) vision tasks. Effective analysis demands models that capture both global long-range context and local subtle visual characteristics, necessitating fine-grained long-range visual dependency modeling. Compared to Convolutional Neural Networks (CNNs) that are limited by intrinsic locality, transformers excel at long-range modeling; however, due to the high computational loads of self-attention, transformers typically cannot process high-resolution features (e.g., full-scale image features before downsampling or patch embedding) and thus face difficulties in modeling fine-grained dependency among subtle medical image details. Concurrently, Multi-layer Perceptron (MLP)-based visual models are recognized as computation/memory-efficient alternatives in modeling long-range visual dependency but have yet to be widely investigated in the MIC community. This doctoral research advances deep learning-based MIC by investigating effective long-range visual dependency modeling. It first presents innovative use of transformers for both pixel- and image-wise medical vision tasks. The focus then shifts to MLPs, pioneeringly developing MLP-based visual models to capture fine-grained long-range visual dependency in medical images. Extensive experiments confirm the critical role of long-range dependency modeling in MIC and reveal a key finding: MLPs provide feasibility in modeling finer-grained long-range dependency among higher-resolution medical features containing enriched anatomical/pathological details. This finding establishes MLPs as a superior paradigm over transformers/CNNs, consistently enhancing performance across various medical vision tasks and paving the way for next-generation medical vision backbones.

[89] Dual Band Video Thermography Near Ambient Conditions

Sriram Narayanan,Mani Ramanagopal,Srinivasa G. Narasimhan

Main category: cs.CV

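TL;DR: The paper proposes a method that uses dual-band thermal imaging to separate the reflected and emitted components of light in video under near-ambient conditions, overcoming the limitations of conventional assumptions.

Details Motivation: Under near-ambient conditions, conventional thermography studies assume that either the reflected or the emitted component is dominant or constant, whereas in practice both vary over time and are comparable in magnitude, so a new method is needed.

Contribution: Proposes the first method for separating reflected and emitted light in video using dual-band thermal imaging, together with a corresponding image formation model and estimation algorithms.

Method: Captures video with dual-band thermal cameras, derives an image formation model, and uses algorithms to separate the reflected and emitted components and to estimate surface emissivity and time-varying temperature.

Result: Quantitative evaluation uses carefully calibrated emissivities for a range of materials; qualitative results are shown on complex everyday scenes.

Insight: In near-ambient thermography, the time-varying reflected and emitted components must be modeled together; the dual-band approach provides a new tool for computer vision applications.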

Abstract: Long-wave infrared radiation captured by a thermal camera consists of two components: (a) light from the environment reflected or transmitted by a surface, and (b) light emitted by the surface after undergoing heat transport through the object and exchanging heat with the surrounding environment. Separating these components is essential for understanding object properties such as emissivity, temperature, reflectance and shape. Previous thermography studies often assume that only one component is dominant (e.g., in welding) or that the second component is constant and can be subtracted. However, in near-ambient conditions, which are most relevant to computer vision applications, both components are typically comparable in magnitude and vary over time. We introduce the first method that separates reflected and emitted components of light in videos captured by two thermal cameras with different spectral sensitivities. We derive a dual-band thermal image formation model and develop algorithms to estimate the surface’s emissivity and its time-varying temperature while isolating a dynamic background. We quantitatively evaluate our approach using carefully calibrated emissivities for a range of materials and show qualitative results on complex everyday scenes, such as a glass filled with hot liquid and people moving in the background.

[90] Beyond Instance Consistency: Investigating View Diversity in Self-supervised Learning

Huaiyuan Qin,Muli Yang,Siyuan Hu,Peng Hu,Yu Zhang,Chen Gong,Hongyuan Zhu

Main category: cs.CV

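TL;DR: The paper examines the limits of the instance consistency assumption in self-supervised learning (SSL), shows that SSL can still learn effective representations when positive pairs lack strict instance consistency, and finds that view diversity has an important effect on performance.

Details Motivation: Conventional SSL methods assume that different views of the same image can serve as positive pairs, but this assumption breaks down on non-iconic data, where different views may contain different objects or semantic information. The study evaluates how effective SSL is when instance consistency does not hold.

Contribution: 1. Shows that SSL can learn meaningful representations without strict instance consistency; 2. Finds an optimal range for the effect of view diversity on downstream performance; 3. Adopts the Earth Mover’s Distance (EMD) to quantify mutual information between views.

Method: Uses ablation studies to examine SSL when instance consistency does not hold, analyzes how different levels of view diversity (e.g., zero overlap or smaller crop scales) affect performance, and uses EMD as an estimator of mutual information.

Result: Moderate view diversity improves performance on classification and dense prediction tasks, but excessive diversity reduces it. SSL learns best at moderate EMD values.

Insight: View diversity is an important consideration in SSL design; too much or too little diversity hurts performance, so framework design needs to find a balance.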

Abstract: Self-supervised learning (SSL) conventionally relies on the instance consistency paradigm, assuming that different views of the same image can be treated as positive pairs. However, this assumption breaks down for non-iconic data, where different views may contain distinct objects or semantic information. In this paper, we investigate the effectiveness of SSL when instance consistency is not guaranteed. Through extensive ablation studies, we demonstrate that SSL can still learn meaningful representations even when positive pairs lack strict instance consistency. Furthermore, our analysis reveals that increasing view diversity, by enforcing zero overlap or using smaller crop scales, can enhance downstream performance on classification and dense prediction tasks. However, excessive diversity is found to reduce effectiveness, suggesting an optimal range for view diversity. To quantify this, we adopt the Earth Mover’s Distance (EMD) as an estimator to measure mutual information between views, finding that moderate EMD values correlate with improved SSL learning, providing insights for future SSL framework design. We validate our findings across a range of settings, highlighting their robustness and applicability on diverse data sources.
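
As a small illustration of the Earth Mover's Distance, the Python sketch below computes it for two 1D histograms with unit bin spacing, where it reduces to the summed absolute difference of the cumulative distributions. The abstract does not say which view representation the EMD is applied to, so the 1D histogram setting and the name `emd_1d` are assumptions made only for illustration.

```python
from itertools import accumulate


def emd_1d(hist_a, hist_b):
    """EMD between two 1D histograms over the same bins (unit bin spacing)."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    # Normalize each histogram to unit mass and build its running sum (CDF).
    cdf_a = accumulate(h / total_a for h in hist_a)
    cdf_b = accumulate(h / total_b for h in hist_b)
    # In 1D the optimal transport cost is the area between the two CDFs.
    return sum(abs(a - b) for a, b in zip(cdf_a, cdf_b))


same = emd_1d([1, 2, 1], [1, 2, 1])
near = emd_1d([1, 0, 0], [0, 1, 0])
far = emd_1d([1, 0, 0], [0, 0, 1])
```

Identical histograms give a distance of zero, and moving all mass one bin costs half as much as moving it two bins, so the value grows with how far apart the two distributions are.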

[91] Promoting Shape Bias in CNNs: Frequency-Based and Contrastive Regularization for Corruption Robustness

Robin Narsingh Ranabhat,Longwei Wang,Amit Kumar Patel,KC Santosh

Main category: cs.CV

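TL;DR: The paper introduces frequency-based and contrastive regularization strategies that push CNNs to rely on shape rather than texture features, improving robustness to corruptions.

Details Motivation: CNNs for image classification rely too heavily on local texture features rather than global shape, which leaves them insufficiently robust to common corruptions. Human vision relies more on shape, so the authors use regularization to shift the CNN’s bias.

Contribution: Proposes two regularization methods: (1) an auxiliary loss that enforces feature consistency between original and low-frequency inputs, reducing dependence on high-frequency textures; (2) supervised contrastive learning that structures the feature space around shape-relevant features.

Method: 1. Frequency-based regularization: an auxiliary consistency loss between low-frequency filtered inputs and original inputs suppresses dependence on high-frequency textures. 2. Supervised contrastive learning: a contrastive loss strengthens class-consistent, shape-relevant features.

Result: On the CIFAR-10-C benchmark, both methods improve corruption robustness without degrading clean accuracy.

Insight: Loss-level regularization can effectively steer CNNs toward more shape-aware features, closer to how the human visual system works.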

Abstract: Convolutional Neural Networks (CNNs) excel at image classification but remain vulnerable to common corruptions that humans handle with ease. A key reason for this fragility is their reliance on local texture cues rather than global object shapes – a stark contrast to human perception. To address this, we propose two complementary regularization strategies designed to encourage shape-biased representations and enhance robustness. The first introduces an auxiliary loss that enforces feature consistency between original and low-frequency filtered inputs, discouraging dependence on high-frequency textures. The second incorporates supervised contrastive learning to structure the feature space around class-consistent, shape-relevant representations. Evaluated on the CIFAR-10-C benchmark, both methods improve corruption robustness without degrading clean accuracy. Our results suggest that loss-level regularization can effectively steer CNNs toward more shape-aware, resilient representations.
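
The first strategy, feature consistency between original and low-frequency filtered inputs, can be sketched in plain Python. The filter and feature extractor here are stand-ins and not the paper's: a box blur is assumed as the low-pass filter, and `mean_feature` and `texture_feature` are toy, hypothetical feature functions used in place of CNN features.

```python
def box_blur(image, radius=1):
    """Low-pass filter: average each pixel with its neighbors (assumed filter)."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [image[j][i] for j in ys for i in xs]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def consistency_loss(features, image, radius=1):
    """Mean squared difference between features of the image and its blur."""
    f_orig = features(image)
    f_low = features(box_blur(image, radius))
    return sum((a - b) ** 2 for a, b in zip(f_orig, f_low)) / len(f_orig)


def mean_feature(image):
    """Toy low-frequency feature: average intensity."""
    return [sum(map(sum, image)) / (len(image) * len(image[0]))]


def texture_feature(image):
    """Toy high-frequency feature: mean absolute horizontal difference."""
    diffs = [abs(r[i + 1] - r[i]) for r in image for i in range(len(r) - 1)]
    return [sum(diffs) / len(diffs)]


checkerboard = [[(x + y) % 2 for x in range(4)] for y in range(4)]
loss_shape_like = consistency_loss(mean_feature, checkerboard)
loss_texture_like = consistency_loss(texture_feature, checkerboard)
```

On a checkerboard the texture-like feature changes sharply after blurring and incurs a large loss, while the average-intensity feature is essentially unchanged, which is the pressure the auxiliary loss is meant to apply against high-frequency cues.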

[92] GLaVE-Cap: Global-Local Aligned Video Captioning with Vision Expert Integration

Wan Xu,Feng Zhu,Yihan Zeng,Yuanfan Guo,Ming Liu,Hang Xu,Wangmeng Zuo

Main category: cs.CV

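TL;DR: GLaVE-Cap is a global-local aligned video captioning framework that addresses the lack of detail and the contextual inconsistency of the existing local-to-global paradigm, using its TrackFusion and CaptionBridge modules to support effective video understanding.

Details Motivation: The existing local-to-global video captioning paradigm produces captions that lack detail and are contextually inconsistent; a more effective framework is needed to generate comprehensive, coherent video descriptions.

Contribution: 1. Proposes the GLaVE-Cap framework with the TrackFusion and CaptionBridge modules; 2. Builds the GLaVE-Bench benchmark and the GLaVE-1.2M dataset; 3. Achieves state-of-the-art performance on multiple benchmarks.

Method: TrackFusion uses vision experts to obtain cross-frame visual prompts and a dual-stream structure to generate local captions; CaptionBridge uses global context to guide local captioning and summarizes local captions into a coherent global caption.

Result: GLaVE-Cap performs best on four benchmarks; ablations validate the modules and the contribution of the dataset.

Insight: Global-local interaction and vision expert integration are essential for generating detailed, consistent video captions.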

Abstract: Video detailed captioning aims to generate comprehensive video descriptions to facilitate video understanding. Recently, most efforts in the video detailed captioning community have been made towards a local-to-global paradigm, which first generates local captions from video clips and then summarizes them into a global caption. However, we find this paradigm leads to less detailed and contextually inconsistent captions, which can be attributed to (1) no mechanism to ensure fine-grained captions, and (2) weak interaction between local and global captions. To remedy the above two issues, we propose GLaVE-Cap, a Global-Local aligned framework with Vision Expert integration for Captioning, which consists of two core modules: TrackFusion enables comprehensive local caption generation, by leveraging vision experts to acquire cross-frame visual prompts, coupled with a dual-stream structure; while CaptionBridge establishes a local-global interaction, by using global context to guide local captioning, and adaptively summarizing local captions into a coherent global caption. Besides, we construct GLaVE-Bench, a comprehensive video captioning benchmark featuring 5x more queries per video than existing benchmarks, covering diverse visual dimensions to facilitate reliable evaluation. We further provide a training dataset GLaVE-1.2M containing 16K high-quality fine-grained video captions and 1.2M related question-answer pairs. Extensive experiments on four benchmarks show that our GLaVE-Cap achieves state-of-the-art performance. Besides, the ablation studies and student model analyses further validate the effectiveness of the proposed modules and the contribution of GLaVE-1.2M to the video understanding community. The source code, model weights, benchmark, and dataset will be open-sourced.

[93] No Modality Left Behind: Dynamic Model Generation for Incomplete Medical Data

Christoph Fürböck,Paul Weiser,Branko Mitic,Philipp Seeböck,Thomas Helbich,Georg Langs

Main category: cs.CV

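TL;DR: The paper proposes a hypernetwork-based method that dynamically generates models to handle missing modalities in multi-modal medical data. It supports training and inference under any combination of modalities and improves accuracy and robustness.

Details Motivation: In real clinical settings, multi-modal medical data are often partially missing. Existing methods either discard incomplete samples or require imputation, which limits robustness and generalizability.

Contribution: Proposes a hypernetwork method that dynamically generates a task model adapted to the available modalities, enabling training and inference on all samples without imputing or discarding data.

Method: A hypernetwork predicts the parameters of the task model, dynamically generating models adapted to different modality combinations and allowing flexible modality configurations at training and inference time.

Result: When trained on a dataset with 25% completeness, the method outperforms existing methods, with an absolute accuracy gain of up to 8%.

Insight: Dynamic model generation can effectively handle missing modalities and offers an efficient, general solution for medical image analysis.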

Abstract: In real-world clinical environments, training and applying deep learning models on multi-modal medical imaging data often struggle with partially incomplete data. Standard approaches either discard missing samples, require imputation, or repurpose dropout learning schemes, limiting robustness and generalizability. To address this, we propose a hypernetwork-based method that dynamically generates task-specific classification models conditioned on the set of available modalities. Instead of training a fixed model, a hypernetwork learns to predict the parameters of a task model adapted to available modalities, enabling training and inference on all samples, regardless of completeness. We compare this approach with (1) models trained only on complete data, (2) state-of-the-art channel dropout methods, and (3) an imputation-based method, using artificially incomplete datasets to systematically analyze robustness to missing modalities. Results demonstrate superior adaptability of our method, outperforming state-of-the-art approaches with an absolute increase in accuracy of up to 8% when trained on a dataset with 25% completeness (75% of training data with missing modalities). By enabling a single model to generalize across all modality configurations, our approach provides an efficient solution for real-world multi-modal medical data analysis.
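
The idea of generating a task model from the set of available modalities can be shown with a deliberately tiny Python sketch. It assumes a linear hypernetwork and a linear task model with one scalar feature per modality; the paper's architectures are not described in the abstract, and all names and numbers below are hypothetical.

```python
def generate_task_weights(mask, hyper_w, hyper_b):
    """Linear hypernetwork: map a 0/1 modality mask to task-model weights."""
    return [
        sum(w * m for w, m in zip(row, mask)) + b
        for row, b in zip(hyper_w, hyper_b)
    ]


def task_model(features, mask, hyper_w, hyper_b):
    """Score a sample with weights generated for its modality mask."""
    weights = generate_task_weights(mask, hyper_w, hyper_b)
    # Missing modalities are masked out, so only available inputs contribute.
    return sum(w * f * m for w, f, m in zip(weights, features, mask))


# Illustrative hypernetwork parameters for two modalities.
hyper_w = [[1.0, 0.0], [0.0, 1.0]]
hyper_b = [0.5, 0.5]
features = [2.0, 4.0]
score_complete = task_model(features, [1, 1], hyper_w, hyper_b)
score_missing = task_model(features, [1, 0], hyper_w, hyper_b)
```

Because the weights depend on the mask, the same hypernetwork yields a different task model for each modality configuration, so incomplete samples can be used without imputation.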

[94] On the Skinning of Gaussian Avatars

Nikolaos Zioulis,Nikolaos Kotarelas,Georgios Albanis,Spyridon Thermos,Anargyros Chatzitofis

Main category: cs.CV

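TL;DR: The paper addresses human avatar reconstruction with Gaussian radiance fields, proposing weighted rotation blending (via quaternion averaging) because linear blend skinning is not valid for the non-linear rotation properties of Gaussians.

Details Motivation: Although Gaussian splatting methods train and render faster than neural radiance fields, linear blend skinning cannot correctly handle the non-linear rotation properties of the Gaussians, which degrades animation quality.

Contribution: The main contribution is a weighted rotation blending approach based on quaternion averaging, leading to simpler vertex-based Gaussians that can be animated efficiently and integrated in any rendering engine.

Method: Weighted rotation blending with quaternion averaging replaces the invalid handling of non-linear Gaussian rotations by linear blend skinning, without training an additional model.

Result: The method produces more natural animation while keeping training and rendering efficient.

Insight: The key to handling the non-linear rotation properties of Gaussians is to exploit the mathematical properties of quaternions rather than rely on complex workarounds or predictions from additional models.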

Abstract: Radiance field-based methods have recently been used to reconstruct human avatars, showing that we can significantly downscale the systems needed for creating animated human avatars. Although this progress has been initiated by neural radiance fields, their slow rendering and backward mapping from the observation space to the canonical space have been the main challenges. With Gaussian splatting overcoming both challenges, a new family of approaches has emerged that are faster to train and render, while also straightforward to implement using forward skinning from the canonical to the observation space. However, the linear blend skinning required for the deformation of the Gaussians does not provide valid results for their non-linear rotation properties. To address such artifacts, recent works use mesh properties to rotate the non-linear Gaussian properties or train models to predict corrective offsets. Instead, we propose a weighted rotation blending approach that leverages quaternion averaging. This leads to simpler vertex-based Gaussians that can be efficiently animated and integrated in any engine by only modifying the linear blend skinning technique, and using any Gaussian rasterizer.
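
Weighted quaternion blending can be sketched as follows in Python. This uses the common sign-aligned, normalized weighted sum, which approximates a quaternion average when the rotations are reasonably close; the abstract does not specify the exact averaging scheme, so treat this as one plausible variant with a hypothetical function name.

```python
import math


def blend_rotations(quaternions, weights):
    """Blend unit quaternions (w, x, y, z) with the given skinning weights."""
    ref = quaternions[0]
    acc = [0.0, 0.0, 0.0, 0.0]
    for q, wt in zip(quaternions, weights):
        # q and -q encode the same rotation; flip q into the hemisphere
        # of the reference so opposite signs do not cancel out.
        sign = 1.0 if sum(a * b for a, b in zip(q, ref)) >= 0.0 else -1.0
        for i in range(4):
            acc[i] += sign * wt * q[i]
    # Renormalize so the result is again a valid rotation.
    norm = math.sqrt(sum(c * c for c in acc))
    return tuple(c / norm for c in acc)


identity = (1.0, 0.0, 0.0, 0.0)
half = math.sqrt(0.5)
quarter_turn_z = (half, 0.0, 0.0, half)  # 90 degrees about the z axis
blended = blend_rotations([identity, quarter_turn_z], [0.5, 0.5])
```

Blending the identity with a quarter turn about z at equal weights gives a 45 degree rotation about z, whereas linearly blending the corresponding rotation matrices would generally not produce a valid rotation.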

[95] Disentanglement of Biological and Technical Factors via Latent Space Rotation in Clinical Imaging Improves Disease Pattern Discovery

Jeanny Pan,Philipp Seeböck,Christoph Fürböck,Svitlana Pochepnia,Jennifer Straub,Lucian Beer,Helmut Prosch,Georg Langs

Main category: cs.CV

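TL;DR: The paper proposes a framework that separates biological and technical factors in medical imaging through latent space rotation, improving the discovery of disease-related patterns.

Details Motivation: Pattern recognition in medical imaging data is confounded by technical factors (such as scanner vendor and imaging parameters), which hinders representation learning and the discovery of biologically meaningful clusters.

Contribution: Proposes a post-hoc latent space rotation method that separates biological and technical factors, improving cluster consistency across domains and survival prediction.

Method: Actively learns the domain shift caused by technical factors via rotation of the latent space, disentangling biological and technical factors.

Result: On real-world clinical data, cluster consistency improves (ARI +19.01%, NMI +16.85%, Dice +12.39%), outperforming four state-of-the-art harmonization methods.

Insight: The label-free disentanglement framework can be applied to multi-center medical imaging data, supporting biomarker discovery and prognostic analysis.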

Abstract: Identifying new disease-related patterns in medical imaging data with the help of machine learning enlarges the vocabulary of recognizable findings. This supports diagnostic and prognostic assessment. However, image appearance varies not only due to biological differences, but also due to imaging technology linked to vendors and scanning or reconstruction parameters. The resulting domain shifts impede data representation learning strategies and the discovery of biologically meaningful cluster appearances. To address these challenges, we introduce an approach to actively learn the domain shift via post-hoc rotation of the data latent space, enabling disentanglement of biological and technical factors. Results on real-world heterogeneous clinical data showcase that the learned disentangled representation leads to stable clusters representing tissue types across different acquisition settings. Cluster consistency is improved by +19.01% (ARI), +16.85% (NMI), and +12.39% (Dice) compared to the entangled representation, outperforming four state-of-the-art harmonization methods. When using the clusters to quantify tissue composition on idiopathic pulmonary fibrosis patients, the learned profiles enhance Cox survival prediction. This indicates that the proposed label-free framework facilitates biomarker discovery in multi-center routine imaging data. Code is available on GitHub https://github.com/cirmuw/latent-space-rotation-disentanglement.

[96] MultiMAE for Brain MRIs: Robustness to Missing Inputs Using Multi-Modal Masked Autoencoder

Ayhan Can Erdur,Christian Beischl,Daniel Scholz,Jiazhen Pan,Benedikt Wiestler,Daniel Rueckert,Jan C Peeken

Main category: cs.CV

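TL;DR: The paper proposes a multi-modal masked autoencoder (MultiMAE) approach for handling missing input sequences in brain MRI, using multi-task reconstruction and cross-sequence reasoning to improve robustness and generalization.

Details Motivation: Missing input sequences are common in medical imaging data; conventional deep learning methods depend on complete data and struggle with this.

Contribution: 1. Proposes a MultiMAE-based multi-modal masked autoencoder paradigm; 2. Handles missing inputs through cross-sequence reasoning; 3. Outperforms the baseline on segmentation and classification tasks.

Method: Treats each MRI sequence as a separate input modality, integrates multi-modal information with a late-fusion-style transformer encoder, and uses a separate decoder per modality for multi-task reconstruction.

Result: With missing inputs, downstream segmentation and classification improve by 10.1 in Dice score and 0.46 in MCC, respectively.

Insight: Multi-modal masked autoencoder pretraining learns rich representations across modalities and improves the model’s handling of missing data.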

Abstract: Missing input sequences are common in medical imaging data, posing a challenge for deep learning models reliant on complete input data. In this work, inspired by MultiMAE [2], we develop a masked autoencoder (MAE) paradigm for multi-modal, multi-task learning in 3D medical imaging with brain MRIs. Our method treats each MRI sequence as a separate input modality, leveraging a late-fusion-style transformer encoder to integrate multi-sequence information (multi-modal) and individual decoder streams for each modality for multi-task reconstruction. This pretraining strategy guides the model to learn rich representations per modality while also equipping it to handle missing inputs through cross-sequence reasoning. The result is a flexible and generalizable encoder for brain MRIs that infers missing sequences from available inputs and can be adapted to various downstream applications. We demonstrate the performance and robustness of our method against an MAE-ViT baseline in downstream segmentation and classification tasks, showing an absolute improvement of 10.1 in overall Dice score and 0.46 in MCC over the baselines with missing input sequences. Our experiments demonstrate the strength of this pretraining strategy. The implementation is made available.

[97] Beyond Frame-wise Tracking: A Trajectory-based Paradigm for Efficient Point Cloud Tracking

BaiChen Fan,Sifan Zhou,Jian Li,Shibo Zhao,Muqing Cao,Qin Wang

Main category: cs.CV

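TL;DR: This paper proposes TrajTrack, a new trajectory-based paradigm for point cloud tracking that implicitly learns motion continuity from historical bounding box trajectories, improving the precision and efficiency of 3D single object tracking.

Details Motivation: Existing point cloud tracking methods follow either frame-wise motion estimation or a sequence-based paradigm. The former is efficient but lacks long-term temporal context; the latter is robust but computationally expensive. The paper aims to resolve this trade-off.

Contribution: Proposes TrajTrack, a lightweight trajectory-based tracking paradigm that implicitly learns motion continuity from historical trajectories without additional point cloud inputs, improving tracking performance and efficiency.

Method: TrajTrack first generates a fast, explicit motion proposal, then uses an implicit motion modeling module to predict the future trajectory, which refines the initial proposal.

Result: On the NuScenes dataset, TrajTrack improves tracking precision by 4.48% while running at 56 FPS, reaching new state-of-the-art performance.

Insight: Trajectory information captures motion continuity efficiently and improves tracking robustness in sparse or occluded scenes.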

Abstract: LiDAR-based 3D single object tracking (3D SOT) is a critical task in robotics and autonomous systems. Existing methods typically follow frame-wise motion estimation or a sequence-based paradigm. However, the two-frame methods are efficient but lack long-term temporal context, making them vulnerable in sparse or occluded scenes, while sequence-based methods that process multiple point clouds gain robustness at a significant computational cost. To resolve this dilemma, we propose a novel trajectory-based paradigm and its instantiation, TrajTrack. TrajTrack is a lightweight framework that enhances a base two-frame tracker by implicitly learning motion continuity from historical bounding box trajectories alone, without requiring additional, costly point cloud inputs. It first generates a fast, explicit motion proposal and then uses an implicit motion modeling module to predict the future trajectory, which in turn refines and corrects the initial proposal. Extensive experiments on the large-scale NuScenes benchmark show that TrajTrack achieves new state-of-the-art performance, dramatically improving tracking precision by 4.48% over a strong baseline while running at 56 FPS. Besides, we also demonstrate the strong generalizability of TrajTrack across different base trackers. Video is available at https://www.bilibili.com/video/BV1ahYgzmEWP.

[98] Modality-Aware Infrared and Visible Image Fusion with Target-Aware Supervision

Tianyao Sun,Dawei Xiang,Tianqi Ding,Xiang Fang,Yijiashun Qi,Zunduo Zhao

Main category: cs.CV

TL;DR: The paper proposes FusionNet, an end-to-end framework for infrared and visible image fusion (IVIF), featuring a modality-aware attention mechanism and target-aware loss to enhance semantic preservation and perceptual quality.

Details Motivation: The task of IVIF aims to integrate structural and textural cues from different spectral domains, but existing methods lack explicit modeling of inter-modality interaction and semantic consistency in task-critical regions.

Contribution: 1. Introduces a modality-aware attention mechanism for dynamic feature contribution adjustment. 2. Proposes a pixel-wise alpha blending module for fine-grained fusion. 3. Formulates a target-aware loss leveraging weak ROI supervision for semantic preservation.

Method: FusionNet uses a modality-aware attention mechanism and adaptive pixel-wise alpha blending, with a target-aware loss supervised by weak ROI annotations to guide semantic preservation.

Result: Experiments on the M3FD dataset show superior performance in semantic preservation, perceptual quality, and interpretability.

Insight: The framework highlights the importance of content-aware fusion and semantic consistency for downstream tasks like object detection and scene understanding.

Abstract: Infrared and visible image fusion (IVIF) is a fundamental task in multi-modal perception that aims to integrate complementary structural and textural cues from different spectral domains. In this paper, we propose FusionNet, a novel end-to-end fusion framework that explicitly models inter-modality interaction and enhances task-critical regions. FusionNet introduces a modality-aware attention mechanism that dynamically adjusts the contribution of infrared and visible features based on their discriminative capacity. To achieve fine-grained, interpretable fusion, we further incorporate a pixel-wise alpha blending module, which learns spatially-varying fusion weights in an adaptive and content-aware manner. Moreover, we formulate a target-aware loss that leverages weak ROI supervision to preserve semantic consistency in regions containing important objects (e.g., pedestrians, vehicles). Experiments on the public M3FD dataset demonstrate that FusionNet generates fused images with enhanced semantic preservation, high perceptual quality, and clear interpretability. Our framework provides a general and extensible solution for semantic-aware multi-modal image fusion, with benefits for downstream tasks such as object detection and scene understanding.
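
As a minimal sketch of the pixel-wise alpha blending idea in this abstract: assuming a per-pixel weight map `alpha` with values in [0, 1] is already available (FusionNet learns it; here it is set by hand), each fused pixel is `alpha * IR + (1 - alpha) * VIS`. The function name `alpha_blend` and the toy 2x2 inputs are illustrative and not taken from the paper.

```python
def alpha_blend(ir, vis, alpha):
    """Blend two equally sized grayscale images with a per-pixel weight map.

    ir, vis, and alpha are lists of rows; alpha values are expected in [0, 1].
    """
    fused = []
    for ir_row, vis_row, a_row in zip(ir, vis, alpha):
        fused.append([a * i + (1.0 - a) * v
                      for i, v, a in zip(ir_row, vis_row, a_row)])
    return fused


# Toy inputs: alpha = 1 copies the infrared pixel, alpha = 0 copies the
# visible pixel, and intermediate values mix the two.
ir = [[1.0, 0.0], [0.5, 0.5]]
vis = [[0.0, 1.0], [0.5, 0.0]]
alpha = [[1.0, 0.0], [0.5, 0.5]]
fused = alpha_blend(ir, vis, alpha)
```

Because the weight map has the same shape as the image, it can be inspected directly to see which modality dominated each region, which is the interpretability argument made in the abstract.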

[99] Multiple Instance Learning Framework with Masked Hard Instance Mining for Gigapixel Histopathology Image Analysis

Wenhao Tang,Sheng Huang,Heng Fang,Fengtao Zhou,Bo Liu,Qingshan Liu

Main category: cs.CV

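TL;DR: This paper proposes MHIM-MIL, a new multiple instance learning (MIL) framework that uses masked hard instance mining to improve the analysis of whole slide images (WSIs) in computational pathology, outperforming existing methods.

Details Motivation: In computational pathology, only a small fraction of the tissue in a whole slide image (WSI) is positive. Conventional MIL methods use attention to pick out salient instances but tend to neglect hard-to-classify instances, which hurts the modeling of discriminative boundaries.

Contribution: 1. Proposes the MHIM-MIL framework, which mines hard instances using a Siamese structure with a consistency constraint. 2. Introduces a momentum teacher and random masking to implicitly select diverse hard instances. 3. Designs a global recycle network to avoid losing key features. 4. Validates the method on 12 benchmarks.

Method: 1. Uses a Siamese structure in which a student-teacher consistency constraint drives hard instance mining. 2. The momentum teacher produces masks that hide salient instances so that hard instances are selected implicitly. 3. Applies large-scale random masking and a global recycle network to ensure diversity and retain features. 4. Updates the teacher from the student via an exponential moving average, which stabilizes optimization.

Result: MHIM-MIL performs well on cancer diagnosis, subtyping, and survival analysis tasks, and outperforms the latest methods on all 12 benchmarks.

Insight: 1. Hard instances are crucial for a model's discriminative ability. 2. Combining a momentum teacher with random masking enables diverse hard instance mining. 3. A student-teacher framework with a consistency constraint can effectively improve model performance.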

Abstract: Digitizing pathological images into gigapixel Whole Slide Images (WSIs) has opened new avenues for Computational Pathology (CPath). As positive tissue comprises only a small fraction of gigapixel WSIs, existing Multiple Instance Learning (MIL) methods typically focus on identifying salient instances via attention mechanisms. However, this leads to a bias towards easy-to-classify instances while neglecting challenging ones. Recent studies have shown that hard examples are crucial for accurately modeling discriminative boundaries. Applying such an idea at the instance level, we elaborate a novel MIL framework with masked hard instance mining (MHIM-MIL), which utilizes a Siamese structure with a consistency constraint to explore the hard instances. Using a class-aware instance probability, MHIM-MIL employs a momentum teacher to mask salient instances and implicitly mine hard instances for training the student model. To obtain diverse, non-redundant hard instances, we adopt large-scale random masking while utilizing a global recycle network to mitigate the risk of losing key features. Furthermore, the student updates the teacher using an exponential moving average, which identifies new hard instances for subsequent training iterations and stabilizes optimization. Experimental results on cancer diagnosis, subtyping, survival analysis tasks, and 12 benchmarks demonstrate that MHIM-MIL outperforms the latest methods in both performance and efficiency. The code is available at: https://github.com/DearCaat/MHIM-MIL.
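
Two mechanisms named in this abstract, masking the most salient instances and updating the teacher by exponential moving average, can be sketched as follows. This is a simplified illustration under stated assumptions: instance scores are given as a plain list, parameters are flat lists of floats, and the helper names `mask_salient` and `ema_update` are hypothetical. The random masking and global recycle network are omitted.

```python
def mask_salient(scores, mask_ratio):
    """Return indices of instances kept after masking the top-scoring ones.

    The highest-scoring (most salient) fraction `mask_ratio` is hidden so
    that training focuses on the remaining, harder instances.
    """
    n_mask = int(len(scores) * mask_ratio)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    masked = set(order[:n_mask])
    return [i for i in range(len(scores)) if i not in masked]


def ema_update(teacher, student, momentum):
    """Exponential moving average: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]


# Teacher scores for four instances; the two most salient are masked.
teacher_scores = [0.9, 0.1, 0.8, 0.3]
kept = mask_salient(teacher_scores, mask_ratio=0.5)

# One EMA step on toy parameters.
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.5)
```

In practice the momentum would be close to 1 so that the teacher changes slowly; 0.5 is used here only to make the arithmetic easy to follow.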

[100] SFGNet: Semantic and Frequency Guided Network for Camouflaged Object Detection

Dezhen Wang,Haixiang Zhao,Xiang Shen,Sheng Miao

Main category: cs.CV

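TL;DR: SFGNet is a network for camouflaged object detection that combines semantic prompts with multi-frequency features, using its MBFM and ISEB modules to better handle complex backgrounds and blurred boundaries.

Details Motivation: Existing camouflaged object detection methods overlook semantic differences and fine-grained frequency features; SFGNet addresses this by fusing semantic and frequency-domain information.

Contribution: 1. Proposes SFGNet, which combines semantic and frequency-domain features; 2. Designs the MBFM module to strengthen frequency-domain analysis; 3. Introduces the ISEB module to improve structural integrity.

Method: 1. Jointly models semantic prompts and frequency-domain features; 2. Uses multi-band decomposition in the MBFM module to handle complex backgrounds; 3. Uses the ISEB module to refine boundary details in the predictions.

Result: Significantly outperforms existing methods on three benchmark datasets.

Insight: Combining semantic and frequency-domain information effectively improves detection and boundary segmentation of camouflaged objects.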

Abstract: Camouflaged object detection (COD) aims to segment objects that blend into their surroundings. However, most existing studies overlook the semantic differences among textual prompts of different targets as well as fine-grained frequency features. In this work, we propose a novel Semantic and Frequency Guided Network (SFGNet), which incorporates semantic prompts and frequency-domain features to capture camouflaged objects and improve boundary perception. We further design a Multi-Band Fourier Module (MBFM) to enhance the ability of the network to handle complex backgrounds and blurred boundaries. In addition, we design an Interactive Structure Enhancement Block (ISEB) to ensure structural integrity and boundary details in the predictions. Extensive experiments conducted on three COD benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches. The core code of the model is available at the following link: https://github.com/winter794444/SFGNetICASSP2026.

[101] How Auxiliary Reasoning Unleashes GUI Grounding in VLMs

Weiming Li,Yan Shao,Jing Yang,Yujing Lu,Ling Zhong,Yuhan Wang,Manni Duan

Main category: cs.CV

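TL;DR: The paper proposes three zero-shot auxiliary reasoning methods that add explicit spatial cues (axes, grids, and labeled intersections) to the input image, unlocking the GUI grounding potential of vision-language models (VLMs) and substantially improving performance.

Details Motivation: General-purpose VLMs perform poorly on GUI grounding because they lack task-specific optimization. The authors find that VLMs have latent spatial understanding but underperform when asked to output explicit coordinates, and that existing fine-tuning approaches carry high data and annotation costs.

Contribution: Proposes three zero-shot auxiliary reasoning methods (axes, grids, and labeled intersections) that need no additional annotated data and substantially improve VLM performance on GUI grounding.

Method: Embeds explicit spatial cues (axes, grids, labeled intersections) in the input image so that VLMs can express their implicit spatial understanding, enabling zero-shot GUI grounding.

Result: Validated on four GUI grounding benchmarks across seven open-source and proprietary VLMs, with substantial performance gains.

Insight: Explicit spatial cues unlock latent VLM capabilities at low cost, offering an efficient, fine-tuning-free approach to GUI grounding.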

Abstract: Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with this task due to a lack of specific optimization. We identify a key gap in this paper: while VLMs exhibit significant latent grounding potential, as demonstrated by their performance measured by Pointing Game, they underperform when tasked with outputting explicit coordinates. To address this discrepancy, and bypass the high data and annotation costs of current fine-tuning approaches, we propose three zero-shot auxiliary reasoning methods. By providing explicit spatial cues such as axes, grids and labeled intersections as part of the input image, these methods enable VLMs to articulate their implicit spatial understanding capabilities. We evaluate these methods on four GUI grounding benchmarks across seven open-source and proprietary VLMs. The evaluation results demonstrate that the proposed methods substantially improve the performance of GUI grounding.
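
To make the "labeled intersections" cue concrete, the sketch below computes an evenly spaced grid over a screenshot and assigns each intersection a short label that a model could cite instead of raw coordinates. The labeling scheme (column letter plus row index), the function name `grid_intersections`, and the grid density are assumptions for illustration; the abstract does not specify them, and drawing the overlay onto the image is left out.

```python
def grid_intersections(width, height, cols, rows):
    """Map labels such as 'C1' to pixel coordinates of grid intersections.

    The grid splits the image into `cols` x `rows` cells, giving
    (cols + 1) * (rows + 1) intersections. Column letters run from 'A',
    so this sketch assumes cols + 1 <= 26.
    """
    points = {}
    for r in range(rows + 1):
        for c in range(cols + 1):
            label = f"{chr(ord('A') + c)}{r}"
            points[label] = (round(c * width / cols),
                             round(r * height / rows))
    return points


# A 1000x500 screenshot with a 4x2 grid.
points = grid_intersections(width=1000, height=500, cols=4, rows=2)
# A model answer such as "C1" can then be mapped back to pixels.
center = points["C1"]
```

The lookup in the last line is the step that turns a model's symbolic answer back into the explicit coordinates that GUI grounding benchmarks score.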

[102] MVQA-68K: A Multi-dimensional and Causally-annotated Dataset with Quality Interpretability for Video Assessment

Yanyun Pu,Kehan Li,Zeyi Huang,Zhijie Zhong,Kaixiang Yang

Main category: cs.CV

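TL;DR: The paper introduces MVQA-68K, a dataset of over 68,000 videos with quality annotations across seven dimensions and chain-of-thought explanations, which significantly improves performance on video quality assessment (VQA).

Details Motivation: Traditional VQA methods usually output a single numerical score and lack comprehensiveness and interpretability. The authors propose MVQA-68K, which uses multi-dimensional annotation and causal reasoning to address both.

Contribution: The main contribution is the MVQA-68K dataset, which covers seven video quality dimensions with detailed reasoning chains, significantly improves multimodal large language models on VQA, and achieves SOTA results on multiple benchmarks.

Method: The dataset contains over 68K videos annotated along seven quality dimensions (e.g., aesthetics, dynamic degree), with chain-of-thought reasoning for interpretability. Training with the reasoning process improves zero-shot generalization.

Result: Achieves the best performance on the internal test set and on public benchmarks (e.g., LSVQ-test, LIVE-VQC); training with explicit reasoning substantially improves zero-shot generalization.

Insight: Combining multi-dimensional annotation with interpretable reasoning chains is key to better VQA performance, and explicit reasoning during training helps zero-shot generalization.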

Abstract: With the rapid advancement of video generation models such as Sora, video quality assessment (VQA) is becoming increasingly crucial for selecting high-quality videos from large-scale datasets used in pre-training. Traditional VQA methods, typically producing single numerical scores, often lack comprehensiveness and interpretability. To address these challenges, we introduce MVQA-68K, a novel multi-dimensional VQA dataset comprising over 68,000 carefully annotated videos, covering seven essential quality dimensions: overall aesthetics, camera movement, dynamic degree, texture detail, composition, visual quality, and factual consistency. Each annotation includes detailed chain-of-thought reasoning to facilitate interpretability and comprehensive understanding. Extensive experiments demonstrate that MVQA-68K significantly enhances the performance of various multimodal large language models (MLLMs) on the VQA task, achieving state-of-the-art results not only on our internal test set (Fig. 1) but also on public benchmarks including LSVQ-test, LSVQ-1080p, and LIVE-VQC. Meanwhile, incorporating an explicit reasoning process during VQA training substantially boosts zero-shot generalization. Code and dataset will be available on GitHub: https://github.com/Controller01-ai/MVQA-68K

[103] Disentangling Content from Style to Overcome Shortcut Learning: A Hybrid Generative-Discriminative Learning Framework

Siming Fu,Sijun Dong,Xiaoliang Meng

Main category: cs.CV

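TL;DR: The paper proposes HyGDL, a hybrid generative-discriminative learning framework that uses content-style disentanglement to address shortcut learning in self-supervised learning, improving generalization to unseen domains.

Details Motivation: Current self-supervised learning (SSL) models tend to rely on superficial features such as texture rather than intrinsic structure. This shortcut learning limits performance on unseen domains. Existing methods try to mitigate it by aligning or separating features but do not change the underlying learning mechanism that fosters shortcut dependency.

Contribution: The main contribution is the HyGDL framework, which explicitly separates content and style features through content-style disentanglement and the Invariance Pre-training Principle, thereby avoiding shortcut learning.

Method: HyGDL operates on a single encoder. It defines style analytically, via vector projection, as the component of a representation orthogonal to its style-invariant content, and learns invariant content by systematically varying a bias (e.g., style) at the input.

Result: Experiments indicate that HyGDL mitigates shortcut learning in both generative and discriminative settings and improves performance on unseen domains.

Insight: Explicit content-style disentanglement and invariance pre-training offer a systematic way to address shortcut learning, and the approach could extend to other self-supervised learning tasks.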

Abstract: Despite the remarkable success of Self-Supervised Learning (SSL), its generalization is fundamentally hindered by Shortcut Learning, where models exploit superficial features like texture instead of intrinsic structure. We experimentally verify this flaw within the generative paradigm (e.g., MAE) and argue it is a systemic issue also affecting discriminative methods, identifying it as the root cause of their failure on unseen domains. While existing methods often tackle this at a surface level by aligning or separating domain-specific features, they fail to alter the underlying learning mechanism that fosters shortcut dependency. To address this at its core, we propose HyGDL (Hybrid Generative-Discriminative Learning Framework), a hybrid framework that achieves explicit content-style disentanglement. Our approach is guided by the Invariance Pre-training Principle: forcing a model to learn an invariant essence by systematically varying a bias (e.g., style) at the input while keeping the supervision signal constant. HyGDL operates on a single encoder and analytically defines style as the component of a representation that is orthogonal to its style-invariant content, derived via vector projection.
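
The last sentence of this abstract describes style as the part of a representation orthogonal to its content, obtained by vector projection. A minimal sketch of that decomposition follows, assuming a content direction is already given; how HyGDL obtains the style-invariant content is not stated in the abstract, so `content_dir` and the function names here are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def split_content_style(z, content_dir):
    """Decompose z into a part along content_dir and an orthogonal residual.

    content = (z . c / c . c) * c   (vector projection of z onto c)
    style   = z - content           (orthogonal to c by construction)
    """
    scale = dot(z, content_dir) / dot(content_dir, content_dir)
    content = [scale * c for c in content_dir]
    style = [a - b for a, b in zip(z, content)]
    return content, style


# Toy 2-D representation with the content direction along the first axis.
z = [3.0, 4.0]
content, style = split_content_style(z, [1.0, 0.0])
```

The two parts sum back to the original representation and have zero dot product, which is what makes the definition of style analytic rather than learned.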

[104] DUAL-VAD: Dual Benchmarks and Anomaly-Focused Sampling for Video Anomaly Detection

Seoik Jung,Taekyung Song,Joshua Jordan Daniel,JinYoung Lee,SungJun Lee

Main category: cs.CV

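TL;DR: This paper proposes the DUAL-VAD framework, which improves video anomaly detection through a softmax-based frame allocation strategy and dual benchmarks (image-level and video-level). Experiments show it outperforms uniform and random sampling baselines on UCF-Crime.

Details Motivation: Existing video anomaly detection (VAD) benchmarks cover only frame-level or only video-level tasks, which limits a holistic evaluation of model generalization. The authors therefore aim for a more comprehensive framework.

Contribution: 1. A softmax-based frame allocation strategy that prioritizes anomaly-dense segments while maintaining full-video coverage. 2. Two complementary benchmarks: image-based (frame-level reasoning) and video-based (temporal localization and abnormality scoring).

Method: 1. Samples frames with a softmax strategy that balances anomaly-dense segments against full-video coverage. 2. Designs dual benchmarks that evaluate frame-level and video-level tasks separately.

Result: Outperforms uniform and random sampling baselines on UCF-Crime, with gains at both the frame and video levels.

Insight: Prioritizing anomaly-dense segments during sampling significantly improves VAD performance; the dual-benchmark design supports a more comprehensive evaluation of model capability.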

Abstract: Video Anomaly Detection (VAD) is critical for surveillance and public safety. However, existing benchmarks are limited to either frame-level or video-level tasks, restricting a holistic view of model generalization. This work first introduces a softmax-based frame allocation strategy that prioritizes anomaly-dense segments while maintaining full-video coverage, enabling balanced sampling across temporal scales. Building on this process, we construct two complementary benchmarks. The image-based benchmark evaluates frame-level reasoning with representative frames, while the video-based benchmark extends to temporally localized segments and incorporates an abnormality scoring task. Experiments on UCF-Crime demonstrate improvements at both the frame and video levels, and ablation studies confirm clear advantages of anomaly-focused sampling over uniform and random baselines.
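
One way a softmax-based frame allocation could work is sketched below: each segment first receives a guaranteed minimum (full-video coverage), and the remaining frame budget is split in proportion to the softmax of per-segment anomaly scores. The minimum-per-segment rule, the temperature parameter, the largest-remainder rounding, and the name `allocate_frames` are all assumptions for illustration; the abstract does not give the exact formulation.

```python
import math


def allocate_frames(segment_scores, total_frames,
                    min_per_segment=1, temperature=1.0):
    """Split a frame budget across segments using softmax weights.

    Every segment gets at least `min_per_segment` frames; the rest of the
    budget follows softmax(score / temperature).
    """
    n = len(segment_scores)
    if total_frames < n * min_per_segment:
        raise ValueError("budget too small to cover every segment")

    # Numerically stable softmax over the segment scores.
    top = max(segment_scores)
    exps = [math.exp((s - top) / temperature) for s in segment_scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]

    # Share out the remaining budget, rounding down first.
    remaining = total_frames - n * min_per_segment
    raw = [w * remaining for w in weights]
    alloc = [min_per_segment + math.floor(r) for r in raw]

    # Hand leftover frames to the largest fractional parts.
    leftover = total_frames - sum(alloc)
    order = sorted(range(n), key=lambda i: raw[i] - math.floor(raw[i]),
                   reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Three segments; the last one has the highest anomaly score.
alloc = allocate_frames([0.0, 0.0, 2.0], total_frames=10)
```

Lowering the temperature concentrates more of the budget on the anomaly-dense segments, while the guaranteed minimum keeps every segment represented.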

[105] A Controllable 3D Deepfake Generation Framework with Gaussian Splatting

Wending Liu,Siyun Liang,Huy H. Nguyen,Isao Echizen

Main category: cs.CV

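TL;DR: Proposes a controllable 3D deepfake generation framework based on 3D Gaussian Splatting that supports multi-view consistent rendering, precise expression control, and seamless background integration, significantly outperforming conventional 2D methods in multi-view rendering quality and 3D consistency.

Details Motivation: Conventional 2D deepfake methods are limited in geometric consistency and generalization to novel views; combining them with 3D modeling is needed for more realistic results.

Contribution: Combines a parametric head model with dynamic Gaussian representations to achieve multi-view consistent 3D deepfakes; explicitly separates head and background Gaussians and uses 2D guidance for optimization, addressing the difficulty of editing point-based representations.

Method: Builds dynamic Gaussian representations with 3D Gaussian Splatting, separates head and background Gaussians, optimizes the facial region with pre-trained 2D guidance, and adds a repair module to improve visual consistency under extreme poses.

Result: Comparable to state-of-the-art 2D methods in identity preservation and in pose and expression consistency, and significantly better in multi-view rendering quality and 3D consistency.

Insight: 3D Gaussian Splatting brings new controllability and immersion to deepfake generation, and also reveals the threat that it could be misused for visual manipulation attacks.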

Abstract: We propose a novel 3D deepfake generation framework based on 3D Gaussian Splatting that enables realistic, identity-preserving face swapping and reenactment in a fully controllable 3D space. Compared to conventional 2D deepfake approaches that suffer from geometric inconsistencies and limited generalization to novel views, our method combines a parametric head model with dynamic Gaussian representations to support multi-view consistent rendering, precise expression control, and seamless background integration. To address editing challenges in point-based representations, we explicitly separate the head and background Gaussians and use pre-trained 2D guidance to optimize the facial region across views. We further introduce a repair module to enhance visual consistency under extreme poses and expressions. Experiments on NeRSemble and additional evaluation videos demonstrate that our method achieves comparable performance to state-of-the-art 2D approaches in identity preservation, as well as pose and expression consistency, while significantly outperforming them in multi-view rendering quality and 3D consistency. Our approach bridges the gap between 3D modeling and deepfake synthesis, enabling new directions for scene-aware, controllable, and immersive visual forgeries, and revealing the threat that the emerging 3D Gaussian Splatting technique could be used for manipulation attacks.

[106] IS-Diff: Improving Diffusion-Based Inpainting with Better Initial Seed

Yongzhe Lyu,Yu Wu,Yutian Lin,Bo Du

Main category: cs.CV

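TL;DR: IS-Diff is a training-free diffusion-based inpainting method that refines the initial noise to produce harmonious and consistent results.

Details Motivation: Using random initial noise for inpainting in vanilla diffusion models can introduce mismatched semantics, making the inpainted result inconsistent with the unmasked region.

Contribution: 1. Proposes IS-Diff, which generates harmonious initial noise from the distribution of the unmasked region; 2. Designs a dynamic selective refinement mechanism that adjusts the strength of the initialization prior.

Method: 1. Samples the initial noise from the unmasked region; 2. Dynamically detects unharmonious inpainting in the intermediate latents and adjusts the refinement strength.

Result: On CelebA-HQ, ImageNet, and Places2, IS-Diff outperforms existing methods across multiple metrics.

Insight: The distribution of the initial noise strongly affects the consistency of diffusion-based inpainting; refining it can significantly improve inpainting quality.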

Abstract: Diffusion models have shown promising results in free-form inpainting. Recent studies based on refined diffusion samplers or novel architectural designs have led to realistic results and high data consistency. However, the random initialization seed (noise) adopted in the vanilla diffusion process may introduce mismatched semantic information in masked regions, leading to biased inpainting results, e.g., low consistency and low coherence with the unmasked area. To address this issue, we propose the Initial Seed refined Diffusion Model (IS-Diff), a completely training-free approach incorporating distributional harmonious seeds to produce harmonious results. Specifically, IS-Diff employs initial seeds sampled from unmasked areas to imitate the masked data distribution, thereby setting a promising direction for the diffusion procedure. Moreover, a dynamic selective refinement mechanism is proposed to detect severe unharmonious inpaintings in intermediate latents and adjust the strength of our initialization prior dynamically. We validate our method on both standard and large-mask inpainting tasks using the CelebA-HQ, ImageNet, and Places2 datasets, demonstrating its effectiveness across all metrics compared to state-of-the-art inpainting methods.

[107] WeatherBench: A Real-World Benchmark Dataset for All-in-One Adverse Weather Image Restoration

Qiyuan Guan,Qianfeng Yang,Xiang Chen,Tianyu Song,Guiyue Jin,Jiyu Jin

Main category: cs.CV

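TL;DR: The paper presents WeatherBench, a real-world benchmark dataset for multi-weather image restoration that addresses the inconsistencies in resolution, style, and domain characteristics of existing synthetic datasets.

Details Motivation: Existing all-in-one image restoration methods are mostly trained on synthetic single-weather degradation datasets, which differ significantly in resolution, style, and domain characteristics. The resulting domain gaps hinder the development and fair evaluation of unified models. The lack of a large-scale, real-world all-in-one weather restoration dataset also holds the field back.

Contribution: Presents WeatherBench, a real-world all-in-one adverse weather image restoration benchmark with precisely aligned degraded and clean image pairs captured under multiple weather conditions, supporting supervised learning and rigorous evaluation.

Method: The dataset is built by collecting real-world image pairs under multiple weather conditions, covering rain, snow, and haze as well as diverse outdoor scenes and illumination settings.

Result: The paper runs comprehensive experiments on the dataset with a variety of task-specific, task-general, and all-in-one restoration methods to validate it.

Insight: WeatherBench provides a valuable foundation for all-in-one image restoration in real-world scenarios and addresses the domain gap of existing synthetic datasets.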

Abstract: Existing all-in-one image restoration approaches, which aim to handle multiple weather degradations within a single framework, are predominantly trained and evaluated using mixed single-weather synthetic datasets. However, these datasets often differ significantly in resolution, style, and domain characteristics, leading to substantial domain gaps that hinder the development and fair evaluation of unified models. Furthermore, the lack of a large-scale, real-world all-in-one weather restoration dataset remains a critical bottleneck in advancing this field. To address these limitations, we present a real-world all-in-one adverse weather image restoration benchmark dataset, which contains image pairs captured under various weather conditions, including rain, snow, and haze, as well as diverse outdoor scenes and illumination settings. The resulting dataset provides precisely aligned degraded and clean images, enabling supervised learning and rigorous evaluation. We conduct comprehensive experiments by benchmarking a variety of task-specific, task-general, and all-in-one restoration methods on our dataset. Our dataset offers a valuable foundation for advancing robust and practical all-in-one image restoration in real-world scenarios. The dataset has been publicly released and is available at https://github.com/guanqiyuan/WeatherBench.

[108] DTGen: Generative Diffusion-Based Few-Shot Data Augmentation for Fine-Grained Dirty Tableware Recognition

Lifei Hao,Yue Cheng,Baoqi Huang,Bing Jia,Xuandong Zhao

Main category: cs.CV

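TL;DR: DTGen is a few-shot data augmentation scheme based on generative diffusion models, designed for fine-grained dirty tableware recognition. It uses LoRA for efficient domain specialization, generates diverse dirty-tableware images, and applies CLIP-based filtering to ensure data quality.

Details Motivation: Existing tableware cleaning methods are limited by coarse-grained classification and scarce few-shot data, making it hard to meet industrial requirements. DTGen aims to address this with generative techniques.

Contribution: 1. Proposes DTGen, a diffusion-based few-shot data augmentation scheme; 2. Achieves efficient domain adaptation and diverse data generation; 3. Provides lightweight deployment strategies for embedded devices.

Method: 1. Uses LoRA for efficient domain specialization of the diffusion model; 2. Generates diverse dirty-tableware images via structured prompts; 3. Applies CLIP-based cross-modal filtering to ensure data quality.

Result: With very few real samples, DTGen synthesizes high-quality samples that significantly improve classifier performance and support fine-grained recognition.

Insight: DTGen shows the potential of generative AI for few-shot problems in industrial vision and offers a feasible path for intelligent tableware cleaning and food safety monitoring.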

Abstract: Intelligent tableware cleaning is a critical application in food safety and smart homes, but existing methods are limited by coarse-grained classification and scarcity of few-shot data, making it difficult to meet industrialization requirements. We propose DTGen, a few-shot data augmentation scheme based on generative diffusion models, specifically designed for fine-grained dirty tableware recognition. DTGen achieves efficient domain specialization through LoRA, generates diverse dirty images via structured prompts, and ensures data quality through CLIP-based cross-modal filtering. Under extremely limited real few-shot conditions, DTGen can synthesize virtually unlimited high-quality samples, significantly improving classifier performance and supporting fine-grained dirty tableware recognition. We further elaborate on lightweight deployment strategies, promising to transfer DTGen’s benefits to embedded dishwashers and integrate with cleaning programs to intelligently regulate energy consumption and detergent usage. Research results demonstrate that DTGen not only validates the value of generative AI in few-shot industrial vision but also provides a feasible deployment path for automated tableware cleaning and food safety monitoring.

[109] MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs

Feilong Chen,Yijiang Liu,Yi Huang,Hao Wang,Miren Tian,Ya-Qi Yu,Minghui Liao,Jihao Wu

Main category: cs.CV

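TL;DR: MindVL is an efficient multimodal large language model trained on Ascend NPUs. It processes images with native-resolution Vision Transformers, avoiding the information loss caused by fixed-resolution tiling, and reaches high performance through three-phase training and an optimized training framework.

Details Motivation: To address the loss of detail caused by fixed-resolution tiling in multimodal tasks, and to explore the feasibility of training large models efficiently on Ascend NPUs.

Contribution: 1. Proposes MindVL, which processes images with a native-resolution ViT; 2. Develops the Mindspeed-MLLM framework to optimize training on Ascend NPUs; 3. Improves performance through three-phase training and test-time resolution search.

Method: 1. Native-resolution ViT; 2. Three-phase training (warm-up, multitask training, supervised instruction tuning); 3. The Mindspeed-MLLM distributed training framework; 4. Test-time resolution search and model weight averaging.

Result: MindVL performs on par with Qwen2.5-VL in general multimodal understanding and document/table comprehension while using only about 1/10 of its training data. It also delivers leading performance in OCR evaluations.

Insight: Native-resolution ViTs and a multi-phase training strategy can significantly improve model performance; a distributed training framework for Ascend NPUs is key to efficient training.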

Abstract: We propose MindVL, a multimodal large language model trained on Ascend NPUs. Similar to Qwen2.5-VL, MindVL adopts native-resolution Vision Transformers, which enables it to process images at their original variable resolutions. This design avoids the degradation caused by fixed-resolution tiling while preserving fine-grained details and global layouts, which is crucial for visually dense content such as complex charts and diagrams. To ensure the smooth training of MindVL on Ascend NPUs, we develop Mindspeed-MLLM, a distributed multimodal training framework tailored for Ascend NPUs. To maintain training accuracy, we implement equivalent replacements for certain operators. MindVL undergoes a three-phase training process, namely the warm-up phase, multitask training phase, and supervised instruction tuning phase, to gradually enhance its capabilities. This process starts with basic visual and multimodal pre-training, followed by large-scale multitask training and instruction tuning. We also adopt multimodal data packaging and hybrid parallelism techniques, which significantly improve end-to-end training speed. To further boost model performance, we specifically introduce test-time resolution search and model weight averaging. Notably, despite using about 1/10 of the training data required by Qwen2.5-VL, MindVL achieves performance on par with Qwen2.5-VL in evaluations of general multimodal understanding and document/table comprehension. Beyond overall scores, MindVL also delivers leading performance in OCR assessments.
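
Model weight averaging, mentioned near the end of this abstract, can be sketched as an element-wise mean over checkpoints that share the same parameter names and shapes. The abstract does not say which checkpoints are averaged or how they are weighted, so the uniform average, the flat-list parameter format, and the name `average_weights` are assumptions for illustration.

```python
def average_weights(checkpoints):
    """Uniformly average parameters across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats; all checkpoints must share the same keys and list lengths.
    """
    n = len(checkpoints)
    averaged = {}
    for key in checkpoints[0]:
        columns = zip(*(ckpt[key] for ckpt in checkpoints))
        averaged[key] = [sum(vals) / n for vals in columns]
    return averaged


# Two toy checkpoints with one weight vector and one bias each.
ckpt_a = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 4.0], "b": [1.0]}
avg = average_weights([ckpt_a, ckpt_b])
```

With real tensors the same loop applies per parameter; only the element-wise mean would be delegated to the tensor library.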

[110] RouteExtract: A Modular Pipeline for Extracting Routes from Paper Maps

Bjoern Kremser,Yusuke Matsui

Main category: cs.CV

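TL;DR: The paper proposes RouteExtract, a modular pipeline for extracting navigable routes from paper maps that combines georeferencing, U-Net segmentation, graph construction, and refinement with a routing engine.

Details Motivation: Paper maps contain curated trails and locally relevant information missing from digital navigation applications, but they lack digital support. The paper aims to extract these routes automatically for use in GPS navigation.

Contribution: 1) An end-to-end modular pipeline; 2) A combination of U-Net segmentation with routing-engine refinement; 3) Validation of robustness across diverse map styles.

Method: Four modules: 1) georeferencing to align the map; 2) U-Net binary segmentation to extract routes; 3) construction of a route graph; 4) iterative refinement using a routing engine.

Result: Experiments show the method recovers trail networks from maps of different styles and generates GPS routes suitable for practical use.

Insight: Digitizing paper maps and extracting their routes automatically opens a new option for navigation applications, especially where digital data is lacking.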

Abstract: Paper maps remain widely used for hiking and sightseeing because they contain curated trails and locally relevant annotations that are often missing from digital navigation applications such as Google Maps. We propose a pipeline to extract navigable trails from scanned maps, enabling their use in GPS-based navigation. Our method combines georeferencing, U-Net-based binary segmentation, graph construction, and an iterative refinement procedure using a routing engine. We evaluate the full end-to-end pipeline as well as individual components, showing that the approach can robustly recover trail networks from diverse map styles and generate GPS routes suitable for practical use.
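
The graph construction step of this pipeline can be illustrated on a toy binary mask: foreground pixels become nodes, 4-connected neighbors become edges, and a breadth-first search recovers a route between two pixels. This is a simplified stand-in, not the paper's procedure; the connectivity choice, the absence of skeletonization and georeferencing, and the function names are assumptions.

```python
from collections import deque


def mask_to_graph(mask):
    """Build a 4-connected adjacency map over foreground pixels of a mask."""
    rows, cols = len(mask), len(mask[0])
    graph = {}
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            neighbors = []
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc]:
                    neighbors.append((nr, nc))
            graph[(r, c)] = neighbors
    return graph


def shortest_route(graph, start, goal):
    """Breadth-first search; returns a pixel path or None if unreachable.

    Both start and goal are expected to be foreground pixels in the graph.
    """
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# Toy segmentation output: 1 marks a trail pixel.
mask = [[1, 1, 0],
        [0, 1, 0],
        [0, 1, 1]]
graph = mask_to_graph(mask)
route = shortest_route(graph, (0, 0), (2, 2))
```

In a full pipeline the pixel path would then be converted to geographic coordinates using the georeferencing transform before being handed to a routing engine.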

[111] IMD: A 6-DoF Pose Estimation Benchmark for Industrial Metallic Objects

Ruimin Ma,Sebastian Zudaire,Zhen Li,Chi Zhang

Main category: cs.CV

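TL;DR: This paper presents IMD, a 6-DoF pose estimation benchmark dataset designed for industrial applications, filling the gap left by existing datasets that focus on everyday objects and neglect industrial metallic objects.

Details Motivation: Objects in industrial scenes are often metallic, texture-less, and highly reflective, whereas existing 6D pose estimation datasets focus on everyday objects and generalize poorly to industrial scenarios.

Contribution: Proposes the Industrial Metallic Dataset (IMD), comprising 45 true-to-scale industrial components and supporting video object segmentation, 6D pose tracking, and one-shot 6D pose estimation.

Method: Data is captured with an RGB-D camera under natural indoor lighting and varied object arrangements, and existing state-of-the-art models (e.g., XMem, SAM2) are evaluated.

Result: The evaluation shows IMD is more challenging than existing household object datasets and provides a baseline for algorithm development in industrial robotics.

Insight: The reflectivity and lack of texture of industrial metallic objects make pose estimation harder, calling for more robust algorithms for such scenes.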

Abstract: Object 6DoF (6D) pose estimation is essential for robotic perception, especially in industrial settings. It enables robots to interact with the environment and manipulate objects. However, existing benchmarks on object 6D pose estimation primarily use everyday objects with rich textures and low reflectivity, limiting model generalization to industrial scenarios where objects are often metallic, texture-less, and highly reflective. To address this gap, we propose a novel dataset and benchmark, the Industrial Metallic Dataset (IMD), tailored for industrial applications. Our dataset comprises 45 true-to-scale industrial components, captured with an RGB-D camera under natural indoor lighting and varied object arrangements to replicate real-world conditions. The benchmark supports three tasks, including video object segmentation, 6D pose tracking, and one-shot 6D pose estimation. We evaluate existing state-of-the-art models, including XMem and SAM2 for segmentation, and BundleTrack and BundleSDF for pose estimation, to assess model performance in industrial contexts. Evaluation results show that our industrial dataset is more challenging than existing household object datasets. This benchmark provides the baseline for developing and comparing segmentation and pose estimation algorithms that better generalize to industrial robotics scenarios.

[112] A Fully Open and Generalizable Foundation Model for Ultrasound Clinical Applications

Hongyuan Zhang,Yuheng Wu,Mingyang Zhao,Zhiwei Chen,Rebecca Li,Fei Zhu,Haohan Zhao,Xiaohua Yuan,Meng Yang,Chunli Qiu,Xiang Cong,Haiyan Chen,Lina Luan,Randolph H. L. Wong,Huai Liao,Colin A Graham,Shi Chang,Guowei Tao,Dong Yi,Zhen Lei,Nassir Navab,Sebastien Ourselin,Jiebo Luo,Hongbin Liu,Gaofeng Meng

Main category: cs.CV

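TL;DR: The paper presents EchoCare, an ultrasound foundation model trained with self-supervised learning on the public large-scale dataset EchoCareData, which performs strongly across a range of clinical ultrasound tasks.

Details Motivation: Large labeled datasets are scarce in real-world clinical environments and task-specific models generalize poorly, which has hindered the development of clinical AI models for ultrasound.

Contribution: 1) The EchoCareData dataset, comprising 4.5 million ultrasound images from multi-center, multi-device, and multi-ethnic cohorts; 2) The EchoCare model with a hierarchical classifier that jointly learns pixel-level and representation-level features; 3) Better performance than existing methods on 10 ultrasound tasks.

Method: Trains EchoCare with self-supervised learning and introduces a hierarchical classifier to jointly learn global anatomical context and local ultrasound characteristics.

Result: EchoCare outperforms existing methods on 10 tasks, including disease diagnosis, lesion segmentation, and organ detection; the code and pretrained model are publicly released.

Insight: Training on large-scale multi-source data with a hierarchical classifier design can significantly improve the generalizability and clinical applicability of ultrasound AI models.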

Abstract: Artificial intelligence (AI) that can effectively learn ultrasound representations by integrating multi-source data holds significant promise for advancing clinical care. However, the scarcity of large labeled datasets in real-world clinical environments and the limited generalizability of task-specific models have hindered the development of generalizable clinical AI models for ultrasound applications. In this study, we present EchoCare, a novel ultrasound foundation model for generalist clinical use, developed via self-supervised learning on our curated, publicly available, large-scale dataset EchoCareData. EchoCareData comprises 4.5 million ultrasound images, sourced from over 23 countries across 5 continents and acquired via a diverse range of distinct imaging devices, thus encompassing global cohorts that are multi-center, multi-device, and multi-ethnic. Unlike prior studies that adopt off-the-shelf vision foundation model architectures, we introduce a hierarchical classifier into EchoCare to enable joint learning of pixel-level and representation-level features, capturing both global anatomical contexts and local ultrasound characteristics. With minimal training, EchoCare outperforms state-of-the-art comparison models across 10 representative ultrasound benchmarks of varying diagnostic difficulties, spanning disease diagnosis, lesion segmentation, organ detection, landmark prediction, quantitative regression, imaging enhancement and report generation. The code and pretrained model are publicly released, rendering EchoCare accessible for fine-tuning and local adaptation, supporting extensibility to additional applications. EchoCare provides a fully open and generalizable foundation model to boost the development of AI technologies for diverse clinical ultrasound applications.

[113] MSMA: Multi-Scale Feature Fusion For Multi-Attribute 3D Face Reconstruction From Unconstrained Images

Danling Cao

Main category: cs.CV

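TL;DR: The paper proposes MSMA, a multi-scale feature fusion method for multi-attribute 3D face reconstruction from unconstrained images. It addresses the weaknesses of existing methods in multi-attribute and multi-scale feature extraction and reaches SOTA-level performance on several datasets.

Details Motivation: Existing learning-based 3D face reconstruction methods typically depend on large amounts of 3D data and handle multi-attribute and multi-scale feature extraction poorly, leading to incomplete or inaccurate reconstructions.

Contribution: Proposes the MSMA framework, which combines multi-scale feature fusion with multi-attribute learning and introduces a large-kernel attention module to improve the precision of cross-scale feature extraction.

Method: Estimates 3D face parameters from a single 2D image using a multi-scale feature fusion and multi-attribute learning framework with a large-kernel attention module.

Result: Experiments on MICC Florence, Facewarehouse, and a custom-collected dataset show that MSMA matches, and in some cases surpasses, current SOTA methods under challenging conditions.

Insight: Combining multi-scale feature fusion with multi-attribute learning is an effective way to improve 3D face reconstruction accuracy, particularly in unconstrained environments.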

Abstract: Reconstructing a 3D face from a single unconstrained image remains a challenging problem due to the diverse conditions in unconstrained environments. Recently, learning-based methods have achieved notable results by effectively capturing complex facial structures and details across varying conditions. However, learning-based methods for 3D face reconstruction typically require substantial amounts of 3D facial data, which is difficult and costly to obtain. Consequently, to reduce reliance on labeled 3D face datasets, many existing approaches employ projection-based losses between generated and input images to constrain model training. Nonetheless, despite these advancements, existing approaches frequently struggle to capture detailed and multi-scale features under diverse facial attributes and conditions, leading to incomplete or less accurate reconstructions. In this paper, we propose a Multi-Scale Feature Fusion with Multi-Attribute (MSMA) framework for 3D face reconstruction from unconstrained images. Our method integrates multi-scale feature fusion with a focus on multi-attribute learning and leverages a large-kernel attention module to enhance the precision of feature extraction across scales, enabling accurate 3D facial parameter estimation from a single 2D image. Comprehensive experiments on the MICC Florence, Facewarehouse, and custom-collected datasets demonstrate that our approach achieves results on par with current state-of-the-art methods and, in some instances, surpasses SOTA performance under challenging conditions.

[114] Seg2Track-SAM2: SAM2-based Multi-object Tracking and Segmentation for Zero-shot Generalization

Diogo Mendonça,Tiago Barros,Cristiano Premebida,Urbano J. Nunes

Main category: cs.CV

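TL;DR: The paper proposes Seg2Track-SAM2, a framework that combines pre-trained object detectors with SAM2 for zero-shot multi-object tracking and segmentation, improving association accuracy and memory efficiency.

Details Motivation: Autonomous systems in dynamic environments need robust multi-object tracking (MOT). SAM2 has shown zero-shot generalization in video segmentation, but its application to MOTS is limited by identity management and memory efficiency.

Contribution: 1. The Seg2Track-SAM2 framework, which integrates pre-trained detectors with SAM2; 2. The Seg2Track module for track management; 3. A sliding-window memory strategy that reduces memory usage by up to 75%; 4. Zero-shot generalization with no fine-tuning.

Method: 1. Combines pre-trained detectors with SAM2 for zero-shot segmentation; 2. Uses the Seg2Track module for track initialization and reinforcement; 3. Uses a sliding-window memory strategy to optimize resource use.

Result: Achieves SOTA performance on the KITTI MOT and MOTS benchmarks, sets a new best in association accuracy (AssA), and reduces memory usage by up to 75%.

Insight: Combining foundation models with lightweight modules enables efficient zero-shot MOTS, and the sliding-window strategy offers a practical option for resource-constrained settings.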

Abstract: Autonomous systems require robust Multi-Object Tracking (MOT) capabilities to operate reliably in dynamic environments. MOT ensures consistent object identity assignment and precise spatial delineation. Recent advances in foundation models, such as SAM2, have demonstrated strong zero-shot generalization for video segmentation, but their direct application to MOTS (MOT+Segmentation) remains limited by insufficient identity management and memory efficiency. This work introduces Seg2Track-SAM2, a framework that integrates pre-trained object detectors with SAM2 and a novel Seg2Track module to address track initialization, track management, and reinforcement. The proposed approach requires no fine-tuning and remains detector-agnostic. Experimental results on KITTI MOT and KITTI MOTS benchmarks show that Seg2Track-SAM2 achieves state-of-the-art (SOTA) performance, ranking fourth overall in both car and pedestrian classes on KITTI MOTS, while establishing a new benchmark in association accuracy (AssA). Furthermore, a sliding-window memory strategy reduces memory usage by up to 75% with negligible performance degradation, supporting deployment under resource constraints. These results confirm that Seg2Track-SAM2 advances MOTS by combining robust zero-shot tracking, enhanced identity preservation, and efficient memory utilization. The code is available at https://github.com/hcmr-lab/Seg2Track-SAM2
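
The sliding-window memory idea in this abstract can be illustrated with a minimal sketch. It assumes that memory is a per-track buffer of frame features capped at a fixed length; the class name `SlidingWindowMemory`, the window size, and the string placeholders are hypothetical and are not taken from the Seg2Track-SAM2 code or the SAM2 API.

```python
from collections import deque


class SlidingWindowMemory:
    """Keep only the most recent `window` per-frame entries for one track."""

    def __init__(self, window):
        if window < 1:
            raise ValueError("window must be at least 1")
        # A deque with maxlen evicts the oldest entry automatically on append.
        self.frames = deque(maxlen=window)

    def add(self, frame_index, feature):
        self.frames.append((frame_index, feature))

    def frame_indices(self):
        return [idx for idx, _ in self.frames]


memory = SlidingWindowMemory(window=4)
for t in range(16):
    # Placeholder string standing in for a real per-frame embedding.
    memory.add(t, feature=f"mask-feature-{t}")

kept = memory.frame_indices()
saving = 1 - len(kept) / 16  # fraction of entries no longer stored
```

With a window of 4 over 16 frames, the buffer holds a quarter of the entries, which is the kind of trade-off the "up to 75%" figure describes; the paper's actual eviction policy may differ.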

[115] FineQuest: Adaptive Knowledge-Assisted Sports Video Understanding via Agent-of-Thoughts Reasoning

Haodong Chen,Haojian Huang,XinXiang Yin,Dian Shao

Main category: cs.CV

TL;DR: FineQuest is a training-free video question answering framework that improves sports VideoQA through dual-mode reasoning (reactive and deliberative), combined with SSGraph, a multimodal sports knowledge scene graph, and two new benchmarks, Gym-QA and Diving-QA.

Details Motivation: General-purpose large language models face a knowledge gap and complex reasoning challenges in sports video question answering, and need more efficient adaptation and domain-knowledge augmentation.

Contribution: 1. Proposes FineQuest, the first training-free dual-mode reasoning framework; 2. Introduces SSGraph, a multimodal sports knowledge scene graph; 3. Releases two new benchmarks, Gym-QA and Diving-QA.

Method: Uses reactive reasoning for simple questions and deliberative reasoning for complex ones, with SSGraph strengthening the use of cross-modal domain knowledge.

Result: Achieves SOTA performance on Gym-QA, Diving-QA, and SPORTU while retaining general VideoQA capability.

Insight: Combining dual-mode reasoning with a domain knowledge scene graph is an effective way to improve VideoQA in complex domains.

Abstract: Video Question Answering (VideoQA) based on Large Language Models (LLMs) has shown potential in general video understanding but faces significant challenges when applied to the inherently complex domain of sports videos. In this work, we propose FineQuest, the first training-free framework that leverages dual-mode reasoning inspired by cognitive science: i) Reactive Reasoning for straightforward sports queries and ii) Deliberative Reasoning for more complex ones. To bridge the knowledge gap between general-purpose models and domain-specific sports understanding, FineQuest incorporates SSGraph, a multimodal sports knowledge scene graph spanning nine sports, which encodes both visual instances and domain-specific terminology to enhance reasoning accuracy. Furthermore, we introduce two new sports VideoQA benchmarks, Gym-QA and Diving-QA, derived from the FineGym and FineDiving datasets, enabling diverse and comprehensive evaluation. FineQuest achieves state-of-the-art performance on these benchmarks as well as the existing SPORTU dataset, while maintaining strong general VideoQA capabilities.

[116] Pseudo-D: Informing Multi-View Uncertainty Estimation with Calibrated Neural Training Dynamics

Ang Nan Gu,Michael Tsang,Hooman Vaseli,Purang Abolmaesumi,Teresa Tsang

Main category: cs.CV

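TL;DR: The paper proposes a framework that generates uncertainty pseudo-labels from neural network training dynamics (NNTD) to improve uncertainty estimation and robustness in medical image classification.

Details Motivation: The labels used by current medical image diagnosis systems are oversimplified (e.g., one-hot labels) and ignore diagnostic uncertainty, which makes models overconfident on noisy or ambiguous inputs. The paper addresses this by introducing uncertainty-aware labels.

Contribution: 1. Proposes a framework based on neural network training dynamics (NNTD) that generates uncertainty-aware pseudo-labels; 2. The method is architecture-agnostic and applies to any supervised learning task; 3. Validates the method on an echocardiography classification task.

Method: 1. Assesses the inherent difficulty of each sample from the model's prediction dynamics during training; 2. Aggregates and calibrates these predictions to generate pseudo-labels; 3. Trains with these pseudo-labels to strengthen uncertainty estimation.

Result: On echocardiography classification, the method outperforms existing baselines in calibration, selective classification, and multi-view fusion.

Insight: Neural network training dynamics capture sample-level uncertainty effectively and give hard-to-annotate medical images a more flexible label space, improving robustness and uncertainty estimation.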

Abstract: Computer-aided diagnosis systems must make critical decisions from medical images that are often noisy, ambiguous, or conflicting, yet today’s models are trained on overly simplistic labels that ignore diagnostic uncertainty. One-hot labels erase inter-rater variability and force models to make overconfident predictions, especially when faced with incomplete or artifact-laden inputs. We address this gap by introducing a novel framework that brings uncertainty back into the label space. Our method leverages neural network training dynamics (NNTD) to assess the inherent difficulty of each training sample. By aggregating and calibrating model predictions during training, we generate uncertainty-aware pseudo-labels that reflect the ambiguity encountered during learning. This label augmentation approach is architecture-agnostic and can be applied to any supervised learning pipeline to enhance uncertainty estimation and robustness. We validate our approach on a challenging echocardiography classification benchmark, demonstrating superior performance over specialized baselines in calibration, selective classification, and multi-view fusion.
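
One way to picture "aggregating and calibrating model predictions during training" is the sketch below. It assumes that logits for each sample are recorded once per epoch, that calibration is plain temperature scaling, and that aggregation is a simple mean; the function names, the temperature value, and the toy logits are all hypothetical, and the paper's actual aggregation and calibration may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(epoch_logits, temperature=1.0):
    """Average the calibrated class probabilities recorded across epochs."""
    probs = [softmax(logits, temperature) for logits in epoch_logits]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]


# Toy training dynamics: three epochs, three classes (made-up numbers).
easy = [[4.0, 0.0, 0.0], [5.0, 0.0, 0.0], [6.0, 0.0, 0.0]]   # consistently class 0
hard = [[1.0, 0.5, 0.0], [0.2, 1.0, 0.0], [1.0, 0.9, 0.0]]   # prediction flips

easy_label = pseudo_label(easy, temperature=1.5)
hard_label = pseudo_label(hard, temperature=1.5)
```

The consistently predicted sample keeps a near one-hot label, while the sample whose prediction flips between epochs ends up with a flatter label that encodes its ambiguity.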

[117] LFRA-Net: A Lightweight Focal and Region-Aware Attention Network for Retinal Vessel Segmentation

Mehwish Mehmood,Shahzaib Iqbal,Tariq Mahmood Khan,Ivor Spence,Muhammad Fahim

Main category: cs.CV

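TL;DR: LFRA-Net is a lightweight network that combines focal modulation attention and region-aware attention for retinal vessel segmentation, achieving high accuracy at low computational cost.

Details Motivation: Existing deep learning models for retinal vessel segmentation extract tiny vessels poorly and are computationally expensive, so a lightweight and efficient solution is needed.

Contribution: Proposes LFRA-Net, which improves feature representation and regional focus through focal modulation attention and region-aware attention while staying lightweight (only 0.17M parameters).

Method: Introduces focal modulation attention at the encoder-decoder bottleneck and region-aware attention in the selective skip connections to capture local and global dependencies effectively.

Result: On the DRIVE, STARE, and CHASE_DB datasets, its Dice scores and Jaccard indices are significantly better than those of existing methods, at low computational cost (0.66 MB memory, 10.50 GFLOPs).

Insight: LFRA-Net shows the potential of lightweight networks for high-accuracy segmentation, especially in resource-limited clinical settings.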

Abstract: Retinal vessel segmentation is critical for the early diagnosis of vision-threatening and systemic diseases, especially in real-world clinical settings with limited computational resources. Although significant improvements have been made in deep learning-based segmentation methods, current models still face challenges in extracting tiny vessels and suffer from high computational costs. In this study, we present LFRA-Net by incorporating focal modulation attention at the encoder-decoder bottleneck and region-aware attention in the selective skip connections. LFRA-Net is a lightweight network optimized for precise and effective retinal vascular segmentation. It enhances feature representation and regional focus by efficiently capturing local and global dependencies. LFRA-Net outperformed many state-of-the-art models while maintaining lightweight characteristics with only 0.17 million parameters, 0.66 MB memory size, and 10.50 GFLOPs. We validated it on three publicly available datasets: DRIVE, STARE, and CHASE_DB. It performed better in terms of Dice score (84.28%, 88.44%, and 85.50%) and Jaccard index (72.86%, 79.31%, and 74.70%) on the DRIVE, STARE, and CHASE_DB datasets, respectively. LFRA-Net provides an ideal ratio between segmentation accuracy and computational cost compared to existing deep learning methods, which makes it suitable for real-time clinical applications in areas with limited resources. The code can be found at https://github.com/Mehwish4593/LFRA-Net.
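
For readers unfamiliar with the two reported metrics, the sketch below computes the Dice score and Jaccard index for a pair of flattened binary masks. The toy masks are invented for illustration, and the convention of returning 1.0 when both masks are empty is an assumption, not something stated in the paper.

```python
def dice_and_jaccard(pred, target):
    """Return (Dice, Jaccard) for two equal-length binary masks."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if t and not p)
    if tp + fp + fn == 0:
        # Both masks are empty: treated here as perfect agreement (assumed convention).
        return 1.0, 1.0
    dice = 2 * tp / (2 * tp + fp + fn)
    jaccard = tp / (tp + fp + fn)
    return dice, jaccard


# Flattened toy masks: 1 = vessel pixel, 0 = background.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
target = [1, 1, 0, 0, 0, 1, 1, 0]
dice, jaccard = dice_and_jaccard(pred, target)
```

On a single mask pair the two metrics are tied by J = D / (2 - D); dataset-level averages, such as those quoted in the abstract, need not satisfy this identity exactly.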

[118] SpecVLM: Fast Speculative Decoding in Vision-Language Models

Haiduo Huang,Fuwei Yang,Zhenhua Liu,Xuanwu Yin,Dong Li,Pengju Ren,Emad Barsoum

Main category: cs.CV

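TL;DR: SpecVLM proposes an efficient inference acceleration method that uses an elastic visual compressor and an online-logit distillation protocol to substantially speed up vision-language model inference while preserving output quality.

Details Motivation: Vision-language model inference is costly in compute and memory; in particular, the growth of visual tokens in the prefill stage limits efficiency.

Contribution: 1. Establishes a strong baseline, EagleVLM, with 1.5-2.3x speedups; 2. Designs an elastic visual compressor that dynamically selects a compression strategy; 3. Proposes an online-logit distillation protocol that avoids building costly offline distillation corpora.

Method: Combines the EagleVLM baseline with the elastic visual compressor and trains the draft model with the online-logit distillation protocol, optimizing a combined cross-entropy and Smooth L1 loss.

Result: SpecVLM reaches 2.5-2.9x end-to-end speedups within 5 epochs, holds across resolutions and task difficulties, and keeps decoding lossless.

Insight: Longer online training increases the draft model's average accepted length, which improves inference efficiency.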

Abstract: Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs), but directly porting it to vision-language models (VLMs) faces unique systems constraints: the prefill stage is dominated by visual tokens whose count scales with image resolution and video length, inflating both compute and memory, especially the key-value (KV) cache. We study speculative decoding for VLMs and introduce SpecVLM, a practical system that (1) establishes a strong EAGLE-2-style baseline, EagleVLM, delivering 1.5–2.3x end-to-end speedups over full autoregressive inference, and (2) further accelerates VLM inference with an elastic visual compressor that adaptively selects among pruning, pooling, convolution, and resampler primitives to balance FLOPs/parameters and accuracy per input. To avoid costly offline distillation corpora, we propose an online-logit distillation protocol that trains the draft model with on-the-fly teacher logits and penultimate features using a combined cross-entropy and Smooth L1 objective, eliminating storage and preprocessing while remaining compute-efficient. This protocol reveals a training-time scaling effect: longer online training monotonically increases the draft model’s average accepted length, improving speculative efficiency. Empirically, SpecVLM achieves additional acceleration, culminating in 2.5–2.9x end-to-end speedups within 5 epochs across LLaVA and MMMU, consistently over resolutions and task difficulties, while preserving the target model’s output distribution (lossless decoding). Our code is available at https://github.com/haiduo/SpecVLM.
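
The "combined cross-entropy and Smooth L1 objective" can be sketched for a single token position as follows. This is an illustration under stated assumptions: the cross-entropy is taken against the teacher's soft distribution, the Smooth L1 term compares draft and teacher feature vectors, and the two terms are summed with a hypothetical weight `feat_weight`. The abstract does not specify these details, and none of the names come from the SpecVLM code.

```python
import math


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def soft_cross_entropy(draft_logits, teacher_logits):
    """Cross-entropy of the draft distribution against the teacher distribution."""
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits)]
    draft_logp = log_softmax(draft_logits)
    return -sum(p * lp for p, lp in zip(teacher_probs, draft_logp))


def smooth_l1(pred, target, beta=1.0):
    """Quadratic below `beta`, linear above it, averaged over elements."""
    total = 0.0
    for a, b in zip(pred, target):
        d = abs(a - b)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def distill_loss(draft_logits, teacher_logits, draft_feat, teacher_feat, feat_weight=1.0):
    # feat_weight is a hypothetical balancing coefficient.
    ce = soft_cross_entropy(draft_logits, teacher_logits)
    return ce + feat_weight * smooth_l1(draft_feat, teacher_feat)


teacher_logits = [2.0, 0.5, -1.0]
matched = distill_loss(teacher_logits, teacher_logits, [0.1, 0.2], [0.1, 0.2])
mismatched = distill_loss([-1.0, 0.5, 2.0], teacher_logits, [0.6, 3.2], [0.1, 0.2])
```

In an online setting the teacher logits and features would be produced on the fly for each batch rather than read from a stored corpus, which is the storage saving the abstract points to.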

[119] Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation

Tim Lebailly,Vijay Veerabadran,Satwik Kottur,Karl Ridgeway,Michael Louis Iuzzolino

Main category: cs.CV

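TL;DR: The paper uses synthetic descriptions generated by generative vision-language models (VLMs) to densely align images with language, improving zero-shot open-vocabulary segmentation while being more data-efficient.

Details Motivation: Existing generative VLMs are strong at high-level image understanding but weak at spatially dense alignment between vision and language. The paper uses synthetic descriptions to close this gap and improve zero-shot inference on dense tasks such as segmentation.

Contribution: Proposes a method that densely aligns images with language through synthetic descriptions, markedly improving zero-shot open-vocabulary segmentation and reducing data requirements.

Method: Generates synthetic descriptions (synthetic captions) with a generative VLM and feeds them to a dense alignment method to strengthen the semantic alignment between images and language. Synthetic captions are cheap, scalable, and easy to generate.

Result: On standard zero-shot open-vocabulary segmentation benchmarks, the method outperforms prior work while being more data-efficient.

Insight: Synthetic descriptions are an efficient tool for semantic alignment. They compensate for the weakness of generative VLMs on spatially dense tasks and open a new research direction for dense alignment.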

Abstract: Generative vision-language models (VLMs) exhibit strong high-level image understanding but lack spatially dense alignment between vision and language modalities, as our findings indicate. Orthogonal to advancements in generative VLMs, another line of research has focused on representation learning for vision-language alignment, targeting zero-shot inference for dense tasks like segmentation. In this work, we bridge these two directions by densely aligning images with synthetic descriptions generated by VLMs. Synthetic captions are inexpensive, scalable, and easy to generate, making them an excellent source of high-level semantic understanding for dense alignment methods. Empirically, our approach outperforms prior work on standard zero-shot open-vocabulary segmentation benchmarks/datasets, while also being more data-efficient.

[120] Bridging Vision Language Models and Symbolic Grounding for Video Question Answering

Haodi Ma,Vyom Pathak,Daisy Zhe Wang

Main category: cs.CV

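TL;DR: The paper examines how to combine vision language models (VLMs) with symbolic scene graphs (SGs) as intermediate grounding signals in video question answering (VQA). It proposes SG-VLM, a modular framework that couples VLMs with SGs via prompting and visual localization, improving causal and temporal reasoning.

Details Motivation: Current VLMs perform well on video question answering but rely on shallow correlations, which leads to weak temporal grounding and limited interpretability. More structured symbolic representations are needed to compensate.

Contribution: Proposes the SG-VLM framework, which uses symbolic scene graphs to strengthen the grounding and reasoning of VLMs, and validates it on several benchmarks.

Method: Builds the modular SG-VLM framework, which combines frozen VLMs with scene graphs (SGs) and uses prompting and visual localization for multimodal reasoning.

Result: On the NExT-QA, iVQA, and ActivityNet-QA benchmarks, SG-VLM improves causal and temporal reasoning and outperforms prior baselines, though gains over strong VLMs are limited.

Insight: Symbolic grounding shows promise for video understanding, but its benefit is constrained by the limitations of current implementations; hybrid VLM-symbolic methods need further exploration.

Abstract: Video Question Answering (VQA) requires models to reason over spatial, temporal, and causal cues in videos. Recent vision language models (VLMs) achieve strong results but often rely on shallow correlations, leading to weak temporal grounding and limited interpretability. We study symbolic scene graphs (SGs) as intermediate grounding signals for VQA. SGs provide structured object-relation representations that complement VLMs' holistic reasoning. We introduce SG-VLM, a modular framework that integrates frozen VLMs with scene graph grounding via prompting and visual localization. Across three benchmarks (NExT-QA, iVQA, ActivityNet-QA) and multiple VLMs (QwenVL, InternVL), SG-VLM improves causal and temporal reasoning and outperforms prior baselines, though gains over strong VLMs are limited. These findings highlight both the promise and current limitations of symbolic grounding, and offer guidance for future hybrid VLM-symbolic approaches in video understanding.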

[121] Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding

Meng Luo,Shengqiong Wu,Liqiang Jing,Tianjie Ju,Li Zheng,Jinxiang Lai,Tianlong Wu,Xinya Du,Jian Li,Siyuan Yan,Jiebo Luo,William Yang Wang,Hao Fei,Mong-Li Lee,Wynne Hsu

Main category: cs.CV

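TL;DR: The paper proposes Dr.V, a framework that diagnoses hallucination in video models through fine-grained spatial-temporal grounding. It includes the dataset Dr.V-Bench and the agent Dr.V-Agent, and experiments show that it is effective and practical.

Details Motivation: Large video models (LVMs) have improved video understanding but still hallucinate, producing content that contradicts the input video. The paper proposes the Dr.V framework to address this.

Contribution: Proposes the Dr.V framework, comprising the dataset Dr.V-Bench and the agent Dr.V-Agent, which diagnoses hallucination across hierarchical perceptive, temporal, and cognitive levels and provides fine-grained spatial-temporal annotation and reasoning.

Method: Dr.V works hierarchically: 1) at the perceptive and temporal levels it detects hallucinations through fine-grained spatial-temporal grounding; 2) at the cognitive level it performs reasoning. The dataset covers diverse tasks and annotations.

Result: Experiments show that Dr.V-Agent diagnoses hallucination effectively and improves interpretability and reliability, offering a practical approach for real-world video understanding tasks.

Insight: A hierarchical video understanding framework (perception, temporal, cognition) helps address hallucination, and fine-grained spatial-temporal annotation and reasoning are key. Open-sourcing the dataset and tools supports progress in the field.

Abstract: Recent advancements in large video models (LVMs) have significantly enhanced video understanding. However, these models continue to suffer from hallucinations, producing content that conflicts with input videos. To address this issue, we propose Dr.V, a hierarchical framework covering perceptive, temporal, and cognitive levels to diagnose video hallucination by fine-grained spatial-temporal grounding. Dr.V comprises two key components: a benchmark dataset Dr.V-Bench and a satellite video agent Dr.V-Agent. Dr.V-Bench includes 10k instances drawn from 4,974 videos spanning diverse tasks, each enriched with detailed spatial-temporal annotation. Dr.V-Agent detects hallucinations in LVMs by systematically applying fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive-level reasoning. This step-by-step pipeline mirrors human-like video comprehension and effectively identifies hallucinations. Extensive experiments demonstrate that Dr.V-Agent is effective in diagnosing hallucination while enhancing interpretability and reliability, offering a practical blueprint for robust video understanding in real-world scenarios. All our data and code are available at https://github.com/Eurekaleo/Dr.V.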

[122] Multi-animal tracking in Transition: Comparative Insights into Established and Emerging Methods

Anne Marthe Sophie Ngo Bibinbe,Patrick Gagnon,Jamie Ahloy-Dallaire,Eric R. Paquet

Main category: cs.CV

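TL;DR: The study compares traditional multi-animal tracking (MAT) tools with multi-object tracking (MOT) methods on long-term pig tracking and finds that MOT methods are more accurate and reliable.

Details Motivation: Precision livestock farming needs advanced monitoring tools to meet the industry's growing management demands. Traditional MAT tools underperform recent MOT methods, which reduces the accuracy of downstream tasks such as behavior analysis and health state estimation, so a systematic comparison of the two families is needed.

Contribution: The study benchmarks MAT and MOT methods comprehensively on long-term pig tracking, demonstrates the performance advantage of MOT methods, and points to more reliable tracking solutions for precision livestock farming.

Method: The study evaluates several MAT tools (such as DeepLabCut and idTracker) and MOT methods (such as ByteTrack, DeepSORT, Track-Anything, and PromptTrack) on a 10-minute pig tracking dataset.

Result: MOT methods outperform traditional MAT tools overall on long-term tracking, with markedly better tracking accuracy and reliability.

Insight: Although MAT tools are easier to use in some scenarios, MOT methods perform better and so suit the needs of precision livestock farming. This is a useful reference for future livestock monitoring technology.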

Abstract: Precision livestock farming requires advanced monitoring tools to meet the increasing management needs of the industry. Computer vision systems capable of long-term multi-animal tracking (MAT) are essential for continuous behavioral monitoring in livestock production. MAT, a specialized subset of multi-object tracking (MOT), shares many challenges with MOT, but also faces domain-specific issues including frequent animal occlusion, highly similar appearances among animals, erratic motion patterns, and a wide range of behavior types. While some existing MAT tools are user-friendly and widely adopted, they often underperform compared to state-of-the-art MOT methods, which can result in inaccurate downstream tasks such as behavior analysis, health state estimation, and related applications. In this study, we benchmarked both MAT and MOT approaches for long-term tracking of pigs. We compared tools such as DeepLabCut and idTracker with MOT-based methods including ByteTrack, DeepSORT, cross-input consistency, and newer approaches like Track-Anything and PromptTrack. All methods were evaluated on a 10-minute pig tracking dataset. Our results demonstrate that, overall, MOT approaches outperform traditional MAT tools, even for long-term tracking scenarios. These findings highlight the potential of recent MOT techniques to enhance the accuracy and reliability of automated livestock tracking.

[123] SAM-TTT: Segment Anything Model via Reverse Parameter Configuration and Test-Time Training for Camouflaged Object Detection

Zhenni Yu,Li Zhao,Guobao Xiao,Xiaoqin Zhang

Main category: cs.CV

TL;DR: The paper proposes SAM-TTT, a new method that uses reverse parameter configuration and test-time training to improve the Segment Anything Model (SAM) on camouflaged object detection (COD).

Details Motivation: Existing SAM-based COD models usually focus only on strengthening advantageous parameters and overlook the negative effect of adverse parameters on semantic understanding, which hurts downstream performance.

Contribution: Proposes the Reverse SAM Parameter Configuration Module and the T-Visioner Module, which suppress adverse parameters and strengthen advantageous ones respectively, markedly improving SAM on the COD task.

Method: Combines the reverse parameter configuration module (training-free) with Test-Time Training layers (the T-Visioner Module) to handle adverse and advantageous parameters together.

Result: Achieves state-of-the-art performance on several COD benchmarks, setting a new standard for the field.

Insight: Jointly applying reverse parameter configuration and test-time training effectively improves a model's semantic understanding in complex vision tasks.

Abstract: This paper introduces a new Segment Anything Model (SAM) that leverages reverse parameter configuration and test-time training to enhance its performance on Camouflaged Object Detection (COD), named SAM-TTT. While most existing SAM-based COD models primarily focus on enhancing SAM by extracting favorable features and amplifying its advantageous parameters, a crucial gap is identified: insufficient attention to adverse parameters that impair SAM’s semantic understanding in downstream tasks. To tackle this issue, the Reverse SAM Parameter Configuration Module is proposed to effectively mitigate the influence of adverse parameters in a train-free manner by configuring SAM’s parameters. Building on this foundation, the T-Visioner Module is unveiled to strengthen advantageous parameters by integrating Test-Time Training layers, originally developed for language tasks, into vision tasks. Test-Time Training layers represent a new class of sequence modeling layers characterized by linear complexity and an expressive hidden state. By integrating the two modules, SAM-TTT simultaneously suppresses adverse parameters while reinforcing advantageous ones, significantly improving SAM’s semantic understanding in the COD task. Our experimental results on various COD benchmarks demonstrate that the proposed approach achieves state-of-the-art performance, setting a new benchmark in the field. The code will be available at https://github.com/guobaoxiao/SAM-TTT.

[124] Logit Mixture Outlier Exposure for Fine-grained Out-of-Distribution Detection

Akito Shinohara,Kohei Fukuda,Hiroaki Aizawa

Main category: cs.CV

TL;DR: Proposes a linear interpolation method in logit space that mixes in-distribution and out-of-distribution data to improve detection of OOD data lying close to the in-distribution data.

Details Motivation: Existing methods such as Outlier Exposure and Mixture Outlier Exposure perform reasonably on OOD detection, but they struggle to learn class relationships effectively and to separate in-distribution from out-of-distribution data clearly.

Contribution: Proposes a linear interpolation technique in logit space that smooths logits between classes, and enforces consistency between logit-space and input-space mixing to improve OOD detection, particularly for OOD data close to the in-distribution data.

Method: Mixes in-distribution and out-of-distribution data in logit space for smooth interpolation, and keeps the result consistent with the logits obtained from input-space mixing.

Result: Experiments show that the method reduces abrupt changes in model outputs near decision boundaries and yields a smoother separation between in-distribution and out-of-distribution data.

Insight: The properties of logit space help separate class-wise distributions, and mixing improves the model's sensitivity to OOD data near the boundary.

Abstract: The ability to detect out-of-distribution data is essential not only for ensuring robustness against unknown or unexpected input data but also for improving the generalization performance of the model. Among various out-of-distribution detection methods, Outlier Exposure and Mixture Outlier Exposure are promising approaches that enhance out-of-distribution detection performance by exposing the model to outlier data during training. However, even with these sophisticated techniques, it remains challenging for models to learn the relationships between classes effectively and to clearly distinguish data sampled from in-distribution and out-of-distribution sources. Therefore, we focus on the logit space, where the properties between class-wise distributions are distinctly separated from those in the input or feature spaces. Specifically, we propose a linear interpolation technique in the logit space that mixes in-distribution and out-of-distribution data to facilitate smoothing logits between classes and improve the out-of-distribution detection performance, particularly for out-of-distribution data that lie close to the in-distribution data. Additionally, we enforce consistency between the logits obtained through mixing in the logit space and those generated via mixing in the input space. Our experiments demonstrate that our logit-space mixing technique reduces the abrupt fluctuations in the model outputs near the decision boundaries, resulting in smoother and more reliable separation between in-distribution and out-of-distribution data. Furthermore, we evaluate the effectiveness of the proposed method on a fine-grained out-of-distribution detection task.
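
The consistency term described above, between logits mixed in logit space and logits of an input-space mixture, can be sketched as follows. The two toy classifiers, the mixing coefficient, and the mean-squared-error form of the penalty are assumptions made for illustration; the paper's actual loss may be defined differently.

```python
def mix(a, b, lam):
    """Linear interpolation lam * a + (1 - lam) * b, element-wise."""
    return [lam * x + (1 - lam) * y for x, y in zip(a, b)]


def linear_model(x):
    # Hypothetical 2-input, 3-class linear classifier: logits = W x.
    weights = [[1.0, -1.0], [-0.5, 2.0], [0.3, 0.3]]
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def nonlinear_model(x):
    # Same classifier followed by an element-wise square, to break linearity.
    return [z * z for z in linear_model(x)]


def consistency_loss(x_id, x_ood, lam, model):
    """MSE between logit-space mixing and the logits of the mixed input."""
    logit_space = mix(model(x_id), model(x_ood), lam)
    input_space = model(mix(x_id, x_ood, lam))
    return sum((a - b) ** 2 for a, b in zip(logit_space, input_space)) / len(logit_space)


x_id, x_ood = [1.0, 0.0], [0.0, 1.0]
loss_linear = consistency_loss(x_id, x_ood, 0.5, linear_model)
loss_nonlinear = consistency_loss(x_id, x_ood, 0.5, nonlinear_model)
```

A purely linear classifier satisfies the constraint trivially, so the penalty only bites for nonlinear networks, where it pushes outputs to vary smoothly along the segment between an in-distribution and an out-of-distribution sample.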

[125] Integrating Prior Observations for Incremental 3D Scene Graph Prediction

Marian Renz,Felix Igelbrink,Martin Atzmueller

Main category: cs.CV

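TL;DR: The paper proposes a novel heterogeneous graph model for incremental 3D semantic scene graph (3DSSG) prediction that integrates multi-modal information (such as prior observations) and generalizes better in complex real-world environments.

Details Motivation: Existing 3DSSG methods rely mainly on sensor data and usually assume a complete scene reconstruction, which limits their usefulness in incremental settings. The paper uses multi-modal information (such as semantic embeddings and prior observations) to make the model more adaptable and scalable.

Contribution: Proposes a heterogeneous graph model that integrates multi-modal information directly in the message-passing process, without specialized modules or full scene reconstructions, for incremental 3DSSG prediction.

Method: Uses a multi-layer heterogeneous graph model that combines global and local scene representations and enriches a graph neural network (GNN) with semantic embeddings (e.g., CLIP) and prior observations.

Result: Experiments on the 3DSSG dataset show that GNNs enriched with multi-modal information scale and generalize better.

Insight: Multi-modal information can improve incremental scene graph prediction, which matters for environment understanding in robotics and embodied AI.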

Abstract: 3D semantic scene graphs (3DSSG) provide compact structured representations of environments by explicitly modeling objects, attributes, and relationships. While 3DSSGs have shown promise in robotics and embodied AI, many existing methods rely mainly on sensor data, not integrating further information from semantically rich environments. Additionally, most methods assume access to complete scene reconstructions, limiting their applicability in real-world, incremental settings. This paper introduces a novel heterogeneous graph model for incremental 3DSSG prediction that integrates additional, multi-modal information, such as prior observations, directly into the message-passing process. Utilizing multiple layers, the model flexibly incorporates global and local scene representations without requiring specialized modules or full scene reconstructions. We evaluate our approach on the 3DSSG dataset, showing that GNNs enriched with multi-modal information such as semantic embeddings (e.g., CLIP) and prior observations offer a scalable and generalizable solution for complex, real-world environments. The full source code of the presented architecture will be made available at https://github.com/m4renz/incremental-scene-graph-prediction.

[126] NeuroGaze-Distill: Brain-informed Distillation and Depression-Inspired Geometric Priors for Robust Facial Emotion Recognition

Zilin Li,Weiwei Xu,Xuanqi Zhao,Yiran Zhu

Main category: cs.CV

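TL;DR: NeuroGaze-Distill is a cross-modal distillation framework that transfers brain-signal priors into an image-only FER model, using static V/A prototypes and a depression-inspired geometric prior to improve robustness and generalization.

Details Motivation: Traditional FER models rely only on pixel data and generalize poorly across datasets, because facial expression is an indirect and biased proxy for affect. The study uses brain-signal priors to improve robustness.

Contribution: 1. Proposes the NeuroGaze-Distill framework, which distills brain-signal priors into an image-based FER model; 2. Introduces static V/A prototypes and a depression-inspired geometric prior (D-Geo); 3. Improves cross-dataset performance and robustness.

Method: 1. Trains a teacher model on EEG topographic maps to produce a 5x5 V/A prototype grid; 2. Trains the student model (ResNet-18/50) with Proto-KD (feature alignment) and D-Geo (geometric prior regularization); 3. Deploys with image data only, with no paired brain signals required.

Result: On the FERPlus validation set and in cross-dataset tests (AffectNet-mini/CK+), the model outperforms the baseline; Proto-KD and D-Geo both bring consistent gains, and the 5x5 grid performs best.

Insight: 1. Brain-signal priors can effectively improve the robustness of FER models; 2. Geometric findings from depression research can serve as an additional supervision signal for FER; 3. A simple distillation framework can reach high performance without architectural complexity.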

Abstract: Facial emotion recognition (FER) models trained only on pixels often fail to generalize across datasets because facial appearance is an indirect and biased proxy for underlying affect. We present NeuroGaze-Distill, a cross-modal distillation framework that transfers brain-informed priors into an image-only FER student via static Valence/Arousal (V/A) prototypes and a depression-inspired geometric prior (D-Geo). A teacher trained on EEG topographic maps from DREAMER (with MAHNOB-HCI as unlabeled support) produces a consolidated 5x5 V/A prototype grid that is frozen and reused; no EEG-face pairing and no non-visual signals at deployment are required. The student (ResNet-18/50) is trained on FERPlus with conventional CE/KD and two lightweight regularizers: (i) Proto-KD (cosine) aligns student features to the static prototypes; (ii) D-Geo softly shapes the embedding geometry in line with affective findings often reported in depression research (e.g., anhedonia-like contraction in high-valence regions). We evaluate both within-domain (FERPlus validation) and cross-dataset protocols (AffectNet-mini; optional CK+), reporting standard 8-way scores alongside present-only Macro-F1 and balanced accuracy to fairly handle label-set mismatch. Ablations attribute consistent gains to prototypes and D-Geo, and favor 5x5 over denser grids for stability. The method is simple, deployable, and improves robustness without architectural complexity.
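
The Proto-KD (cosine) regularizer can be sketched as a penalty of one minus the cosine similarity between each student feature and its assigned frozen prototype. The abstract does not say how a sample is matched to a grid cell, so the explicit `cell_ids` argument, the two-dimensional toy vectors, and the function names are all assumptions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def proto_kd_loss(features, prototypes, cell_ids):
    """Mean of (1 - cosine) between each feature and its assigned prototype."""
    losses = [
        1.0 - cosine_similarity(feature, prototypes[cell])
        for feature, cell in zip(features, cell_ids)
    ]
    return sum(losses) / len(losses)


# Two cells of a hypothetical 5x5 valence/arousal grid, keyed by (row, column).
prototypes = {(0, 0): [1.0, 0.0], (4, 4): [0.0, 1.0]}

aligned = proto_kd_loss([[2.0, 0.0]], prototypes, [(0, 0)])
orthogonal = proto_kd_loss([[2.0, 0.0]], prototypes, [(4, 4)])
```

Because cosine similarity ignores vector length, the penalty constrains only the direction of the student embedding, and the prototypes themselves stay fixed during student training.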

[127] Enriched text-guided variational multimodal knowledge distillation network (VMD) for automated diagnosis of plaque vulnerability in 3D carotid artery MRI

Bo Cao,Fan Yu,Mengmeng Feng,SenHao Zhang,Xin Meng,Yue Zhang,Zhen Qian,Jie Lu

Main category: cs.CV

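TL;DR: The paper proposes a method based on variational inference and multimodal knowledge distillation (VMD) for automated diagnosis of plaque vulnerability in 3D carotid MRI, improving diagnostic accuracy by combining radiologists' domain knowledge with multimodal data.

Details Motivation: Both conventional 3D vision networks and radiologists find it challenging to diagnose plaque vulnerability directly from 3D carotid MRI images. Radiologists rely on multimodal methods and domain expertise, which motivates building a multimodal diagnostic network.

Contribution: The main contribution is the VMD method, which uses variational inference and multimodal knowledge distillation to exploit cross-modality prior knowledge from limited image annotations and radiology reports, improving the diagnostic network.

Method: Combines variational inference with multimodal knowledge distillation, fusing multimodal data and domain knowledge to improve diagnostic accuracy on unannotated 3D MRI images.

Result: Experiments on an in-house dataset verify the effectiveness of the VMD strategy and show that it markedly improves diagnostic performance.

Insight: The study highlights the importance of fusing multimodal data with domain knowledge in medical image diagnosis and offers a reference for similar tasks.

Abstract: Multimodal learning has attracted much attention in recent years due to its ability to effectively utilize data features from a variety of different modalities. Diagnosing the vulnerability of atherosclerotic plaques directly from carotid 3D MRI images is relatively challenging for both radiologists and conventional 3D vision networks. In clinical practice, radiologists assess patient conditions using a multimodal approach that incorporates various imaging modalities and domain-specific expertise, paving the way for the creation of multimodal diagnostic networks. In this paper, we have developed an effective strategy to leverage radiologists’ domain knowledge to automate the diagnosis of carotid plaque vulnerability through Variational inference and Multimodal knowledge Distillation (VMD). This method excels in harnessing cross-modality prior knowledge from limited image annotations and radiology reports within training data, thereby enhancing the diagnostic network’s accuracy for unannotated 3D MRI images. We conducted in-depth experiments on an in-house dataset and verified the effectiveness of the proposed VMD strategy.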

[128] Graph Algorithm Unrolling with Douglas-Rachford Iterations for Image Interpolation with Guaranteed Initialization

Xue Zhang,Bingshuo Hu,Gene Cheung

Main category: cs.CV

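TL;DR: The paper proposes an image interpolation method based on graph algorithm unrolling with Douglas-Rachford iterations. It initializes a graph adjacency matrix, learns perturbation matrices, and builds a lightweight interpretable network that improves performance while drastically reducing the parameter count.

Details Motivation: Conventional deep neural networks initialize parameters at random and optimize them with stochastic gradient descent, which risks poor local minima. The paper combines a graph algorithm with Douglas-Rachford iterations to obtain a more reliable initialization, then learns perturbation matrices to improve performance.

Contribution: The main contributions are: (1) a graph-algorithm-based image interpolation method; (2) unrolling Douglas-Rachford iterations into a lightweight network; (3) learning perturbation matrices for further gain; (4) a drastic reduction in parameters together with improved performance.

Method: The method has two steps: first, initialize a directed graph adjacency matrix A from a known interpolator, which sets a performance baseline; then learn perturbation matrices P and P(2), and unroll the Douglas-Rachford iterations that implement them into a lightweight interpretable network.

Result: Experiments show state-of-the-art image interpolation performance with drastically fewer network parameters.

Insight: Combining the theoretical framework of graph algorithms with iterative optimization yields more efficient solutions for image processing tasks while keeping models lightweight and interpretable.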

Abstract: Conventional deep neural nets (DNNs) initialize network parameters at random and then optimize each one via stochastic gradient descent (SGD), resulting in substantial risk of poor-performing local minima. Focusing on the image interpolation problem and leveraging a recent theorem that maps a (pseudo-)linear interpolator Θ to a directed graph filter that is a solution to a MAP problem regularized with a graph shift variation (GSV) prior, we first initialize a directed graph adjacency matrix A based on a known interpolator Θ, establishing a baseline performance. Then, towards further gain, we learn perturbation matrices P and P(2) from data to augment A, whose restoration effects are implemented via Douglas-Rachford (DR) iterations, which we unroll into a lightweight interpretable neural net. Experimental results demonstrate state-of-the-art image interpolation results, while drastically reducing network parameters.
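
For readers new to Douglas-Rachford (DR) splitting, the sketch below runs the generic DR iteration on a toy problem: minimize 0.5 * ||x - a||^2 subject to x >= 0. It is not the paper's graph-filter objective; the toy objective, step size `gamma`, and function name are assumptions chosen so that the proximal steps have closed forms.

```python
def douglas_rachford(a, gamma=1.0, iterations=200):
    """Minimise 0.5 * ||x - a||^2 subject to x >= 0 via Douglas-Rachford splitting."""
    z = [0.0] * len(a)
    y = list(z)
    for _ in range(iterations):
        # Proximal step for the quadratic term gamma * 0.5 * ||x - a||^2.
        x = [(zi + gamma * ai) / (1.0 + gamma) for zi, ai in zip(z, a)]
        # Proximal step for the constraint: projection of the reflection 2x - z.
        y = [max(0.0, 2.0 * xi - zi) for xi, zi in zip(x, z)]
        # Update of the auxiliary variable.
        z = [zi + yi - xi for zi, yi, xi in zip(z, y, x)]
    return y


solution = douglas_rachford([2.0, -1.0])
```

Unrolling, in the sense used by the abstract, fixes the number of iterations and treats each one as a network layer whose quantities (in the paper, those derived from A, P, and P(2)) are learned from data.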

[129] Sphere-GAN: a GAN-based Approach for Saliency Estimation in 360° Videos

Mahmoud Z. A. Wahba,Sara Baldoni,Federica Battisti

Main category: cs.CV

TL;DR: Sphere-GAN uses a generative adversarial network with spherical convolutions as a new approach to saliency detection in 360° videos, and experiments show that it outperforms existing techniques.

Details Motivation: The rise of immersive applications increases the need to process 360° videos, for which saliency detection is key. Existing saliency methods mainly target 2D content, and little work addresses 360° video. The paper aims to fill this gap.

Contribution: Proposes Sphere-GAN, which combines spherical convolutions with a GAN for saliency detection in 360° videos and markedly improves accuracy.

Method: Uses a generative adversarial network (GAN) with spherical convolution layers to fit the particular geometry of 360° video.

Result: On a public 360° video saliency dataset, Sphere-GAN outperforms state-of-the-art models.

Insight: Spherical convolutions capture the spherical geometry of 360° video effectively and offer a network design idea for similar tasks.

Abstract: The recent success of immersive applications is pushing the research community to define new approaches to process 360° images and videos and optimize their transmission. Among these, saliency estimation provides a powerful tool that can be used to identify visually relevant areas and, consequently, adapt processing algorithms. Although saliency estimation has been widely investigated for 2D content, very few algorithms have been proposed for 360° saliency estimation. Towards this goal, we introduce Sphere-GAN, a saliency detection model for 360° videos that leverages a Generative Adversarial Network with spherical convolutions. Extensive experiments were conducted using a public 360° video saliency dataset, and the results demonstrate that Sphere-GAN outperforms state-of-the-art models in accurately predicting saliency maps.

[130] CLAIRE: A Dual Encoder Network with RIFT Loss and Phi-3 Small Language Model Based Interpretability for Cross-Modality Synthetic Aperture Radar and Optical Land Cover Segmentation

Debopom Sutradhar,Arefin Ittesafun Abian,Mohaimenul Azam Khan Raiaan,Reem E. Mohamed,Sheikh Izzal Azid,Sami Azam

Main category: cs.CV

TL;DR: The paper proposes CLAIRE, a dual encoder network for cross-modality land cover segmentation from optical and SAR imagery. It addresses class imbalance with an improved fusion mechanism and the hybrid loss RIFT, and adds the small language model Phi-3 for interpretability.

Details Motivation: The complexity of land cover classification, the similarity between classes, and dataset imbalance are the main challenges, calling for a method that fuses multimodal information and handles imbalance.

Contribution: 1. Proposes a dual encoder architecture with the cross-modality attention-fusion module CLAIRE; 2. Designs the hybrid loss function RIFT to address class imbalance; 3. Introduces the Phi-3 small language model to improve interpretability.

Method: 1. Extracts optical and SAR image features independently; 2. Fuses the multimodal features with the CLAIRE module; 3. Optimizes segmentation with the RIFT loss; 4. Uses Phi-3 to generate explanations of predictions.

Result: Performs strongly on multiple datasets, for example an mIoU of 56.02% and OA of 84.56% on WHU-OPT-SAR, and is robust under cloud-obstructed conditions.

Insight: Cross-modality feature fusion and class-imbalance handling are key, and a small language model can improve a model's transparency and practicality.

Abstract: Accurate land cover classification from satellite imagery is crucial in environmental monitoring and sustainable resource management. However, it remains challenging due to the complexity of natural landscapes, the visual similarity between classes, and the significant class imbalance in the available datasets. To address these issues, we propose a dual encoder architecture that independently extracts modality-specific features from optical and Synthetic Aperture Radar (SAR) imagery, which are then fused using a cross-modality attention-fusion module named Cross-modality Land cover segmentation with Attention and Imbalance-aware Reasoning-Enhanced Explanations (CLAIRE). This fusion mechanism highlights complementary spatial and textural features, enabling the network to better capture detailed and diverse land cover patterns. We incorporate a hybrid loss function named RIFT (Rare-Instance Focal-Tversky), which combines Weighted Focal Loss and Tversky Loss, to address class imbalance and improve segmentation performance across underrepresented categories. Our model achieves competitive performance across multiple benchmarks: a mean Intersection over Union (mIoU) of 56.02% and Overall Accuracy (OA) of 84.56% on the WHU-OPT-SAR dataset; strong generalization with an mIoU of 59.89% and OA of 73.92% on the OpenEarthMap-SAR dataset; and remarkable robustness under cloud-obstructed conditions, achieving an mIoU of 86.86% and OA of 94.58% on the PIE-RGB-SAR dataset. Additionally, we introduce a metric-driven reasoning module powered by a Small Language Model (Phi-3), which generates expert-level, sample-specific justifications for model predictions, thereby enhancing transparency and interpretability.
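
A hybrid of weighted focal loss and Tversky loss can be sketched for the binary, per-pixel case as below. The exact RIFT formulation is not given in the abstract, so the equal mixing weight and the `alpha`, `gamma`, `fp_weight`, and `fn_weight` values are illustrative assumptions only.

```python
import math


def focal_loss(probs, targets, alpha=0.75, gamma=2.0):
    """Class-weighted focal loss for binary targets, averaged over pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        pt = p if t == 1 else 1.0 - p           # probability given to the true class
        weight = alpha if t == 1 else 1.0 - alpha
        total += -weight * (1.0 - pt) ** gamma * math.log(max(pt, 1e-12))
    return total / len(probs)


def tversky_loss(probs, targets, fp_weight=0.3, fn_weight=0.7, eps=1e-7):
    """One minus the Tversky index; fn_weight > fp_weight penalises misses more."""
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    return 1.0 - (tp + eps) / (tp + fp_weight * fp + fn_weight * fn + eps)


def hybrid_loss(probs, targets, mix=0.5):
    # `mix` is a hypothetical balancing coefficient between the two terms.
    return mix * focal_loss(probs, targets) + (1.0 - mix) * tversky_loss(probs, targets)


targets = [1, 0, 0, 0]                 # one rare-class pixel among four
good = [0.9, 0.1, 0.1, 0.1]
bad = [0.2, 0.6, 0.1, 0.1]
loss_good = hybrid_loss(good, targets)
loss_bad = hybrid_loss(bad, targets)
```

The focal term down-weights pixels that are already classified confidently, while the Tversky term lets false negatives on a rare class cost more than false positives, which is why the pair is a common choice for imbalanced segmentation.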

[131] Lost in Embeddings: Information Loss in Vision-Language Models

Wenyan Li,Raphael Tang,Chengzu Li,Caiqi Zhang,Ivan Vulić,Anders Søgaard

Main category: cs.CV

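TL;DR: The paper studies the information loss that occurs in vision-language models (VLMs) when visual inputs are projected into the language model's embedding space, and proposes two methods to quantify and analyze this loss.

Details Motivation: When VLMs handle multimodal tasks, the projection step applied to visual inputs may lose information, but this phenomenon and its effect on model performance remain understudied.

Contribution: Proposes two complementary methods for quantifying the information loss caused by the projection step, and shows experimentally how this loss affects model performance.

Method: 1. Assess semantic information preservation by analyzing how k-nearest-neighbor relationships among image representations change before and after projection; 2. Reconstruct the visual embeddings from the projected representations to localize information loss at the image-patch level.

Result: The projection step substantially distorts the local geometry of visual representations: k-nearest neighbors diverge by 40–60%, which correlates with degraded retrieval performance. The reconstruction method also links instances where models struggle on visual question answering to regions of high information loss.

Insight: The paper exposes the information loss in the projection step of VLMs, providing a theoretical basis and practical methods for improving multimodal fusion.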

Abstract: Vision–language models (VLMs) often process visual inputs through a pretrained vision encoder, followed by a projection into the language model’s embedding space via a connector component. While crucial for modality fusion, the potential information loss induced by this projection step and its direct impact on model capabilities remain understudied. We introduce two complementary approaches to examine and quantify this loss by analyzing the latent representation space. First, we evaluate semantic information preservation by analyzing changes in k-nearest neighbor relationships between image representations, before and after projection. Second, we directly measure information loss by reconstructing visual embeddings from the projected representation, localizing loss at an image patch level. Experiments reveal that connectors substantially distort the local geometry of visual representations, with k-nearest neighbors diverging by 40–60% post-projection, correlating with degradation in retrieval performance. The patch-level embedding reconstruction provides interpretable insights for model behavior on visually grounded question-answering tasks, finding that areas of high information loss reliably predict instances where models struggle.
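
The k-nearest-neighbor comparison described above can be illustrated with a small sketch. It assumes Euclidean distance and a plain set-overlap score; the abstract does not specify the distance metric or the exact overlap measure, and the function names here are hypothetical.

```python
import math

def knn_indices(vectors, query_idx, k):
    """Indices of the k nearest neighbours of vectors[query_idx], excluding itself."""
    q = vectors[query_idx]
    others = [i for i in range(len(vectors)) if i != query_idx]
    others.sort(key=lambda i: math.dist(q, vectors[i]))
    return set(others[:k])

def knn_overlap(before, after, k):
    """Mean fraction of each item's k neighbours shared by both embedding spaces."""
    n = len(before)
    total = 0.0
    for i in range(n):
        shared = knn_indices(before, i, k) & knn_indices(after, i, k)
        total += len(shared) / k
    return total / n

# Toy example: the "projection" swaps the positions of items 1 and 2.
pre = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
post = [(0.0, 0.0), (5.0, 0.0), (1.0, 0.0), (6.0, 0.0)]
divergence = 1.0 - knn_overlap(pre, post, k=1)
```

Under this definition, a divergence of 0.4 to 0.6 would mean that roughly half of each image's nearest neighbors change after projection.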

[132] Learning to Generate 4D LiDAR Sequences

Ao Liang,Youquan Liu,Yu Yang,Dongyue Lu,Linfeng Li,Lingdong Kong,Huaici Zhao,Wei Tsang Ooi

Main category: cs.CV

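TL;DR: LiDARCrafter is a unified framework that generates editable 4D LiDAR sequences from free-form language input and achieves state-of-the-art fidelity, controllability, and temporal consistency on the nuScenes dataset.

Details Motivation: Although generative models have advanced video and occupancy-based data generation, LiDAR generation remains underexplored despite its importance for accurate 3D perception. Generating 4D LiDAR data raises new challenges in controllability, temporal stability, and evaluation.

Contribution: Proposes the LiDARCrafter framework, which converts free-form language into editable LiDAR sequences; develops the EvalSuite benchmark, which covers scene-, object-, and sequence-level metrics.

Method: 1. Parse instructions into ego-centric scene graphs; 2. Generate object layouts, trajectories, and shapes with a tri-branch diffusion model; 3. Generate the initial scan with a range-image diffusion model; 4. Extend it into a temporally coherent sequence with an autoregressive module.

Result: On the nuScenes dataset, LiDARCrafter achieves state-of-the-art fidelity, controllability, and temporal consistency.

Insight: Generating 4D LiDAR sequences from language input opens new possibilities for LiDAR-based simulation and data augmentation.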

Abstract: While generative world models have advanced video and occupancy-based data synthesis, LiDAR generation remains underexplored despite its importance for accurate 3D perception. Extending generation to 4D LiDAR data introduces challenges in controllability, temporal stability, and evaluation. We present LiDARCrafter, a unified framework that converts free-form language into editable LiDAR sequences. Instructions are parsed into ego-centric scene graphs, which a tri-branch diffusion model transforms into object layouts, trajectories, and shapes. A range-image diffusion model generates the initial scan, and an autoregressive module extends it into a temporally coherent sequence. The explicit layout design further supports object-level editing, such as insertion or relocation. To enable fair assessment, we provide EvalSuite, a benchmark spanning scene-, object-, and sequence-level metrics. On nuScenes, LiDARCrafter achieves state-of-the-art fidelity, controllability, and temporal consistency, offering a foundation for LiDAR-based simulation and data augmentation.

[133] Exploring Efficient Open-Vocabulary Segmentation in the Remote Sensing

Bingyu Li,Haocheng Dong,Da Zhang,Zhiyuan Zhao,Junyu Gao,Xuelong Li

Main category: cs.CV

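TL;DR: The paper proposes RSKT-Seg, an open-vocabulary segmentation framework designed for remote sensing. It establishes the standardized benchmark OVRSISBench, evaluates existing methods and reveals their limitations, and then designs the efficient RSKT-Seg framework, which clearly outperforms the baselines.

Details Motivation: Open-vocabulary remote sensing image segmentation (OVRSIS) lacks a unified evaluation benchmark and suffers from a domain gap between remote sensing images and natural images.

Contribution: 1. Establishes the standardized evaluation benchmark OVRSISBench; 2. Proposes the RSKT-Seg framework, which has three key modules: RS-CMA, RS-Fusion, and RS-Transfer.

Method: RSKT-Seg combines Multi-Directional Cost Map Aggregation (RS-CMA), Efficient Cost Map Fusion (RS-Fusion), and Remote Sensing Knowledge Transfer (RS-Transfer) modules with a lightweight dimensionality reduction strategy to achieve efficient segmentation.

Result: RSKT-Seg outperforms the baselines on the benchmark (+3.8 mIoU and +5.9 mACC) with 2x faster inference.

Insight: Segmentation in remote sensing needs domain-specific design; domain knowledge and the integration of efficient modules are critical for performance gains.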

Abstract: Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS), an emerging task that adapts Open-Vocabulary Segmentation (OVS) to the remote sensing (RS) domain, remains underexplored due to the absence of a unified evaluation benchmark and the domain gap between natural and RS images. To bridge these gaps, we first establish a standardized OVRSIS benchmark (OVRSISBench) based on widely-used RS segmentation datasets, enabling consistent evaluation across methods. Using this benchmark, we comprehensively evaluate several representative OVS/OVRSIS models and reveal their limitations when directly applied to remote sensing scenarios. Building on these insights, we propose RSKT-Seg, a novel open-vocabulary segmentation framework tailored for remote sensing. RSKT-Seg integrates three key components: (1) a Multi-Directional Cost Map Aggregation (RS-CMA) module that captures rotation-invariant visual cues by computing vision-language cosine similarities across multiple directions; (2) an Efficient Cost Map Fusion (RS-Fusion) transformer, which jointly models spatial and semantic dependencies with a lightweight dimensionality reduction strategy; and (3) a Remote Sensing Knowledge Transfer (RS-Transfer) module that injects pre-trained knowledge and facilitates domain adaptation via enhanced upsampling. Extensive experiments on the benchmark show that RSKT-Seg consistently outperforms strong OVS baselines by +3.8 mIoU and +5.9 mACC, while achieving 2x faster inference through efficient aggregation. Our code is available at https://github.com/LiBingyu01/RSKT-Seg.

[134] Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

Pu Jian,Junhong Wu,Wei Sun,Chen Wang,Shuo Ren,Jiajun Zhang

Main category: cs.CV

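TL;DR: This paper proposes Reflection-V, a new visual reasoning model (VRM) that strengthens visual reflection to address the decline of visual attention over long responses in existing models. The approach constructs vision-centered reasoning data and designs a visual-attention-based reward for reinforcement learning.

Details Motivation: In current visual reasoning models, attention to visual information fades quickly as responses get longer, which limits their capacity for visual reflection. The authors therefore aim to strengthen visual reflection to improve performance on visual reasoning tasks.

Contribution: 1. Proposes the Reflection-V model, which improves visual reasoning by strengthening visual reflection; 2. Constructs vision-centered reasoning data for cold-start learning; 3. Designs a visual-attention-based reward mechanism for reinforcement learning.

Method: 1. Construct vision-centered reasoning data with an agent-based interaction framework; 2. Introduce a visual-attention-based reward model during reinforcement learning to encourage reasoning that relies on visual information.

Result: Reflection-V significantly outperforms existing models on multiple visual reasoning benchmarks and relies on visual information more consistently during reasoning.

Insight: Visual reflection is essential for visual reasoning models, and it can be strengthened effectively through data construction and reward design.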

Abstract: Recent advances in text-only “slow-thinking” reasoning have prompted efforts to transfer this capability to vision-language models (VLMs), for training visual reasoning models (VRMs). However, such transfer faces critical challenges: effective “slow thinking” in VRMs requires visual reflection, the ability to check the reasoning process based on visual information. Through quantitative analysis, we observe that current VRMs exhibit limited visual reflection, as their attention to visual information diminishes rapidly with longer generated responses. To address this challenge, we propose a new VRM, Reflection-V, which enhances visual reflection based on reasoning data construction for cold-start and reward design for reinforcement learning (RL). Firstly, we construct vision-centered reasoning data by leveraging an agent that interacts between VLMs and reasoning LLMs, enabling cold-start learning of visual reflection patterns. Secondly, a visual attention based reward model is employed during RL to encourage reasoning based on visual information. Therefore, Reflection-V demonstrates significant improvements across multiple visual reasoning benchmarks. Furthermore, Reflection-V maintains a stronger and more consistent reliance on visual information during visual reasoning, indicating effective enhancement in visual reflection capabilities.

[135] A Computer Vision Pipeline for Individual-Level Behavior Analysis: Benchmarking on the Edinburgh Pig Dataset

Haiyu Yang,Enhong Liu,Jennifer Sun,Sumit Sharma,Meike van Leerdam,Sebastien Franceschini,Puchun Niu,Miel Hostens

Main category: cs.CV

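TL;DR: The paper proposes a modular computer vision pipeline for automated animal behavior analysis, targeting pigs in group housing. It combines zero-shot object detection, motion-aware tracking and segmentation, and vision-transformer feature extraction, and significantly improves behavior recognition accuracy.

Details Motivation: Traditional manual observation is time-consuming, subjective, and hard to scale, whereas automated behavior analysis can offer a more efficient solution for animal welfare and productivity in agricultural settings.

Contribution: 1. Develops a modular computer vision pipeline that combines several advanced techniques to analyze the behavior of group-housed pigs accurately.
2. Validates the system on the Edinburgh Pig Behavior Video Dataset, where it clearly outperforms existing methods (94.2% accuracy).
3. Provides an open-source implementation, offering an automated, objective, and continuous analysis tool for precision pig farming and welfare assessment.

Method: 1. Use a zero-shot object detection model to identify and localize animals.
2. Combine motion-aware tracking and segmentation to handle occlusions and group scenes.
3. Extract high-level features with a vision transformer for behavior recognition.

Result: 1. The temporal model reaches 94.2% overall accuracy, a 21.2 percentage point improvement over existing methods.
2. Tracking is robust, with a 93.3% identity preservation score and 89.3% object detection precision.

Insight: The modular design and open-source implementation allow the system to be adapted to other settings and species, although further cross-species validation is required. The technique offers an efficient, objective, and scalable solution for animal behavior analysis.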

Abstract: Animal behavior analysis plays a crucial role in understanding animal welfare, health status, and productivity in agricultural settings. However, traditional manual observation methods are time-consuming, subjective, and limited in scalability. We present a modular pipeline that leverages open-sourced state-of-the-art computer vision techniques to automate animal behavior analysis in a group housing environment. Our approach combines state-of-the-art models for zero-shot object detection, motion-aware tracking and segmentation, and advanced feature extraction using vision transformers for robust behavior recognition. The pipeline addresses challenges including animal occlusions and group housing scenarios as demonstrated in indoor pig monitoring. We validated our system on the Edinburgh Pig Behavior Video Dataset for multiple behavioral tasks. Our temporal model achieved 94.2% overall accuracy, representing a 21.2 percentage point improvement over existing methods. The pipeline demonstrated robust tracking capabilities with 93.3% identity preservation score and 89.3% object detection precision. The modular design suggests potential for adaptation to other contexts, though further validation across species would be required. The open-source implementation provides a scalable solution for behavior monitoring, contributing to precision pig farming and welfare assessment through automated, objective, and continuous analysis.

[136] AvatarSync: Rethinking Talking-Head Animation through Autoregressive Perspective

Yuchen Deng,Xiuyang Wu,Hai-Tao Zheng,Suiyang Zhang,Yi He,Yuxing Han

Main category: cs.CV

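TL;DR: AvatarSync is an autoregressive talking-head animation framework built on phoneme representations. Its two-stage generation strategy (keyframe generation followed by inter-frame interpolation) addresses the inter-frame flicker, identity drift, and slow inference of existing methods, achieving high visual fidelity, temporal consistency, and computational efficiency.

Details Motivation: Existing talking-head animation methods based on GANs or diffusion models suffer from inter-frame flicker, identity drift, and slow inference, which limits their practical use. AvatarSync aims to overcome these limitations with an autoregressive framework and a two-stage generation strategy.

Contribution: 1. Proposes AvatarSync, an autoregressive talking-head animation framework built on phoneme representations. 2. Adopts a two-stage generation strategy that decouples semantic modeling from visual dynamics. 3. Introduces a timestamp-aware adaptive strategy and optimizes the inference pipeline to reduce latency.

Method: 1. Stage one (FKG): generate facial keyframes from phoneme-level semantic representations, combined with a Text-Frame Causal Attention Mask. 2. Stage two: perform inter-frame interpolation based on a selective state space model, emphasizing temporal coherence and visual smoothness.

Result: AvatarSync outperforms existing methods in visual fidelity, temporal consistency, and computational efficiency, providing a scalable and controllable solution.

Insight: The two-stage generation strategy (decoupling semantics from dynamics) and timestamp-aware adaptive interpolation are key to improving talking-head animation quality.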

Abstract: Existing talking-head animation approaches based on Generative Adversarial Networks (GANs) or diffusion models often suffer from inter-frame flicker, identity drift, and slow inference. These limitations inherent to their video generation pipelines restrict their suitability for applications. To address this, we introduce AvatarSync, an autoregressive framework on phoneme representations that generates realistic and controllable talking-head animations from a single reference image, driven directly by text or audio input. In addition, AvatarSync adopts a two-stage generation strategy, decoupling semantic modeling from visual dynamics, which is a deliberate “Divide and Conquer” design. The first stage, Facial Keyframe Generation (FKG), focuses on phoneme-level semantic representation by leveraging the many-to-one mapping from text or audio to phonemes. A Phoneme-to-Visual Mapping is constructed to anchor abstract phonemes to character-level units. Combined with a customized Text-Frame Causal Attention Mask, the keyframes are generated. The second stage, inter-frame interpolation, emphasizes temporal coherence and visual smoothness. We introduce a timestamp-aware adaptive strategy based on a selective state space model, enabling efficient bidirectional context reasoning. To support deployment, we optimize the inference pipeline to reduce latency without compromising visual fidelity. Extensive experiments show that AvatarSync outperforms existing talking-head animation methods in visual fidelity, temporal consistency, and computational efficiency, providing a scalable and controllable solution.

[137] End-to-End Learning of Multi-Organ Implicit Surfaces from 3D Medical Imaging Data

Farahdiba Zarin,Nicolas Padoy,Jérémy Dana,Vinkle Srivastav

Main category: cs.CV

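TL;DR: ImplMORe is an end-to-end deep learning method that reconstructs multiple organs from 3D medical images using implicit surface representations. It extracts local features with a 3D CNN encoder and uses multi-scale interpolation to learn features in the continuous domain, outperforming discrete explicit representation methods.

Details Motivation: Fine-grained surface reconstruction of organs from medical images is usually limited by resolution, and higher resolution requires more memory and compute. Existing implicit methods from general computer vision cannot be applied directly to medical images, and an implicit representation adapted to this setting can address the problem.

Contribution: Proposes ImplMORe, applying implicit representations to multi-organ 3D reconstruction for the first time; through multi-scale interpolation and continuous-domain feature learning, it reconstructs fine-grained surfaces beyond the resolution of the input image.

Method: Extract local features with a 3D CNN encoder, learn features in the continuous domain through multi-scale interpolation, and represent surfaces implicitly with occupancy functions.

Result: On the totalsegmentator dataset, ImplMORe outperforms discrete explicit representation methods and produces fine-grained organ surfaces at a resolution higher than that of the input image.

Insight: Implicit representations show promise for medical image reconstruction; combining continuous-domain feature learning with multi-scale interpolation effectively improves the level of surface detail.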

Abstract: The fine-grained surface reconstruction of different organs from 3D medical imaging can provide advanced diagnostic support and improved surgical planning. However, the representation of the organs is often limited by the resolution, with a detailed higher resolution requiring more memory and computing footprint. Implicit representations of objects have been proposed to alleviate this problem in general computer vision by providing compact and differentiable functions to represent the 3D object shapes. However, architectural and data-related differences prevent the direct application of these methods to medical images. This work introduces ImplMORe, an end-to-end deep learning method using implicit surface representations for multi-organ reconstruction from 3D medical images. ImplMORe incorporates local features using a 3D CNN encoder and performs multi-scale interpolation to learn the features in the continuous domain using occupancy functions. We apply our method for single and multiple organ reconstructions using the totalsegmentator dataset. By leveraging the continuous nature of occupancy functions, our approach outperforms the discrete explicit representation based surface reconstruction approaches, providing fine-grained surface details of the organ at a resolution higher than the given input image. The source code will be made publicly available at: https://github.com/CAMMA-public/ImplMORe
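
To illustrate why a continuous occupancy function can yield surfaces finer than the voxel grid, the sketch below trilinearly interpolates a coarse grid at arbitrary real-valued coordinates and thresholds the result. This is a simplification under stated assumptions: the grid holds scalar occupancy values directly, standing in for ImplMORe's learned CNN features and decoder, which the abstract does not detail. The function names and the 0.5 threshold are hypothetical.

```python
def trilinear_sample(grid, x, y, z):
    """Interpolate a scalar grid (indexed grid[i][j][k]) at continuous coordinates.

    Coordinates are in voxel units and must lie within the grid bounds.
    Each grid dimension needs at least two samples.
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    x0 = min(int(x), nx - 2)
    y0 = min(int(y), ny - 2)
    z0 = min(int(z), nz - 2)
    tx, ty, tz = x - x0, y - y0, z - z0
    value = 0.0
    # Blend the 8 surrounding grid corners by their trilinear weights.
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((tx if dx else 1.0 - tx)
                     * (ty if dy else 1.0 - ty)
                     * (tz if dz else 1.0 - tz))
                value += w * grid[x0 + dx][y0 + dy][z0 + dz]
    return value

def is_inside(grid, point, threshold=0.5):
    """Occupancy query: True if the interpolated value reaches the threshold."""
    return trilinear_sample(grid, *point) >= threshold

# A 2x2x2 grid that is empty at x=0 and occupied at x=1.
toy_grid = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
```

Because the query point is continuous, the surface (the set of points where the value crosses the threshold) can be sampled as densely as desired, independent of the input voxel spacing.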

[138] Progressive Flow-inspired Unfolding for Spectral Compressive Imaging

Xiaodong Wang,Ping Wang,Zijun He,Mengjie Qin,Xin Yuan

Main category: cs.CV

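TL;DR: The paper proposes a Progressive Flow-inspired Unfolding framework for coded aperture snapshot spectral imaging (CASSI) reconstruction; by controlling the reconstruction trajectory, it achieves a smooth, continuous optimization path.

Details Motivation: Existing deep unfolding networks (DUNs) for CASSI reconstruction have uncontrollable reconstruction trajectories, which leads to abrupt quality jumps and non-gradual refinement and hurts reconstruction quality and stability.

Contribution: 1. Proposes a trajectory-controllable unfolding framework that enforces a smooth optimization path from the initial estimate to a high-quality reconstruction;
2. Designs an efficient spatial-spectral Transformer and a frequency-domain fusion module to improve reconstruction efficiency and quality.

Method: 1. Design a progressive flow unfolding framework inspired by diffusion trajectories and flow matching;
2. Introduce a spatial-spectral Transformer and a frequency-domain fusion module to keep features consistent.

Result: Experiments on simulated and real data show that the method surpasses prior state-of-the-art approaches in reconstruction quality and efficiency.

Insight: Controlling the smoothness of the reconstruction trajectory effectively improves the quality and stability of CASSI reconstruction and offers a new optimization perspective for hyperspectral image reconstruction.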

Abstract: Coded aperture snapshot spectral imaging (CASSI) retrieves a 3D hyperspectral image (HSI) from a single 2D compressed measurement, which is a highly challenging reconstruction task. Recent deep unfolding networks (DUNs), empowered by explicit data-fidelity updates and implicit deep denoisers, have achieved the state of the art in CASSI reconstruction. However, existing unfolding approaches suffer from uncontrollable reconstruction trajectories, leading to abrupt quality jumps and non-gradual refinement across stages. Inspired by diffusion trajectories and flow matching, we propose a novel trajectory-controllable unfolding framework that enforces smooth, continuous optimization paths from noisy initial estimates to high-quality reconstructions. To achieve computational efficiency, we design an efficient spatial-spectral Transformer tailored for hyperspectral reconstruction, along with a frequency-domain fusion module to guarantee feature consistency. Experiments on simulation and real data demonstrate that our method achieves better reconstruction quality and efficiency than prior state-of-the-art approaches.

[139] FS-SAM2: Adapting Segment Anything Model 2 for Few-Shot Semantic Segmentation via Low-Rank Adaptation

Bernardo Forni,Gabriele Lombardi,Federico Pozzi,Mirco Planamente

Main category: cs.CV

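TL;DR: FS-SAM2 adapts Segment Anything Model 2 (SAM2) with Low-Rank Adaptation (LoRA), extending its zero-shot segmentation ability to few-shot semantic segmentation with strong performance and good computational efficiency.

Details Motivation: Few-shot semantic segmentation aims to segment unseen classes from only a few annotated samples. Existing methods usually train an additional module from scratch on large-scale datasets, whereas SAM2, as a foundation model, already has strong zero-shot segmentation capabilities. This paper aims to exploit SAM2's modular design directly and adapt it to the few-shot task through low-rank adaptation.

Contribution: 1) Proposes FS-SAM2, which repurposes SAM2's video segmentation capabilities directly for the few-shot task; 2) Introduces LoRA so that tuning only a small number of parameters is enough to handle diverse image data; 3) Supports any K-shot configuration and demonstrates high performance and efficiency on multiple datasets.

Method: 1) Exploit SAM2's modular design by applying its video segmentation modules directly to the few-shot task; 2) Apply LoRA to SAM2's original modules so that they handle the non-temporal images of standard datasets; 3) Meta-train only a small number of parameters, preserving SAM2's segmentation performance.

Result: Achieves strong results on the PASCAL-5$^i$, COCO-20$^i$, and FSS-1000 datasets, with excellent computational efficiency at inference.

Insight: Combining a foundation model's modularity and zero-shot ability with low-rank adaptation enables efficient transfer to few-shot tasks while maintaining high performance.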

Abstract: Few-shot semantic segmentation has recently attracted great attention. The goal is to develop a model capable of segmenting unseen classes using only a few annotated samples. Most existing approaches adapt a pre-trained model by training from scratch an additional module. Achieving optimal performance with these approaches requires extensive training on large-scale datasets. The Segment Anything Model 2 (SAM2) is a foundational model for zero-shot image and video segmentation with a modular design. In this paper, we propose a Few-Shot segmentation method based on SAM2 (FS-SAM2), where SAM2’s video capabilities are directly repurposed for the few-shot task. Moreover, we apply a Low-Rank Adaptation (LoRA) to the original modules in order to handle the diverse images typically found in standard datasets, unlike the temporally connected frames used in SAM2’s pre-training. With this approach, only a small number of parameters is meta-trained, which effectively adapts SAM2 while benefiting from its impressive segmentation performance. Our method supports any K-shot configuration. We evaluate FS-SAM2 on the PASCAL-5$^i$, COCO-20$^i$ and FSS-1000 datasets, achieving remarkable results and demonstrating excellent computational efficiency during inference. Code is available at https://github.com/fornib/FS-SAM2
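
As a reminder of what Low-Rank Adaptation does to a single linear layer, here is a minimal sketch of the standard LoRA forward pass, y = Wx + (alpha/r)·B(Ax), where W is frozen and only A and B are trained. It is a generic illustration in plain Python, not FS-SAM2's code; the matrices and `alpha` value are made up for the example.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """Frozen weight W (d_out x d_in) plus low-rank update B (d_out x r) @ A (r x d_in)."""
    r = len(A)                               # rank of the adapter
    base = matvec(W, x)                      # frozen pre-trained path
    update = matvec(B, matvec(A, x))         # trainable low-rank path
    return [b + (alpha / r) * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]                 # frozen 2x2 weight
A = [[1.0, 1.0]]                             # rank-1 down-projection
B_zero = [[0.0], [0.0]]                      # LoRA's usual zero init for B
B_trained = [[1.0], [2.0]]                   # hypothetical values after training
x = [1.0, 2.0]
```

With `B` initialized to zero the adapted layer reproduces the pre-trained output exactly, so fine-tuning starts from the original model's behavior.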

[140] RailSafeNet: Visual Scene Understanding for Tram Safety

Ing. Ondrej Valach,Ing. Ivan Gruber

Main category: cs.CV

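TL;DR: RailSafeNet is a real-time deep learning framework that combines semantic segmentation and object detection with a rule-based Distance Assessor to detect track intrusions and improve the safety of tram-pedestrian interaction.

Details Motivation: Tram-pedestrian interaction is an increasingly important safety problem, especially in densely populated areas. Traditional methods struggle to meet real-time and accuracy requirements, which calls for a lightweight vision-based solution.

Contribution: RailSafeNet combines semantic segmentation (SegFormer B3), object detection (YOLOv8), and a rule-based Distance Assessor to monitor track intrusions in real time and classify their risk accurately.

Method: From monocular video, the system first identifies the rails with semantic segmentation, then localizes nearby objects with object detection, and finally assesses distance-based risk with the rule-based assessor.

Result: On the RailSem19 dataset, SegFormer B3 reaches 65% IoU and YOLOv8 reaches 75.6% mAP@0.5, supporting the effectiveness of the system.

Insight: A lightweight framework can achieve accurate scene understanding without relying on heavy annotation, offering a practical route to real-time safety warnings.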

Abstract: Tram-human interaction safety is an important challenge, given that trams frequently operate in densely populated areas, where collisions can range from minor injuries to fatal outcomes. This paper addresses the issue from the perspective of designing a solution leveraging digital image processing, deep learning, and artificial intelligence to improve the safety of pedestrians, drivers, cyclists, pets, and tram passengers. We present RailSafeNet, a real-time framework that fuses semantic segmentation, object detection and a rule-based Distance Assessor to highlight track intrusions. Using only monocular video, the system identifies rails, localises nearby objects and classifies their risk by comparing projected distances with the standard 1435mm rail gauge. Experiments on the diverse RailSem19 dataset show that a class-filtered SegFormer B3 model achieves 65% intersection-over-union (IoU), while a fine-tuned YOLOv8 attains 75.6% mean average precision (mAP) calculated at an intersection over union (IoU) threshold of 0.50. RailSafeNet therefore delivers accurate, annotation-light scene understanding that can warn drivers before dangerous situations escalate. Code available at https://github.com/oValach/RailSafeNet.
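
The idea of using the 1435 mm rail gauge as a scale reference can be sketched as follows. This is a hypothetical, simplified rule, not the paper's Distance Assessor: it assumes the rail positions and the object's ground contact point are measured on the same image row, and the 1000 mm warning threshold and the risk labels are illustrative.

```python
RAIL_GAUGE_MM = 1435.0  # standard gauge quoted in the abstract

def mm_per_pixel(left_rail_x, right_rail_x):
    """Local image scale, derived from the known gauge and the rails' pixel spacing."""
    gauge_px = abs(right_rail_x - left_rail_x)
    if gauge_px == 0:
        raise ValueError("rails coincide at this image row")
    return RAIL_GAUGE_MM / gauge_px

def classify_risk(object_x, left_rail_x, right_rail_x, warn_mm=1000.0):
    """Label an object by its lateral distance from the track (hypothetical rule)."""
    lo, hi = sorted((left_rail_x, right_rail_x))
    if lo <= object_x <= hi:
        return "on_track"
    scale = mm_per_pixel(lo, hi)
    dist_px = lo - object_x if object_x < lo else object_x - hi
    return "warning" if dist_px * scale <= warn_mm else "clear"
```

For example, rails detected at x = 400 and x = 600 pixels give 1435 / 200 = 7.175 mm per pixel on that row, so an object 50 pixels outside the right rail is about 359 mm from the track.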

[141] 3DViT-GAT: A Unified Atlas-Based 3D Vision Transformer and Graph Learning Framework for Major Depressive Disorder Detection Using Structural MRI Data

Nojod M. Alotaibi,Areej M. Alhothali,Manar S. Ali

Main category: cs.CV

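TL;DR: The paper proposes a unified framework that combines a 3D Vision Transformer with a Graph Neural Network for automated detection of major depressive disorder from sMRI data. Features are extracted under two region-definition strategies (atlas-based and cube-based) and classified with a GNN; experiments show the atlas-based approach performs better.

Details Motivation: Major depressive disorder is a common mental health condition that seriously affects individuals and society. Existing sMRI-based deep learning methods are usually limited to voxel-level features or features from predefined regions, which makes it hard to capture complex brain patterns. This paper aims to combine the strengths of ViTs and GNNs to improve diagnostic accuracy.

Contribution: 1. Proposes a unified 3D ViT and GNN framework with two region-definition methods, atlas-based and cube-based; 2. Models inter-regional relationships with cosine similarity graphs; 3. Shows experimentally that the atlas-based approach outperforms the cube-based one.

Method: 1. Extract region embeddings from sMRI data with a 3D ViT; 2. Use two region-definition strategies (atlas-based and cube-based); 3. Build cosine similarity graphs for GNN classification.

Result: On the REST-meta-MDD dataset, the best model reaches 78.98% accuracy, 76.54% sensitivity, and 81.58% specificity, with the atlas-based approach performing better.

Insight: 1. Combining ViTs and GNNs models complex brain patterns more effectively; 2. Atlas-based methods built on anatomical priors are better suited to depression detection.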

Abstract: Major depressive disorder (MDD) is a prevalent mental health condition that negatively impacts both individual well-being and global public health. Automated detection of MDD using structural magnetic resonance imaging (sMRI) and deep learning (DL) methods holds increasing promise for improving diagnostic accuracy and enabling early intervention. Most existing methods employ either voxel-level features or handcrafted regional representations built from predefined brain atlases, limiting their ability to capture complex brain patterns. This paper develops a unified pipeline that utilizes Vision Transformers (ViTs) for extracting 3D region embeddings from sMRI data and a Graph Neural Network (GNN) for classification. We explore two strategies for defining regions: (1) an atlas-based approach using predefined structural and functional brain atlases, and (2) a cube-based method by which ViTs are trained directly to identify regions from uniformly extracted 3D patches. Further, cosine similarity graphs are generated to model interregional relationships and guide GNN-based classification. Extensive experiments were conducted using the REST-meta-MDD dataset to demonstrate the effectiveness of our model. With stratified 10-fold cross-validation, the best model obtained 78.98% accuracy, 76.54% sensitivity, 81.58% specificity, 81.58% precision, and 78.98% F1-score. Further, atlas-based models consistently outperformed the cube-based approach, highlighting the importance of using domain-specific anatomical priors for MDD detection.
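
A cosine similarity graph over region embeddings can be sketched as below. The abstract does not say how edges are selected, so the fixed threshold is an assumption made for illustration; the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_graph(embeddings, threshold=0.5):
    """Weighted adjacency matrix; an edge is kept when similarity >= threshold."""
    n = len(embeddings)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine_similarity(embeddings[i], embeddings[j])
            if s >= threshold:
                adj[i][j] = adj[j][i] = s     # undirected graph
    return adj

# Three toy "region embeddings": the first two point the same way.
regions = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
graph = similarity_graph(regions)
```

Each node corresponds to a brain region and the resulting adjacency matrix, together with the embeddings as node features, is the kind of input a GNN classifier consumes.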

[142] Open-ended Hierarchical Streaming Video Understanding with Vision Language Models

Hyolim Kang,Yunsu Park,Youngbeom Yoo,Yeeun Choi,Seon Joo Kim

Main category: cs.CV

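TL;DR: The paper introduces the Hierarchical Streaming Video Understanding task, which combines online temporal action localization with free-form description generation, and improves streaming action perception with the OpenHOUSE system.

Details Motivation: To address the scarcity of datasets with hierarchical, fine-grained temporal annotations, and to explore the use of generative models in streaming action perception.

Contribution: 1) Shows that LLMs can group atomic actions into higher-level events, enriching existing datasets; 2) Designs the OpenHOUSE system, which markedly improves boundary detection between adjacent actions.

Method: 1) Group actions with LLMs; 2) Develop a specialized streaming module that improves action boundary detection.

Result: OpenHOUSE nearly doubles action boundary detection performance relative to direct extensions of existing methods.

Insight: Integrating generative models such as LLMs into streaming video understanding is a promising direction, and OpenHOUSE is a key step toward it.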

Abstract: We introduce Hierarchical Streaming Video Understanding, a task that combines online temporal action localization with free-form description generation. Given the scarcity of datasets with hierarchical and fine-grained temporal annotations, we demonstrate that LLMs can effectively group atomic actions into higher-level events, enriching existing datasets. We then propose OpenHOUSE (Open-ended Hierarchical Online Understanding System for Events), which extends streaming action perception beyond action classification. OpenHOUSE features a specialized streaming module that accurately detects boundaries between closely adjacent actions, nearly doubling the performance of direct extensions of existing methods. We envision the future of streaming action perception in the integration of powerful generative models, with OpenHOUSE representing a key step in that direction.

[143] Multi Anatomy X-Ray Foundation Model

Nishank Singla,Krisztian Koos,Farzin Haddadpour,Amin Honarmandi Shandiz,Lovish Chum,Xiaojian Xu,Qing Jin,Erhan Bas

Main category: cs.CV

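TL;DR: XR-0 is a multi-anatomy X-ray foundation model trained with self-supervised learning on a large-scale dataset covering many anatomical regions; it achieves state-of-the-art performance on multiple downstream tasks.

Details Motivation: Most existing AI foundation models are limited to chest anatomy and do not generalize to broader clinical tasks. To address this, the authors propose a multi-anatomy X-ray foundation model with better generality and robustness.

Contribution: 1) Proposes XR-0, a multi-anatomy X-ray foundation model; 2) Applies self-supervised learning on a private dataset covering diverse anatomical regions; 3) Evaluates the model comprehensively on 12 datasets and 20 downstream tasks, demonstrating strong performance.

Method: 1) Train the model with self-supervised learning on a large dataset (1.15 million images); 2) Cover diverse anatomical regions to increase diversity; 3) Evaluate on a range of tasks (classification, retrieval, segmentation, localization, visual grounding, and report generation).

Result: XR-0 achieves state-of-the-art performance on most multi-anatomy tasks while remaining competitive on chest-specific benchmarks.

Insight: Anatomical diversity and supervision are critical for building robust, general-purpose medical vision models, laying the groundwork for scalable and adaptable AI systems in radiology.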

Abstract: X-ray imaging is ubiquitous in radiology, yet most existing AI foundation models are limited to chest anatomy and fail to generalize across broader clinical tasks. In this work, we introduce XR-0, a multi-anatomy X-ray foundation model trained with self-supervised learning on a large, private dataset of 1.15 million images spanning diverse anatomical regions and evaluated across 12 datasets and 20 downstream tasks, including classification, retrieval, segmentation, localization, visual grounding, and report generation. XR-0 achieves state-of-the-art performance on most multi-anatomy tasks and remains competitive on chest-specific benchmarks. Our results demonstrate that anatomical diversity and supervision are critical for building robust, general-purpose medical vision models, paving the way for scalable and adaptable AI systems in radiology.

[144] LoRA-fine-tuned Large Vision Models for Automated Assessment of Post-SBRT Lung Injury

M. Bolhassani,B. Veasey,E. Daugherty,S. Keltner,N. Kumar,N. Dunlap,A. Amini

Main category: cs.CV

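TL;DR: This paper studies fine-tuning large vision models (DinoV2 and SwinV2) with LoRA (Low-Rank Adaptation) to diagnose radiation-induced lung injury (RILI) after SBRT. Compared with traditional full fine-tuning and inference-only methods, LoRA achieves comparable or better performance while significantly reducing computational cost and training time.

Details Motivation: The study aims to evaluate how well LoRA works for fine-tuning large vision models, in particular how to reach high performance efficiently with fewer computational resources in medical image diagnosis.

Contribution: Validates the effectiveness of LoRA on a medical image diagnosis task and demonstrates its advantages in reducing computational cost and training time.

Method: Fine-tunes two large vision models, DinoV2 and SwinV2, with LoRA and compares against traditional full fine-tuning and inference-only methods. Experiments use cropped images of two sizes (50 mm$^3$ and 75 mm$^3$) and different adaptation techniques.

Result: LoRA matches or exceeds full fine-tuning in performance while significantly reducing computational resources and training time.

Insight: LoRA is an efficient alternative for resource-constrained medical image diagnosis tasks that preserves model performance.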

Abstract: This study investigates the efficacy of Low-Rank Adaptation (LoRA) for fine-tuning large Vision Models, DinoV2 and SwinV2, to diagnose Radiation-Induced Lung Injury (RILI) from X-ray CT scans following Stereotactic Body Radiation Therapy (SBRT). To evaluate the robustness and efficiency of this approach, we compare LoRA with traditional full fine-tuning and inference-only (no fine-tuning) methods. Cropped images of two sizes (50 mm$^3$ and 75 mm$^3$), centered at the treatment isocenter, in addition to different adaptation techniques for adapting the 2D LVMs for 3D data were used to determine the sensitivity of the models to spatial context. Experimental results show that LoRA achieves comparable or superior performance to traditional fine-tuning while significantly reducing computational costs and training times by requiring fewer trainable parameters.
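
The claim that LoRA requires fewer trainable parameters follows from simple arithmetic on one linear layer: full fine-tuning updates d_in × d_out weights, while a rank-r adapter trains r × (d_in + d_out). The sketch below works this out for a hypothetical 768 × 768 layer at rank 8; these sizes are assumptions for illustration and are not taken from the paper.

```python
def full_finetune_params(d_in, d_out):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable weights in a rank-r adapter: A is (r x d_in), B is (d_out x r)."""
    return rank * (d_in + d_out)

# Hypothetical layer size and rank, chosen only to make the ratio concrete.
full = full_finetune_params(768, 768)
lora = lora_params(768, 768, rank=8)
fraction = lora / full
```

In this example the adapter trains 12,288 weights against 589,824 for the full matrix, about 2% of the layer.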

[145] HoloGarment: 360° Novel View Synthesis of In-the-Wild Garments

Johanna Karras,Yingwei Li,Yasamin Jafarian,Ira Kemelmacher-Shlizerman

Main category: cs.CV

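TL;DR: HoloGarment proposes a new method that combines real video data with synthetic 3D data to generate 360-degree novel views of garments, handling the occlusions and complex poses found in real-world scenes.

Details Motivation: Existing methods rely on synthetic data and generalize poorly to real-world garments, especially under occlusion, complex poses, and cloth deformation. HoloGarment aims to bridge this domain gap.

Contribution: 1. Proposes an implicit training paradigm that combines large-scale real video data with small-scale synthetic 3D data. 2. Uses a shared embedding space to enable dynamic video-to-360-degree novel view synthesis. 3. Introduces a garment “atlas” representation that captures geometry and texture across viewpoints.

Method: HoloGarment optimizes garment representations in a shared embedding space and uses a garment atlas to capture multi-view information, generating 360-degree novel views from minimal input (1-3 images or a video).

Result: Experiments show that HoloGarment outperforms existing methods on novel view synthesis of real garments, robustly handling wrinkles, pose variation, and occlusion while maintaining photorealism and consistency.

Insight: Combining real and synthetic data helps with complex real-world conditions, and the shared embedding space offers a new approach to fusing heterogeneous data sources.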

Abstract: Novel view synthesis (NVS) of in-the-wild garments is a challenging task due to significant occlusions, complex human poses, and cloth deformations. Prior methods rely on synthetic 3D training data consisting of mostly unoccluded and static objects, leading to poor generalization on real-world clothing. In this paper, we propose HoloGarment (Hologram-Garment), a method that takes 1-3 images or a continuous video of a person wearing a garment and generates 360° novel views of the garment in a canonical pose. Our key insight is to bridge the domain gap between real and synthetic data with a novel implicit training paradigm leveraging a combination of large-scale real video data and small-scale synthetic 3D data to optimize a shared garment embedding space. During inference, the shared embedding space further enables dynamic video-to-360° NVS through the construction of a garment “atlas” representation by finetuning a garment embedding on a specific real-world video. The atlas captures garment-specific geometry and texture across all viewpoints, independent of body pose or motion. Extensive experiments show that HoloGarment achieves state-of-the-art performance on NVS of in-the-wild garments from images and videos. Notably, our method robustly handles challenging real-world artifacts – such as wrinkling, pose variation, and occlusion – while maintaining photorealism, view consistency, fine texture details, and accurate geometry. Visit our project page for additional results: https://johannakarras.github.io/HoloGarment

[146] Domain-Adaptive Pretraining Improves Primate Behavior Recognition

Felix B. Mueller,Timo Lueddecke,Richard Vogg,Alexander S. Ecker

Main category: cs.CV

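TL;DR: The paper improves primate behavior recognition through domain-adaptive pretraining (DAP), using self-supervised learning on unlabeled data to raise action recognition performance considerably.

Details Motivation: Computer vision tools for animal behavior research need data-efficient learning methods because labeling large-scale data is expensive.

Contribution: Proposes a domain-adaptive pretraining (DAP) approach that considerably improves primate behavior recognition without requiring labeled samples.

Method: Start from a pretrained V-JEPA model and apply domain-adaptive pretraining (DAP), i.e. continue pretraining on unlabeled in-domain data.

Result: On the PanAf and ChimpACT datasets, the method outperforms the published state-of-the-art action recognition models by 6.1 percentage points in accuracy and 6.3 percentage points in mAP, respectively.

Insight: Domain-adaptive pretraining (DAP) accounts for most of the performance gain and has broad potential wherever unlabeled in-domain data is available.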

Abstract: Computer vision for animal behavior offers promising tools to aid research in ecology, cognition, and to support conservation efforts. Video camera traps allow for large-scale data collection, but high labeling costs remain a bottleneck to creating large-scale datasets. We thus need data-efficient learning approaches. In this work, we show that we can utilize self-supervised learning to considerably improve action recognition on primate behavior. On two datasets of great ape behavior (PanAf and ChimpACT), we outperform published state-of-the-art action recognition models by 6.1 %pt. accuracy and 6.3 %pt. mAP, respectively. We achieve this by utilizing a pretrained V-JEPA model and applying domain-adaptive pretraining (DAP), i.e. continuing the pretraining with in-domain data. We show that most of the performance gain stems from the DAP. Our method promises great potential for improving the recognition of animal behavior, as DAP does not require labeled samples. Code is available at https://github.com/ecker-lab/dap-behavior

[147] OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling

Yang Zhou,Yifan Wang,Jianjun Zhou,Wenzheng Chang,Haoyu Guo,Zizun Li,Kaijing Ma,Xinyue Li,Yating Wang,Haoyi Zhu,Mingyu Liu,Dingning Liu,Jiange Yang,Zhoujie Fu,Junyi Chen,Chunhua Shen,Jiangmiao Pang,Kaipeng Zhang,Tong He

Main category: cs.CV

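TL;DR: OmniWorld is a large-scale, multi-domain, multi-modal dataset designed for 4D world modeling; it addresses the shortcomings of existing data in dynamic complexity, diversity, and spatial-temporal annotation.

Details Motivation: Existing data for 4D world modeling lacks dynamic complexity, multi-domain diversity, and spatial-temporal annotations, which limits the development of general 4D world models.

Contribution: Introduces the OmniWorld dataset, which comprises the newly collected OmniWorld-Game and several public datasets, and supports tasks such as 4D reconstruction, future prediction, and camera-control video generation.

Method: Integrate multi-domain, multi-modal data into a dataset with rich dynamic interactions and spatial-temporal annotations, and establish a benchmark on top of it.

Result: Fine-tuning existing SOTA methods on OmniWorld leads to significant performance gains on 4D reconstruction and video generation tasks.

Insight: High-quality datasets are key to advancing general-purpose 4D world models, and OmniWorld provides an important resource for this.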

Abstract: The field of 4D world modeling - aiming to jointly capture spatial geometry and temporal dynamics - has witnessed remarkable progress in recent years, driven by advances in large-scale generative models and multimodal learning. However, the development of truly general 4D world models remains fundamentally constrained by the availability of high-quality data. Existing datasets and benchmarks often lack the dynamic complexity, multi-domain diversity, and spatial-temporal annotations required to support key tasks such as 4D geometric reconstruction, future prediction, and camera-control video generation. To address this gap, we introduce OmniWorld, a large-scale, multi-domain, multi-modal dataset specifically designed for 4D world modeling. OmniWorld consists of a newly collected OmniWorld-Game dataset and several curated public datasets spanning diverse domains. Compared with existing synthetic datasets, OmniWorld-Game provides richer modality coverage, larger scale, and more realistic dynamic interactions. Based on this dataset, we establish a challenging benchmark that exposes the limitations of current state-of-the-art (SOTA) approaches in modeling complex 4D environments. Moreover, fine-tuning existing SOTA methods on OmniWorld leads to significant performance gains across 4D reconstruction and video generation tasks, strongly validating OmniWorld as a powerful resource for training and evaluation. We envision OmniWorld as a catalyst for accelerating the development of general-purpose 4D world models, ultimately advancing machines’ holistic understanding of the physical world.

[148] LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence

Zixin Yin,Xili Dai,Duomin Wang,Xianfang Zeng,Lionel M. Ni,Gang Yu,Heung-Yeung Shum

Main category: cs.CV

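TL;DR: LazyDrag proposes a drag-based editing method built on explicit correspondence. It removes the reliance on implicit point matching in Multi-Modal Diffusion Transformers and enables stable full-strength inversion without test-time optimization.

Details Motivation: Current drag-based image editing methods rely on implicit point matching, which limits generative capability and makes performance unstable. LazyDrag aims to strengthen attention control through explicit correspondence to resolve these issues.

Contribution: 1. The first drag-based editing method with explicit correspondence for Multi-Modal Diffusion Transformers;
2. Stable full-strength inversion without test-time optimization;
3. Unified geometric control and text guidance, supporting complex editing tasks.

Method: Generates an explicit correspondence map from user drag inputs and uses it as a reliable reference for attention control, which stabilizes the inversion process.

Result: Outperforms baselines on DragBench in both drag accuracy and perceptual quality, as validated by VIEScore and human evaluation.

Insight: Explicit correspondence avoids the bottleneck of implicit matching and opens a new paradigm for editing with diffusion models.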

Abstract: The reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, resulting in a fundamental compromise on weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits the generative capabilities of diffusion models, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. In concrete terms, our method generates an explicit correspondence map from user drag inputs as a reliable reference to boost the attention control. This reliable reference opens the potential for a stable full-strength inversion process, which is the first in the drag-based editing task. It obviates the necessity for TTO and unlocks the generative capability of models. Therefore, LazyDrag naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening the mouth of a dog and inpainting its interior, generating new objects like a “tennis ball”, or for ambiguous drags, making context-aware changes like moving a hand into a pocket. Additionally, LazyDrag supports multi-round workflows with simultaneous move and scale operations. Evaluated on the DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by VIEScore and human evaluation. LazyDrag not only establishes new state-of-the-art performance, but also paves a new way to editing paradigms.

[149] Character-Centric Understanding of Animated Movies

Zhongrui Gui,Junyu Xie,Tengda Han,Weidi Xie,Andrew Zisserman

Main category: cs.CV

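TL;DR: The paper proposes an audio-visual method for recognizing animated characters and builds a character bank that supports downstream applications, such as audio description for visually impaired audiences and character-aware subtitles for hearing-impaired audiences.

Details Motivation: Character designs in animated movies are diverse and complex, and conventional face recognition methods struggle with their extreme diversity in appearance, motion, and deformation. A more robust method is needed to understand and recognize animated characters.

Contribution: 1. An audio-visual character recognition method; 2. an animated character bank with audio and visual samples; 3. the CMD-AM dataset; 4. an exploration of downstream applications (audio description and character-aware subtitling).

Method: 1. Automatically builds an audio-visual character bank from online sources; 2. uses multi-modal information (visual and audio) for character recognition; 3. applies the recognition results to AD generation and character-aware subtitling.

Result: The method significantly outperforms prior face-detection-based approaches in accessibility and narrative comprehension for animated content.

Insight: Multi-modal information, especially audio, plays an important role in character recognition, particularly given the extreme diversity of animated characters.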

Abstract: Animated movies are captivating for their unique character designs and imaginative storytelling, yet they pose significant challenges for existing recognition systems. Unlike the consistent visual patterns detected by conventional face recognition methods, animated characters exhibit extreme diversity in their appearance, motion, and deformation. In this work, we propose an audio-visual pipeline to enable automatic and robust animated character recognition, and thereby enhance character-centric understanding of animated movies. Central to our approach is the automatic construction of an audio-visual character bank from online sources. This bank contains both visual exemplars and voice (audio) samples for each character, enabling subsequent multi-modal character recognition despite long-tailed appearance distributions. Building on accurate character recognition, we explore two downstream applications: Audio Description (AD) generation for visually impaired audiences, and character-aware subtitling for the hearing impaired. To support research in this domain, we introduce CMD-AM, a new dataset of 75 animated movies with comprehensive annotations. Our character-centric pipeline demonstrates significant improvements in both accessibility and narrative comprehension for animated content over prior face-detection-based approaches. For the code and dataset, visit https://www.robots.ox.ac.uk/~vgg/research/animated_ad/.

cs.HC [Back]

[150] Evalet: Evaluating Large Language Models by Fragmenting Outputs into Functions

Tae Soo Kim,Heechan Lee,Yoonjoo Lee,Joseph Seering,Juho Kim

Main category: cs.HC

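TL;DR: The paper presents Evalet, a system that breaks large language model (LLM) outputs into key fragments and analyzes the rhetorical function of each fragment. This reveals which specific elements influenced an evaluation and helps users analyze and compare model behavior at a finer granularity.

Details Motivation: Current LLM evaluation methods usually rely on holistic scores, which makes it hard to identify the specific elements that influenced an evaluation, and therefore hard for users to validate results and find real issues.

Contribution: Proposes functional fragmentation and instantiates it in the Evalet system, which supports interactive visual analysis of fragment-level rhetorical functions.

Method: Splits LLM outputs into key fragments, interprets the function each fragment serves relative to the evaluation criteria, and provides visualization tools for inspection and comparison.

Result: A user study (N=10) found that Evalet helped users identify 48% more evaluation misalignments, calibrate their trust in LLM evaluations, and find more actionable issues.

Insight: By fragmenting outputs and visualizing fragment functions, LLM evaluation can shift from quantitative scores toward fine-grained qualitative analysis, improving transparency and usefulness.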

Abstract: Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through “LLM-as-a-Judge” approaches. However, these methods produce holistic scores that obscure which specific elements influenced the assessments. We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetoric functions that each fragment serves relative to evaluation criteria – surfacing the elements of interest and revealing how they fulfill or hinder user goals. We instantiate this approach in Evalet, an interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison of evaluations. A user study (N=10) found that, while practitioners struggled to validate holistic scores, our approach helped them identify 48% more evaluation misalignments. This helped them calibrate trust in LLM evaluations and rely on them to find more actionable issues in model outputs. Our work shifts LLM evaluation from quantitative scores toward qualitative, fine-grained analysis of model behavior.

cs.CR [Back]

[151] LLM in the Middle: A Systematic Review of Threats and Mitigations to Real-World LLM-based Systems

Vitor Hugo Galhardo Moia,Igor Jochem Sanz,Gabriel Antonio Fontes Rebello,Rodrigo Duarte de Meneses,Briland Hitaj,Ulf Lindqvist

Main category: cs.CR

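TL;DR: The paper presents a systematic review of threats to, and defensive strategies for, real-world LLM-based systems. It covers every phase from development to operation and gives developers and researchers a comprehensive risk analysis with matching mitigations.

Details Motivation: With the wide adoption of generative AI, and large language models in particular, security and privacy threats are multiplying, which calls for systematic study and categorization.

Contribution: A comprehensive categorization of threats to LLM-based systems, together with targeted defensive strategies, covering the entire software and LLM life cycles.

Method: Through a systematic review and categorization, the paper analyzes threats in scenarios spanning development to operation and classifies them by severity level and applicable scenario.

Result: The study identifies the most relevant threats and provides a systematic categorization of defense strategies, helping developers and organizations identify and mitigate risks efficiently.

Insight: The paper highlights the security challenges of LLM integration and discusses open problems, giving direction for future research.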

Abstract: The success and wide adoption of generative AI (GenAI), particularly large language models (LLMs), has attracted the attention of cybercriminals seeking to abuse models, steal sensitive data, or disrupt services. Moreover, providing security to LLM-based systems is a great challenge, as both traditional threats to software applications and threats targeting LLMs and their integration must be mitigated. In this survey, we shed light on security and privacy concerns of such LLM-based systems by performing a systematic review and comprehensive categorization of threats and defensive strategies considering the entire software and LLM life cycles. We analyze real-world scenarios with distinct characteristics of LLM usage, spanning from development to operation. In addition, threats are classified according to their severity level and to which scenarios they pertain, facilitating the identification of the most relevant threats. Recommended defense strategies are systematically categorized and mapped to the corresponding life cycle phase and possible attack strategies they attenuate. This work paves the way for consumers and vendors to understand and efficiently mitigate risks during integration of LLMs in their respective solutions or organizations. It also enables the research community to benefit from the discussion of open challenges and edge cases that may hinder the secure and privacy-preserving adoption of LLM-based systems.

[152] Realistic Environmental Injection Attacks on GUI Agents

Yitong Zhang,Ximo Li,Liyi Cai,Jia Li

Main category: cs.CR

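TL;DR: The paper proposes a more realistic threat model for Environmental Injection Attacks (EIAs), in which the attacker is a regular user and the trigger image is small and embedded in a dynamically changing webpage. Existing attacks are largely ineffective under this model, so the authors propose Chameleon, an attack framework with two novelties, LLM-Driven Environment Simulation and Attention Black Hole, that significantly outperforms existing methods.

Details Motivation: Modern GUI agents are built on LVLMs, and their exposure to open-world content makes them vulnerable to EIAs. Recent studies assume a regular-user attacker who can upload only a single trigger image, but they still fail to capture the dynamic nature of real webpages and the small size of real-world images.

Contribution: 1. A more realistic EIA threat model; 2. the Chameleon attack framework, which combines LLM-Driven Environment Simulation and Attention Black Hole; 3. validation on 6 realistic websites and 4 GUI agents.

Method: Chameleon has two core components: (1) LLM-Driven Environment Simulation, which generates diverse, high-fidelity webpage simulations; and (2) Attention Black Hole, which turns attention weights into explicit supervisory signals that guide the agent's focus toward the trigger region.

Result: Chameleon significantly outperforms existing methods on realistic websites and GUI agents, and ablation studies confirm that both novelties are critical to performance.

Insight: Modern GUI agents have underexplored vulnerabilities, and Chameleon provides a foundation for future research on defenses in open-world GUI agent systems.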

Abstract: GUI agents built on LVLMs are increasingly used to interact with websites. However, their exposure to open-world content makes them vulnerable to Environmental Injection Attacks (EIAs) that hijack agent behavior via webpage elements. Many recent studies assume the attacker to be a regular user who can only upload a single trigger image, which is more realistic than earlier assumptions of website-level administrative control. However, these works still fall short of realism: (1) the trigger’s position and surrounding context remain largely fixed between training and testing, failing to capture the dynamic nature of real webpages and (2) the trigger often occupies an unrealistically large area, whereas real-world images are typically small. To better reflect real-world scenarios, we introduce a more realistic threat model where the attacker is a regular user and the trigger image is small and embedded within a dynamically changing environment. As a result, existing attacks prove largely ineffective under this threat model. To better expose the vulnerabilities of GUI agents, we propose Chameleon, an attack framework with two main novelties. The first is LLM-Driven Environment Simulation, which automatically generates diverse and high-fidelity webpage simulations. The second is Attention Black Hole, which transforms attention weights into explicit supervisory signals that guide the agent’s focus toward the trigger region. We evaluate Chameleon on 6 realistic websites and 4 representative LVLM-powered GUI agents, where it significantly outperforms existing methods. Ablation studies confirm that both novelties are critical to performance. Our findings reveal underexplored vulnerabilities in modern GUI agents and establish a robust foundation for future research on defense in open-world GUI agent systems. The code is publicly available at https://github.com/zhangyitonggg/attack2gui.

cs.CY [Back]

[153] Smart Trial: Evaluating the Use of Large Language Models for Recruiting Clinical Trial Participants via Social Media

Xiaofan Zhou,Zisu Wang,Janice Krieger,Mohan Zalake,Lu Cheng

Main category: cs.CY

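TL;DR: The paper explores whether large language models (LLMs) can help recruit clinical trial participants via social media, introduces the TRIALQA dataset, and benchmarks seven widely used LLMs on assessing participant eligibility.

Details Motivation: Recruiting clinical trial participants is inefficient and geographically constrained, and traditional approaches are time-consuming with limited reach. The study aims to address this with social media data and the capabilities of LLMs.

Contribution: 1. TRIALQA, a dataset built specifically for research on clinical trial recruitment; 2. an evaluation of seven LLMs on complex reasoning tasks.

Method: 1. Builds TRIALQA from social media user data with annotated eligibility labels; 2. benchmarks seven LLMs using six training and inference strategies.

Result: Experiments show that LLMs are promising at assessing complex eligibility criteria but still struggle with multi-hop reasoning.

Insight: LLMs are promising for clinical trial recruitment, but their understanding of and reasoning over complex text needs further improvement.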

Abstract: Clinical trials (CT) are essential for advancing medical research and treatment, yet efficiently recruiting eligible participants – each of whom must meet complex eligibility criteria – remains a significant challenge. Traditional recruitment approaches, such as advertisements or electronic health record screening within hospitals, are often time-consuming and geographically constrained. This work addresses the recruitment challenge by leveraging the vast amount of health-related information individuals share on social media platforms. With the emergence of powerful large language models (LLMs) capable of sophisticated text understanding, we pose the central research question: Can LLM-driven tools facilitate CT recruitment by identifying potential participants through their engagement on social media? To investigate this question, we introduce TRIALQA, a novel dataset comprising two social media collections from the subreddits on colon cancer and prostate cancer. Using eligibility criteria from public real-world CTs, experienced annotators are hired to annotate TRIALQA to indicate (1) whether a social media user meets a given eligibility criterion and (2) the user’s stated reasons for interest in participating in CT. We benchmark seven widely used LLMs on these two prediction tasks, employing six distinct training and inference strategies. Our extensive experiments reveal that, while LLMs show considerable promise, they still face challenges in performing the complex, multi-hop reasoning needed to accurately assess eligibility criteria.

cs.MA [Back]

[154] MALLM: Multi-Agent Large Language Models Framework

Jonas Becker,Lars Benedikt Kaesberg,Niklas Bauer,Jan Philip Wahle,Terry Ruas,Bela Gipp

Main category: cs.MA

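TL;DR: MALLM is an open-source multi-agent large language model framework for the systematic analysis of multi-agent debate (MAD) components. It offers more than 144 unique configurations, letting researchers and developers explore different aspects of MAD flexibly.

Details Motivation: Current multi-agent debate frameworks are often designed around tool use, lack integrated evaluation, or offer limited configurability, so they cannot fully exploit the potential of MAD. MALLM aims to provide a systematic analysis tool that helps researchers understand MAD components and their interplay.

Contribution: (1) More than 144 unique MAD configurations; (2) support for loading textual datasets, with an integrated evaluation pipeline; (3) debates defined through simple configuration files, which makes research more flexible and extensible.

Method: MALLM defines a debate through configuration files and supports multiple agent personas (e.g., Expert, Personality), response generators (e.g., Critical, Reasoning), discussion paradigms (e.g., Memory, Relay), and decision protocols (e.g., Voting, Consensus). The framework also integrates Huggingface datasets and evaluation tooling.

Result: MALLM provides a flexible, extensible research tool that supports systematic analysis of the different aspects of MAD and a deeper understanding of multi-agent debate.

Insight: MALLM's design highlights the importance of configurability and systematic analysis, and offers a new experimental platform for research on multi-agent debate. Being open source, it also invites community collaboration and extension.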

Abstract: Multi-agent debate (MAD) has demonstrated the ability to augment collective intelligence by scaling test-time compute and leveraging expertise. Current frameworks for multi-agent debate are often designed towards tool use, lack integrated evaluation, or provide limited configurability of agent personas, response generators, discussion paradigms, and decision protocols. We introduce MALLM (Multi-Agent Large Language Models), an open-source framework that enables systematic analysis of MAD components. MALLM offers more than 144 unique configurations of MAD, including (1) agent personas (e.g., Expert, Personality), (2) response generators (e.g., Critical, Reasoning), (3) discussion paradigms (e.g., Memory, Relay), and (4) decision protocols (e.g., Voting, Consensus). MALLM uses simple configuration files to define a debate. Furthermore, MALLM can load any textual Huggingface dataset (e.g., MMLU-Pro, WinoGrande) and provides an evaluation pipeline for easy comparison of MAD configurations. MALLM is tailored towards researchers and provides a window into the heart of multi-agent debate, facilitating the understanding of its components and their interplay.
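
To make the idea of a decision protocol concrete, here is a minimal sketch of a voting step such as the "Voting" protocol named in the abstract. It is not MALLM's API: the function name `majority_vote`, the list-of-strings input, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Pick the most frequent final answer among agents.

    Hypothetical helper, not part of MALLM. Ties are broken in favour of
    the answer that appears first in the list (an assumed rule).
    """
    if not answers:
        raise ValueError("need at least one agent answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


# Three hypothetical agents end a debate with these final answers.
print(majority_vote(["B", "A", "B"]))
```

A consensus protocol would differ mainly in its stopping rule, for example by continuing the debate until a required share of agents agree, rather than counting votes once.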

eess.SP [Back]

[155] RadarLLM: Adapting Pretrained Large Language Models for Marine Radar Target Detection with Preference-aware Loss

Qiying Hu

Main category: eess.SP

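TL;DR: RadarLLM fine-tunes pre-trained large language models (LLMs) with a preference-aware loss to counter overfitting in marine radar target detection, with particularly strong results in low signal-to-clutter ratio (SCR) scenarios.

Details Motivation: Pre-trained LLMs capture general knowledge well, but fine-tuning them directly for marine radar target detection tends to overfit, especially in low-SCR scenarios.

Contribution: 1. The RadarLLM framework, a first step toward fine-tuning pre-trained LLMs for marine radar target detection; 2. a preference-aware loss function that selectively optimizes feature tokens to reduce overfitting.

Method: The preference-aware loss selectively optimizes different feature patches according to their online-evaluated learning values, guiding the model toward the most generalizable feature patterns.

Result: Experiments on real-world marine radar datasets show that RadarLLM clearly outperforms baselines in low-SCR scenarios, with especially notable gains when training data is limited.

Insight: Selectively optimizing feature tokens reduces overfitting and improves generalization in challenging scenarios such as low SCR.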

Abstract: Recent advances in pre-trained large language models (LLMs) have demonstrated their capacities to capture universal knowledge, making them promising general-purpose optimization solvers for wireless signal processing. Motivated by these findings, we take the first step towards fine-tuning pre-trained LLMs for the effective analysis of radar signal features in marine target detection tasks. Nevertheless, directly fine-tuning pre-trained LLMs on marine target detection tasks tends to suffer from pronounced overfitting, particularly in challenging low signal-to-clutter ratio (SCR) scenarios. This overfitting primarily stems from the model’s tendency to memorize spurious or noisy feature patterns rather than learning discriminative structures that generalize well to unseen data. To address this challenge, we introduce RadarLLM, a novel fine-tuning framework that utilizes an effective preference-aware loss. Unlike conventional training strategies that uniformly optimize all feature tokens, this loss function selectively optimizes different feature patches based on their online evaluated learning values, thus guiding the model to focus on the most generalizable patterns during optimization. We theoretically demonstrate the effectiveness of the evaluated learning values by transforming the problem as selecting useful feature tokens. Extensive experiments on real-world marine radar datasets show that 1) the proposed loss function is much better than the original one, with particularly significant gains in challenging low SCR scenarios and 2) RadarLLM consistently outperforms state-of-the-art baselines across diverse detection scenarios, with particularly notable gains under limited training data conditions.

[156] When marine radar target detection meets pretrained large language models

Qiying Hu,Linping Zhang,Xueqian Wang,Gang Li,Yu Liu,Xiao-Ping Zhang

Main category: eess.SP

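TL;DR: The paper proposes a framework that combines preprocessing of radar sequence features with a pre-trained large language model (LLM) to improve marine radar target detection, significantly outperforming existing baselines.

Details Motivation: Conventional deep learning algorithms for extracting sequence features from radar echo signals are hampered by redundant feature segments and restricted model sizes, so a more efficient approach is needed.

Contribution: 1. A framework that combines radar sequence feature preprocessing with a pre-trained LLM; 2. a feature filtering method based on patch selection; 3. a reduced training burden by fine-tuning only the normalization layers.

Method: 1. Radar sequence features are tokenized and filtered by a patch selection algorithm that removes uninformative segments; 2. the selected patches are projected into embeddings compatible with the LLM; 3. only the normalization layers of the pre-trained LLM are fine-tuned.

Result: Experiments show that the method significantly outperforms existing baselines on supervised learning tests.

Insight: 1. Combining preprocessing with an LLM improves feature quality; 2. fine-tuning only the normalization layers is a practical way to use a pre-trained model efficiently.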

Abstract: Deep learning (DL) methods are widely used to extract high-dimensional patterns from the sequence features of radar echo signals. However, conventional DL algorithms face challenges such as redundant feature segments and constraints from restricted model sizes. To address these issues, we propose a framework that integrates feature preprocessing with large language models (LLMs). Our preprocessing module tokenizes radar sequence features, applies a patch selection algorithm to filter out uninformative segments, and projects the selected patches into embeddings compatible with the feature space of pre-trained LLMs. Leveraging these refined embeddings, we incorporate a pre-trained LLM, fine-tuning only the normalization layers to reduce training burdens while enhancing performance. Experiments on measured datasets demonstrate that the proposed method significantly outperforms the state-of-the-art baselines on supervised learning tests.

cs.IT [Back]

[157] Rate-Distortion Limits for Multimodal Retrieval: Theory, Optimal Codes, and Finite-Sample Guarantees

Thomas Y. Chen

Main category: cs.IT

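TL;DR: The paper establishes the first information-theoretic limits for multimodal retrieval. It casts ranking as lossy source coding, derives a rate-distortion function, and designs an adaptive entropy-weighted quantizer whose measured performance comes close to the theoretical limit.

Details Motivation: Multimodal retrieval, such as cross-modal search, lacks theoretical guidance, especially on the trade-off between bit rate and distortion. The paper aims to fill this gap with both a theoretical foundation and a practical method for efficient multimodal retrieval.

Contribution: 1. A rate-distortion theory for multimodal retrieval; 2. an entropy-weighted stochastic quantizer; 3. a finite-sample excess-risk bound whose complexity scales sub-linearly in the number of modalities and the entropy gap.

Method: 1. Models ranking as lossy source coding and derives a single-letter rate-distortion function; 2. constructs an entropy-weighted stochastic quantizer with an adaptive, per-modality temperature decoder; 3. uses a Blahut-Arimoto argument to show that the scheme achieves distortion within $O(n^{-1})$ of $R(D)$.

Result: Experiments show that the adaptive codes sit within two percentage points of the theoretical frontier, while fixed-temperature and naive CLIP baselines lag significantly.

Insight: Multimodal retrieval performance is affected by entropy imbalance and redundancy across modalities, and an entropy-weighted adaptive design improves efficiency.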

Abstract: We establish the first information-theoretic limits for multimodal retrieval. Casting ranking as lossy source coding, we derive a single-letter rate-distortion function $R(D)$ for reciprocal-rank distortion and prove a converse bound that splits into a modality-balanced term plus a skew penalty $\kappa\,\Delta H$ capturing entropy imbalance and cross-modal redundancy. We then construct an explicit entropy-weighted stochastic quantizer with an adaptive, per-modality temperature decoder; a Blahut-Arimoto argument shows this scheme achieves distortion within $O(n^{-1})$ of $R(D)$ using $n$ training triples. A VC-type analysis yields the first finite-sample excess-risk bound whose complexity scales sub-linearly in both the number of modalities and the entropy gap. Experiments on controlled Gaussian mixtures and Flickr30k confirm that our adaptive codes sit within two percentage points of the theoretical frontier, while fixed-temperature and naive CLIP baselines lag significantly. Taken together, our results give a principled answer to “how many bits per query are necessary” for high-quality multimodal retrieval and provide design guidance for entropy-aware contrastive objectives, continual-learning retrievers, and retrieval-augmented generators.
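
For readers unfamiliar with how a rate-distortion function $R(D)$ is computed numerically, the following is a minimal sketch of the classical Blahut-Arimoto iteration on a small discrete source. It is an illustration under stated assumptions, not the paper's construction: it uses a binary source with Hamming distortion rather than reciprocal-rank distortion, and the function name `blahut_arimoto` and the slope parameter `beta` are choices made here.

```python
import math


def blahut_arimoto(p_x, dist, beta, iters=200):
    """Classical Blahut-Arimoto iteration for one point on the R(D) curve.

    p_x  : source probabilities p(x)
    dist : distortion matrix dist[x][y]
    beta : positive slope parameter; larger beta gives lower distortion
    Returns (rate_in_bits, expected_distortion).
    """
    n_x, n_y = len(p_x), len(dist[0])
    q_y = [1.0 / n_y] * n_y  # output marginal, initialised uniform
    for _ in range(iters):
        # Test channel: Q(y|x) proportional to q(y) * exp(-beta * d(x, y)).
        cond = []
        for x in range(n_x):
            w = [q_y[y] * math.exp(-beta * dist[x][y]) for y in range(n_y)]
            z = sum(w)
            cond.append([v / z for v in w])
        # Output marginal: q(y) = sum_x p(x) Q(y|x).
        q_y = [sum(p_x[x] * cond[x][y] for x in range(n_x)) for y in range(n_y)]
    distortion = sum(p_x[x] * cond[x][y] * dist[x][y]
                     for x in range(n_x) for y in range(n_y))
    rate = sum(p_x[x] * cond[x][y] * math.log2(cond[x][y] / q_y[y])
               for x in range(n_x) for y in range(n_y) if cond[x][y] > 0)
    return rate, distortion


# Fair binary source with Hamming distortion (illustrative choice).
rate, distortion = blahut_arimoto([0.5, 0.5], [[0, 1], [1, 0]], beta=math.log(9))
print(rate, distortion)
```

For this source the closed form is $R(D) = 1 - H_b(D)$ for $0 \le D \le 1/2$, where $H_b$ is the binary entropy function, so the printed pair can be checked against it. Sweeping `beta` traces out the rest of the curve.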

cs.SD [Back]

[158] FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs

Md Mubtasim Ahasan,Rafat Hasan Khan,Tasnim Mohiuddin,Aman Chadha,Tariq Iqbal,M Ashraful Amin,Amin Ahsan Ali,Md Mofijul Islam,A K M Mahbubur Rahman

Main category: cs.SD

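TL;DR: FuseCodec unifies the acoustic, semantic, and contextual representations of speech through cross-modal alignment and globally informed supervision, significantly improving speech codec performance.

Details Motivation: Existing neural codecs capture only low-level acoustic features and overlook the semantic and contextual information in speech, which leaves the representation incomplete.

Contribution: Proposes FuseCodec, which unifies the representations through three techniques (latent representation fusion, global supervision, and temporally aligned supervision), and demonstrates its use in zero-shot speech synthesis.

Method: Combines Latent Representation Fusion, Global Semantic-Contextual Supervision, and Temporally Aligned Contextual Supervision to align and unify the representations.

Result: On LibriSpeech, FuseCodec surpasses EnCodec, SpeechTokenizer, and DAC, reaching state-of-the-art transcription accuracy, perceptual quality, intelligibility, and speaker similarity.

Insight: Bringing in semantic and contextual information significantly improves speech tokenization, which underlines the value of fusing these representations for speech tasks.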

Abstract: Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology’s applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance in LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.

[159] Spectral and Rhythm Features for Audio Classification with Deep Convolutional Neural Networks

Friedrich Wolf-Monheim

Main category: cs.SD

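TL;DR: The paper compares spectral and rhythm features (mel-scaled spectrograms, MFCCs, and others) for audio classification and finds that mel-scaled spectrograms and MFCCs perform best with a deep CNN.

Details Motivation: To explore the use of CNNs for audio classification and test how effective different spectral and rhythm features are.

Contribution: Evaluates several spectral and rhythm features and identifies mel-scaled spectrograms and MFCCs as the best-performing features for audio classification.

Method: Runs classification experiments with a deep CNN on each feature representation (mel-scaled spectrograms, MFCCs, and others) and compares the results.

Result: Mel-scaled spectrograms and MFCCs classify significantly better than the other spectral and rhythm features investigated.

Insight: The choice of spectral feature representation is critical for audio classification, and CNNs have considerable potential in the audio domain.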

Abstract: Convolutional neural networks (CNNs) are widely used in computer vision. They can be used not only for conventional digital image material to recognize patterns, but also for feature extraction from digital imagery representing spectral and rhythm features extracted from time-domain digital audio signals for the acoustic classification of sounds. Different spectral and rhythm feature representations like mel-scaled spectrograms, mel-frequency cepstral coefficients (MFCCs), cyclic tempograms, short-time Fourier transform (STFT) chromagrams, constant-Q transform (CQT) chromagrams and chroma energy normalized statistics (CENS) chromagrams are investigated in terms of the audio classification performance using a deep convolutional neural network. It can be clearly shown that the mel-scaled spectrograms and the mel-frequency cepstral coefficients (MFCCs) perform significantly better than the other spectral and rhythm features investigated in this research for audio classification tasks using deep CNNs. The experiments were carried out with the aid of the ESC-50 dataset with 2,000 labeled environmental audio recordings.
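
As background on what "mel-scaled" means, the sketch below spaces filter-band centers evenly on the mel scale using the widely used formula $m = 2595 \log_{10}(1 + f/700)$. The abstract does not state which mel variant or parameters the paper uses, so the formula choice, the helper names, the band count, and the 22,050 Hz upper limit are all assumptions made for illustration.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (common 2595 * log10(1 + f / 700) form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands, f_min, f_max):
    """Center frequencies (Hz) of n_bands bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]


# Hypothetical settings: 8 bands between 0 Hz and an assumed 22,050 Hz limit.
for center in mel_band_centers(8, 0.0, 22050.0):
    print(round(center, 1))
```

Because the mapping is logarithmic, the bands are narrow at low frequencies and widen toward high frequencies, which is the property that distinguishes a mel-scaled spectrogram from a linear-frequency one.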

eess.AS [Back]

[160] Spectral Bottleneck in Deep Neural Networks: Noise is All You Need

Hemanth Chandravamsi,Dhanush V. Shenoy,Itay Zinn,Shimon Pisnoy,Steven H. Frankel

Main category: eess.AS

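TL;DR: The paper proposes WINNER, a weight perturbation scheme that addresses the spectral bottleneck deep neural networks face when fitting high-frequency-dominant signals. It perturbs the weight initialization with adaptively scaled noise, improving representation accuracy and convergence speed.

Details Motivation: Deep neural networks suffer from a spectral bottleneck when fitting high-frequency-dominant signals: the model fails to reconstruct the target signal. The paper aims to resolve this with a general initialization strategy.

Contribution: Proposes a target-aware weight perturbation scheme (WINNER) that adapts to the spectral characteristics of the target signal. Adjusting the noise scales gives control over the spectra of network activations and the eigenbasis of the empirical neural tangent kernel, which addresses the spectral bottleneck.

Method: Perturbs uniformly initialized weights with Gaussian noise whose scales are determined adaptively by the spectral centroid of the target signal, thereby shaping the spectral properties of the network activations.

Result: WINNER outperforms existing methods in audio fitting, achieves notable gains in image fitting and denoising, and converges faster.

Insight: Noise-injected weight initialization can steer network behavior toward the target signal, suggesting new directions for adaptive initialization in computer vision and scientific machine learning.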

Abstract: Deep neural networks are known to exhibit a spectral learning bias, wherein low-frequency components are learned early in training, while high-frequency modes emerge more gradually in later epochs. However, when the target signal lacks low-frequency components and is dominated by broadband high frequencies, training suffers from a ‘spectral bottleneck’, and the model fails to reconstruct the entire signal, including the frequency components that lie within the network’s representational capacity. We examine such a scenario in the context of implicit neural representations (INRs) with sinusoidal representation networks (SIRENs), focusing on the challenge of fitting high-frequency-dominant signals that are susceptible to spectral bottleneck. To effectively fit any target signal irrespective of its frequency content, we propose a generalized target-aware ‘weight perturbation scheme’ (WINNER - weight initialization with noise for neural representations) for network initialization. The scheme perturbs uniformly initialized weights with Gaussian noise, where the noise scales are adaptively determined by the spectral centroid of the target signal. We show that the noise scales can provide control over the spectra of network activations and the eigenbasis of the empirical neural tangent kernel. This method not only addresses the spectral bottleneck but also yields faster convergence and improved representation accuracy, outperforming state-of-the-art approaches in audio fitting and achieving notable gains in image fitting and denoising tasks. Beyond signal reconstruction, our approach opens new directions for adaptive weight initialization strategies in computer vision and scientific machine learning.
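
The two ingredients named in the abstract, a spectral centroid of the target signal and Gaussian noise added to uniformly initialized weights, can be sketched as follows. This is an illustration only: the abstract does not give the rule that maps the centroid to a noise scale, so the linear mapping `max_noise * centroid`, the constant `max_noise`, the uniform bound, and all function names here are hypothetical.

```python
import cmath
import math
import random


def spectral_centroid(signal):
    """Magnitude-weighted mean frequency, as a fraction of Nyquist in [0, 1].

    Uses a naive O(n^2) DFT so the sketch needs only the standard library.
    """
    n = len(signal)
    half = n // 2
    mags = []
    for k in range(half + 1):
        s = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(s))
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(k * m for k, m in enumerate(mags)) / (half * total)


def perturbed_uniform_init(fan_in, fan_out, bound, noise_std, rng):
    """Uniform weights in [-bound, bound] plus zero-mean Gaussian noise."""
    return [[rng.uniform(-bound, bound) + rng.gauss(0.0, noise_std)
             for _ in range(fan_in)] for _ in range(fan_out)]


# Hypothetical target: a sinusoid at half the Nyquist frequency.
target = [math.sin(2 * math.pi * 16 * t / 64) for t in range(64)]
centroid = spectral_centroid(target)
max_noise = 1.0  # assumed constant; the paper's actual scaling rule is not given here
weights = perturbed_uniform_init(4, 3, bound=0.5,
                                 noise_std=max_noise * centroid,
                                 rng=random.Random(0))
print(round(centroid, 3), len(weights), len(weights[0]))
```

The point of the sketch is the data flow: a target with more high-frequency content yields a larger centroid and therefore a stronger perturbation of the initial weights.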

cs.AI [Back]

[161] Understanding AI Evaluation Patterns: How Different GPT Models Assess Vision-Language Descriptions

Sajjad Abdoli,Rudi Cilibrasi,Rima Al-Shikh

Main category: cs.AI

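TL;DR: By analyzing vision-language descriptions generated by NVIDIA's Describe Anything Model, the paper examines the evaluation behavior of three GPT models (GPT-4o, GPT-4o-mini, GPT-5), reveals differences in their assessment strategies and biases, and argues that robust AI evaluation needs diverse evaluators.

Details Motivation: As AI systems increasingly evaluate the outputs of other AI systems, understanding their assessment behavior is crucial for preventing cascading biases. The paper aims to uncover the evaluation behavior of different GPT models and the strategies and biases behind it.

Contribution: (1) Shows that the three GPT models exhibit distinct "evaluation personalities" when assessing vision-language descriptions; (2) validates experimentally that these personalities are inherent model properties; (3) finds a consistent bias toward negative assessment across GPT models; (4) argues that diversity among evaluators matters for AI assessment.

Method: The authors generate vision-language descriptions with NVIDIA's Describe Anything Model and have three GPT models (GPT-4o, GPT-4o-mini, GPT-5) evaluate them. Validation experiments use Gemini 2.5 Pro as an independent question generator.

Result: GPT-4o-mini is systematically consistent with minimal variance, GPT-4o excels at error detection, and GPT-5 is extremely conservative with high variability. All GPT models show a 2:1 bias toward negative assessment. The evaluation strategies of the GPT models also diverge markedly from those of Gemini.

Insight: Evaluation competence does not scale with general capability, and robust AI assessment requires diverse architectural perspectives.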

Abstract: As AI systems increasingly evaluate other AI outputs, understanding their assessment behavior becomes crucial for preventing cascading biases. This study analyzes vision-language descriptions generated by NVIDIA’s Describe Anything Model and evaluated by three GPT variants (GPT-4o, GPT-4o-mini, GPT-5) to uncover distinct “evaluation personalities”: the underlying assessment strategies and biases each model demonstrates. GPT-4o-mini exhibits systematic consistency with minimal variance, GPT-4o excels at error detection, while GPT-5 shows extreme conservatism with high variability. Controlled experiments using Gemini 2.5 Pro as an independent question generator validate that these personalities are inherent model properties rather than artifacts. Cross-family analysis through semantic similarity of generated questions reveals significant divergence: GPT models cluster together with high similarity while Gemini exhibits markedly different evaluation strategies. All GPT models demonstrate a consistent 2:1 bias favoring negative assessment over positive confirmation, though this pattern appears family-specific rather than universal across AI architectures. These findings suggest that evaluation competence does not scale with general capability and that robust AI assessment requires diverse architectural perspectives.

[162] AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise

Tara Bogavelli,Roshnee Sharma,Hari Subramani

Main category: cs.AI

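TL;DR: The paper presents AgentArch, a comprehensive benchmark for evaluating agent architectures in enterprise settings. It analyzes 18 distinct agentic configurations and reveals model-specific architectural preferences.

Details Motivation: Existing research mostly studies components of agentic architectures in isolation, leaving limited empirical understanding of how different design dimensions interact in complex multi-agent systems.

Contribution: Provides a comprehensive enterprise-specific benchmark, evaluates 18 agentic configurations, and reveals model-specific architectural preferences and performance bottlenecks.

Method: Examines four key dimensions (orchestration strategy, agent prompt implementation, memory architecture, and thinking tool integration) and evaluates the configurations on a more complex and a simpler task.

Result: The highest-scoring models reach at most 35.3% success on the more complex task and 70.8% on the simpler task, showing that overall agentic performance on enterprise tasks still falls well short.

Insight: The findings challenge the prevalent one-size-fits-all paradigm in agentic AI systems and give empirical backing for design decisions in future agentic systems.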

Abstract: While individual components of agentic architectures have been studied in isolation, there remains limited empirical understanding of how different design dimensions interact within complex multi-agent systems. This study aims to address these gaps by providing a comprehensive enterprise-specific benchmark evaluating 18 distinct agentic configurations across state-of-the-art large language models. We examine four critical agentic system dimensions: orchestration strategy, agent prompt implementation (ReAct versus function calling), memory architecture, and thinking tool integration. Our benchmark reveals significant model-specific architectural preferences that challenge the prevalent one-size-fits-all paradigm in agentic AI systems. It also reveals significant weaknesses in overall agentic performance on enterprise tasks with the highest scoring models achieving a maximum of only 35.3% success on the more complex task and 70.8% on the simpler task. We hope these findings inform the design of future agentic systems by enabling more empirically backed decisions regarding architectural components and model selection.

[163] Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding

Seongho Joo,Hyukhun Koh,Kyomin Jung

Main category: cs.AI

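TL;DR: The paper proposes HaPLa, a new, broadly applicable jailbreak technique that bypasses LLM safety mechanisms through abductive framing and symbolic encoding. Experiments show an attack success rate above 95% on GPT-series models.

Details Motivation: Large language models (LLMs) perform well on many tasks, but their potential for misuse is a significant concern. Studying universal jailbreak attacks helps expose, and defend against, intrinsic weaknesses of these models.

Contribution: 1. HaPLa, which combines abductive framing and symbolic encoding for effective jailbreaking; 2. experiments that confirm its high success rate on GPT-series and other models; 3. evidence of a trade-off between LLM safety and helpfulness.

Method: HaPLa uses two strategies: 1. abductive framing, which instructs the LLM to infer plausible intermediate steps toward a harmful activity; 2. symbolic encoding, which obfuscates harmful keywords to evade detection.

Result: HaPLa achieves an attack success rate above 95% on GPT-series models and 70% across all target models.

Insight: It remains difficult to safety-tune LLMs without significantly reducing their helpfulness on benign queries, which points to an inherent tension between safety and utility.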

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their potential misuse for harmful purposes remains a significant concern. To strengthen defenses against such vulnerabilities, it is essential to investigate universal jailbreak attacks that exploit intrinsic weaknesses in the architecture and learning paradigms of LLMs. In response, we propose Harmful Prompt Laundering (HaPLa), a novel and broadly applicable jailbreaking technique that requires only black-box access to target models. HaPLa incorporates two primary strategies: 1) abductive framing, which instructs LLMs to infer plausible intermediate steps toward harmful activities, rather than directly responding to explicit harmful queries; and 2) symbolic encoding, a lightweight and flexible approach designed to obfuscate harmful content, given that current LLMs remain sensitive primarily to explicit harmful keywords. Experimental results show that HaPLa achieves over 95% attack success rate on GPT-series models and 70% across all targets. Further analysis with diverse symbolic encoding rules also reveals a fundamental challenge: it remains difficult to safely tune LLMs without significantly diminishing their helpfulness in responding to benign queries.

[164] Rethinking Human Preference Evaluation of LLM Rationales

Ziang Li,Manasi Ganti,Zixian Ma,Helena Vasconcelos,Qijia He,Ranjay Krishna

Main category: cs.AI

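TL;DR: The paper examines how to evaluate natural-language rationales generated by LLMs and proposes fine-grained, attribute-based evaluation in place of conventional binary preference judgments, giving a fuller picture of rationale quality.

Details Motivation: Existing rationale evaluation relies on binary preference judgments from humans or LLMs, which are opaque and coarse-grained and cannot explain why one rationale is better than another.

Contribution: Proposes an attribute-based evaluation framework, identifies key rationale attributes, and uses SHAP analysis and attribute-specific ELO scores to obtain more nuanced model comparisons and explanations.

Method: Combines automatic metrics, LLM judgments, and human annotations to assess rationale attributes; applies SHAP to explain human preference data; and re-evaluates model-generated rationales with attribute-specific ELO scores.

Result: Fine-grained attribute evaluation characterizes rationale quality more fully and supports more interpretable and reliable evaluation practices in future research.

Insight: Attribute-driven evaluation can overcome the limitations of binary comparison, offering deeper insight into rationale quality and directions for model improvement.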

Abstract: Large language models (LLMs) often generate natural language rationales – free-form explanations that help improve performance on complex reasoning tasks and enhance interpretability for human users. However, evaluating these rationales remains challenging. While recent work has relied on binary preference judgments from humans or LLM judges, such evaluations are often opaque and coarse-grained, offering limited insight into what makes one rationale better than another. In this work, we rethink preference evaluation for LLM-generated rationales by asking: (1) What attributes define good rationales? (2) Can human preferences be explained by these attributes? (3) Can attribute-based evaluation overcome the limitations of binary comparisons? We identify a set of key rationale attributes from prior literature and assess them using automatic metrics, LLM judgments, and human annotations. We then analyze two standard human preference datasets MT Bench and Chatbot Arena using SHAP to identify which attributes best explain human preference outcomes. Finally, we re-evaluate model-generated rationales using attribute-specific ELO scores, revealing more nuanced model comparisons and insights. Our findings suggest that fine-grained attribute evaluations can better characterize rationale quality and guide future research toward more interpretable and reliable evaluation practices.
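As a rough illustration of attribute-specific ELO scoring, the sketch below applies the standard Elo update separately to each attribute. The attribute names, the K-factor of 32, the initial rating of 1000, and the comparison data are all assumptions made for this example; the abstract does not specify the authors' exact procedure.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Apply one Elo update in place for a single pairwise comparison."""
    e_w = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_w)  # the winner gains exactly what the loser gives up
    ratings[winner] += delta
    ratings[loser] -= delta


def attribute_elo(comparisons, initial=1000.0, k=32.0):
    """comparisons: iterable of (attribute, winner_model, loser_model).

    Returns a nested mapping attribute -> model -> rating.
    """
    table = defaultdict(lambda: defaultdict(lambda: initial))
    for attribute, winner, loser in comparisons:
        update_elo(table[attribute], winner, loser, k)
    return table


# Hypothetical judgments: model_a wins on clarity, model_b on conciseness.
table = attribute_elo([
    ("clarity", "model_a", "model_b"),
    ("conciseness", "model_b", "model_a"),
])
```

Keeping one rating table per attribute means a model can rank first on one attribute and last on another, which a single aggregate preference score would hide.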

[165] Formal Reasoning for Intelligent QA Systems: A Case Study in the Educational Domain

Tuan Bui,An Nguyen,Phat Thai,Minh Hua,Ngan Pham L. N.,Ngan Pham T. B.,Dung Le,Long Nguyen,Thanh-Tung Tran,Thang Bui,Tho Quan

Main category: cs.AI

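TL;DR: The paper proposes MCFR (Model Checking for Formal Reasoning), a framework that combines large language models (LLMs) with model checking to support property verification and to improve the faithfulness and interpretability of reasoning in closed-domain QA systems.

Details Motivation: LLMs perform strongly on reasoning tasks, but their reasoning traces are often unfaithful, serving as plausible justifications rather than causally grounded derivations. Combining LLMs with symbolic engines has improved reliability but remains limited to static forms of logic and struggles with dynamic, state-based reasoning.

Contribution: 1. Proposes the MCFR framework, which integrates LLMs with model checking to translate natural language into formal specifications and verify them. 2. Introduces the EduMC-QA benchmark dataset for evaluating dynamic reasoning tasks. 3. Shows that MCFR improves reasoning faithfulness and interpretability.

Method: MCFR translates natural language into formal specifications and uses model checking to verify them over transition models. The framework combines the generative ability of LLMs with the rigor of symbolic reasoning and supports dynamic, multi-step reasoning tasks.

Result: Experiments indicate that MCFR improves on existing LLMs (such as ChatGPT, DeepSeek, and Claude) in reasoning faithfulness and interpretability, offering a reliable QA solution for high-stakes closed-domain applications.

Insight: MCFR shows the potential of neuro-symbolic methods for dynamic reasoning tasks; model checking in particular can compensate for LLMs' lack of logical rigor. The EduMC-QA dataset also provides an evaluation benchmark for future work.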

Abstract: Reasoning is essential for closed-domain QA systems in which procedural correctness and policy compliance are critical. While large language models (LLMs) have shown strong performance on many reasoning tasks, recent work reveals that their reasoning traces are often unfaithful - serving more as plausible justifications than as causally grounded derivations. Efforts to combine LLMs with symbolic engines (e.g., Prover9, Z3) have improved reliability but remain limited to static forms of logic, struggling with dynamic, state-based reasoning such as multi-step progressions and conditional transitions. In this paper, we propose MCFR (Model Checking for Formal Reasoning), a neuro-symbolic framework that integrates LLMs with model checking to support property verification. MCFR translates natural language into formal specifications and verifies them over transition models. To support evaluation, we introduce EduMC-QA, a benchmark dataset grounded in real academic procedures. Our results show that MCFR improves reasoning faithfulness and interpretability, offering a viable path toward verifiable QA in high-stakes closed-domain applications. In addition to evaluating MCFR, we compare its performance with state-of-the-art LLMs such as ChatGPT, DeepSeek, and Claude to contextualize its effectiveness.
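To make "verifying a specification over a transition model" concrete, here is a minimal sketch of explicit-state invariant checking by breadth-first search. It is not the MCFR implementation: the academic procedure, the state encoding `(stage, fees_paid)`, and the property are hypothetical, and a real system would likely use a dedicated model checker and a temporal logic.

```python
from collections import deque


def check_invariant(initial, transitions, invariant):
    """Breadth-first search over reachable states.

    Returns (True, None) if `invariant` holds in every reachable state,
    otherwise (False, path) where `path` is a shortest counterexample trace.
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return False, path[::-1]
        for nxt in transitions.get(state, ()):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return True, None


# Hypothetical academic procedure; a state is (stage, fees_paid).
TRANSITIONS = {
    ("enrolled", False): [("enrolled", True), ("thesis_submitted", False)],
    ("enrolled", True): [("thesis_submitted", True)],
    ("thesis_submitted", False): [("thesis_submitted", True)],
    ("thesis_submitted", True): [("defended", True)],
    ("defended", True): [("graduated", True)],
}


def graduates_only_with_fees_paid(state):
    """Safety property: nobody graduates with unpaid fees."""
    stage, fees_paid = state
    return stage != "graduated" or fees_paid


ok, trace = check_invariant(("enrolled", False), TRANSITIONS,
                            graduates_only_with_fees_paid)
```

When the property fails, the returned trace is a concrete sequence of states leading to the violation, which is the kind of checkable evidence a free-form LLM reasoning trace does not provide.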

[166] Maestro: Self-Improving Text-to-Image Generation via Agent Orchestration

Xingchen Wan,Han Zhou,Ruoxi Sun,Hootan Nakhost,Ke Jiang,Rajarishi Sinha,Sercan Ö. Arık

Main category: cs.AI

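TL;DR: Maestro is a self-evolving text-to-image (T2I) generation system that automatically improves image quality through self-critique and self-evolution carried out by multimodal LLM (MLLM) agents.

Details Motivation: Current T2I models depend on human intervention and iterative prompt engineering, which creates usability challenges. Maestro aims to reduce that dependence through autonomous optimization and to improve generation quality.

Contribution: 1) A self-critique mechanism in which MLLM agents identify weaknesses in generated images and provide interpretable edit signals; 2) a self-evolution mechanism that uses an MLLM for head-to-head comparison of iteratively generated images and for prompt refinement.

Method: Combines self-critique (MLLM agents acting as critics and a verifier) with self-evolution (MLLM-as-a-judge) to iteratively refine prompts and images.

Result: Experiments show that Maestro significantly outperforms initial prompts and state-of-the-art automated methods on complex T2I tasks, with gains increasing as the MLLM components become more capable.

Insight: Maestro offers a robust, interpretable, and effective framework for self-evolving T2I generation that reduces the need for human intervention.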

Abstract: Text-to-image (T2I) models, while offering immense creative potential, are highly reliant on human intervention, posing significant usability challenges that often necessitate manual, iterative prompt engineering over often underspecified prompts. This paper introduces Maestro, a novel self-evolving image generation system that enables T2I models to autonomously self-improve generated images through iterative evolution of prompts, using only an initial prompt. Maestro incorporates two key innovations: 1) self-critique, where specialized multimodal LLM (MLLM) agents act as ‘critics’ to identify weaknesses in generated images, correct for under-specification, and provide interpretable edit signals, which are then integrated by a ‘verifier’ agent while preserving user intent; and 2) self-evolution, utilizing MLLM-as-a-judge for head-to-head comparisons between iteratively generated images, eschewing problematic images, and evolving creative prompt candidates that align with user intents. Extensive experiments on complex T2I tasks using black-box models demonstrate that Maestro significantly improves image quality over initial prompts and state-of-the-art automated methods, with effectiveness scaling with more advanced MLLM components. This work presents a robust, interpretable, and effective pathway towards self-improving T2I generation.

[167] Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs

Amir Taherin,Juyi Lin,Arash Akbari,Arman Akbari,Pu Zhao,Weiwei Chen,David Kaeli,Yanzhi Wang

Main category: cs.AI

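TL;DR: The paper studies how five representative vision-language-action (VLA) models scale across hardware platforms (edge devices and datacenter GPUs), reveals non-linear effects of architectural choices and power constraints on performance, and offers practical guidance for deployment.

Details Motivation: Although VLA models perform well in robotic control, their performance scaling across architectures and hardware platforms, and the associated power budgets, remain poorly understood. The paper aims to fill this gap.

Contribution: 1) A comprehensive evaluation of five VLA models on edge and datacenter GPUs; 2) evidence of non-linear effects of architectural choices and power constraints on performance; 3) a demonstration that high-throughput variants are feasible without significant accuracy loss.

Method: Using the LIBERO benchmark, the authors measure accuracy and system-level metrics (latency, throughput, and peak memory usage) for different VLA models under edge power limits and high-performance datacenter GPU configurations.

Result: 1) Architectural choices (such as action tokenization and backbone size) strongly affect throughput and memory footprint; 2) edge devices show non-linear performance degradation under power limits, yet some configurations match or exceed older datacenter GPUs; 3) high-throughput variants with little accuracy loss exist.

Insight: The findings challenge the assumed superiority of datacenter hardware for robotic inference and give practical guidance for selecting and optimizing VLA models under different constraints.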

Abstract: Vision-Language-Action (VLA) models have emerged as powerful generalist policies for robotic control, yet their performance scaling across model architectures and hardware platforms, as well as their associated power budgets, remain poorly understood. This work presents an evaluation of five representative VLA models – spanning state-of-the-art baselines and two newly proposed architectures – targeting edge and datacenter GPU platforms. Using the LIBERO benchmark, we measure accuracy alongside system-level metrics, including latency, throughput, and peak memory usage, under varying edge power constraints and high-performance datacenter GPU configurations. Our results identify distinct scaling trends: (1) architectural choices, such as action tokenization and model backbone size, strongly influence throughput and memory footprint; (2) power-constrained edge devices exhibit non-linear performance degradation, with some configurations matching or exceeding older datacenter GPUs; and (3) high-throughput variants can be achieved without significant accuracy loss. These findings provide actionable insights when selecting and optimizing VLAs across a range of deployment constraints. Our work challenges current assumptions about the superiority of datacenter hardware for robotic inference.

[168] Advancing Medical Artificial Intelligence Using a Century of Cases

Thomas A. Buckley,Riccardo Conci,Peter G. Brodeur,Jason Gusdorf,Sourik Beltrán,Bita Behrouzi,Byron Crowe,Jacob Dockterman,Muzzammil Muhammad,Sarah Ohnigian,Andrew Sanchez,James A. Diao,Aashna P. Shah,Daniel Restrepo,Eric S. Rosenberg,Andrew S. Lea,Marinka Zitnik,Scott H. Podolsky,Zahir Kanjee,Raja-Elie E. Abdulnour,Jacob M. Koshy,Adam Rodman,Arjun K. Manrai

Main category: cs.AI

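TL;DR: The paper introduces the CPC-Bench benchmark and the Dr. CaBot AI discussant to evaluate large language models (LLMs) on complex medical diagnosis and expert medical presentation, finding that they outperform physicians on text-based diagnosis but are weaker on image tasks and literature search.

Details Motivation: Medical AI evaluations usually focus only on the final diagnosis and overlook the multifaceted reasoning and presentation skills required of expert physicians. This paper aims to fill that gap.

Contribution: 1. Creates the CPC-Bench benchmark, covering 10 text-based and multimodal tasks. 2. Develops Dr. CaBot, an AI discussant that models the diagnostic and presentation skills of human experts.

Method: Uses 7102 CPC cases and 1021 Image Challenges, with physician annotation and automated processing, to evaluate LLMs and to develop Dr. CaBot.

Result: The o3 LLM ranked the correct final diagnosis first in 60% of cases, outperforming a 20-physician baseline, but was weaker on image tasks and literature search. In 74% of blinded trials, physicians misidentified whether a text came from Dr. CaBot or from a human expert.

Insight: LLMs show strong potential on complex medical text tasks but still need improvement on image and literature tasks; CPC-Bench offers a transparent way to track progress in medical AI.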

Abstract: BACKGROUND: For over a century, the New England Journal of Medicine Clinicopathological Conferences (CPCs) have tested the reasoning of expert physicians and, recently, artificial intelligence (AI). However, prior AI evaluations have focused on final diagnoses without addressing the multifaceted reasoning and presentation skills required of expert discussants. METHODS: Using 7102 CPCs (1923-2025) and 1021 Image Challenges (2006-2025), we conducted extensive physician annotation and automated processing to create CPC-Bench, a physician-validated benchmark spanning 10 text-based and multimodal tasks, against which we evaluated leading large language models (LLMs). Then, we developed “Dr. CaBot,” an AI discussant designed to produce written and slide-based video presentations using only the case presentation, modeling the role of the human expert in these cases. RESULTS: When challenged with 377 contemporary CPCs, o3 (OpenAI) ranked the final diagnosis first in 60% of cases and within the top ten in 84% of cases, outperforming a 20-physician baseline; next-test selection accuracy reached 98%. Event-level physician annotations quantified AI diagnostic accuracy per unit of information. Performance was lower on literature search and image tasks; o3 and Gemini 2.5 Pro (Google) achieved 67% accuracy on image challenges. In blinded comparisons of CaBot vs. human expert-generated text, physicians misclassified the source of the differential in 46 of 62 (74%) of trials, and scored CaBot more favorably across quality dimensions. To promote research, we are releasing CaBot and CPC-Bench. CONCLUSIONS: LLMs exceed physician performance on complex text-based differential diagnosis and convincingly emulate expert medical presentations, but image interpretation and literature retrieval remain weaker. CPC-Bench and CaBot may enable transparent and continued tracking of progress in medical AI.

q-fin.TR [Back]

[169] Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning

Yijia Xiao,Edward Sun,Tong Chen,Fang Wu,Di Luo,Wei Wang

Main category: q-fin.TR

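TL;DR: Trading-R1 is a financial trading model that combines LLM reasoning with reinforcement learning; trained with a three-stage curriculum on multi-source financial data, it outperforms existing models in risk-adjusted returns and interpretability.

Details Motivation: Finance needs transparent, trustworthy AI analysis tools; traditional time-series models lack explainability, while LLMs struggle to turn natural-language analysis into executable trading decisions.

Contribution: Proposes Trading-R1, which combines strategic thinking with quantitative analysis to generate structured, evidence-based investment theses and optimizes trading decisions through reinforcement learning.

Method: Uses supervised fine-tuning and reinforcement learning with a three-stage easy-to-hard curriculum, trained on the multi-source financial dataset Tauric-TR1-DB.

Result: Tests on six major equities and ETFs show that Trading-R1 achieves better risk-adjusted returns and drawdown control than open-source and proprietary models.

Insight: Confirms the potential of combining reinforcement learning with LLM reasoning for financial decisions, and underlines the importance of structured analysis and interpretability.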

Abstract: Developing professional, structured reasoning on par with human financial analysts and traders remains a central challenge in AI for finance, where markets demand interpretability and trust. Traditional time-series models lack explainability, while LLMs face challenges in turning natural-language analysis into disciplined, executable trades. Although reasoning LLMs have advanced in step-by-step planning and verification, their application to risk-sensitive financial decisions is underexplored. We present Trading-R1, a financially-aware model that incorporates strategic thinking and planning for comprehensive thesis composition, facts-grounded analysis, and volatility-adjusted decision making. Trading-R1 aligns reasoning with trading principles through supervised fine-tuning and reinforcement learning with a three-stage easy-to-hard curriculum. Training uses Tauric-TR1-DB, a 100k-sample corpus spanning 18 months, 14 equities, and five heterogeneous financial data sources. Evaluated on six major equities and ETFs, Trading-R1 demonstrates improved risk-adjusted returns and lower drawdowns compared to both open-source and proprietary instruction-following models as well as reasoning models. The system generates structured, evidence-based investment theses that support disciplined and interpretable trading decisions. Trading-R1 Terminal will be released at https://github.com/TauricResearch/Trading-R1.
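The abstract reports "risk-adjusted returns" and "drawdowns" without naming the exact metrics. As an assumption for illustration only, the sketch below computes two common choices, the annualised Sharpe ratio and the maximum drawdown, from a series of per-period simple returns. The default of 252 trading periods per year is also an assumption.

```python
import math


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualised Sharpe ratio from per-period simple returns.

    Uses the sample standard deviation (n - 1 in the denominator).
    """
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    variance = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    std = math.sqrt(variance)
    if std == 0.0:
        raise ValueError("returns have zero volatility")
    return mean / std * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve (fraction)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst
```

For example, returns of +10%, -50%, +20% take the equity curve from a peak of 1.10 down to 0.55, so `max_drawdown` reports 0.5 even though the last period is positive.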

q-fin.RM [Back]

[170] Why Bonds Fail Differently? Explainable Multimodal Learning for Multi-Class Default Prediction

Yi Lu,Aifan Ling,Chaoqun Wang,Yaxin Xu

Main category: q-fin.RM

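TL;DR: The paper proposes EMDLOT, an explainable multimodal deep learning framework for multi-class bond default prediction that combines time-series and text data, substantially improving prediction performance and model interpretability.

Details Motivation: Defaults in China's bond market have surged in recent years, but traditional machine learning models struggle to capture the irregularity and temporal dependencies of financial data, while deep learning models lack interpretability. A solution is needed that both improves predictive performance and provides transparent explanations.

Contribution: 1. Proposes the EMDLOT framework, integrating numerical time series and text data; 2. Uses Time-Aware LSTM to handle irregular sequences; 3. Introduces soft clustering and multi-level attention to improve interpretability; 4. Validates the model on a real-world dataset.

Method: 1. Processes time-series data with Time-Aware LSTM; 2. Combines financial and macroeconomic indicators with text such as bond prospectuses; 3. Adopts soft clustering and multi-level attention; 4. Validates the role of each component through experiments and ablation studies.

Result: On data from 1994 Chinese firms, EMDLOT outperforms baselines such as XGBoost and LSTM in recall, F1-score, and mAP, particularly in identifying defaulted and extended firms.

Insight: 1. Fusing multimodal data (numerical and text) can markedly improve default prediction; 2. Time-aware LSTM and attention mechanisms reveal economically intuitive default drivers; 3. Model interpretability is essential for financial risk management.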

Abstract: In recent years, China’s bond market has seen a surge in defaults amid regulatory reforms and macroeconomic volatility. Traditional machine learning models struggle to capture financial data’s irregularity and temporal dependencies, while most deep learning models lack interpretability, which is critical for financial decision-making. To tackle these issues, we propose EMDLOT (Explainable Multimodal Deep Learning for Time-series), a novel framework for multi-class bond default prediction. EMDLOT integrates numerical time-series (financial/macroeconomic indicators) and unstructured textual data (bond prospectuses), uses Time-Aware LSTM to handle irregular sequences, and adopts soft clustering and multi-level attention to boost interpretability. Experiments on 1994 Chinese firms (2015-2024) show EMDLOT outperforms traditional (e.g., XGBoost) and deep learning (e.g., LSTM) benchmarks in recall, F1-score, and mAP, especially in identifying default/extended firms. Ablation studies validate each component’s value, and attention analyses reveal economically intuitive default drivers. This work provides a practical tool and a trustworthy framework for transparent financial risk modeling.

cs.RO [Back]

[171] DreamNav: A Trajectory-Based Imaginative Framework for Zero-Shot Vision-and-Language Navigation

Yunheng Wang,Yuetong Fang,Taowen Wang,Yixiao Feng,Yawen Tan,Shuning Zhang,Peiran Liu,Yiding Ji,Renjing Xu

Main category: cs.RO

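TL;DR: DreamNav proposes a trajectory-based framework for zero-shot vision-and-language navigation, improving navigation performance through global trajectory planning and active imagination.

Details Motivation: Existing zero-shot VLN methods depend on costly perception and passive scene understanding and support only point-level action selection, which makes them expensive to deploy, misaligned in action semantics, and short-sighted in planning. DreamNav aims to address these problems.

Contribution: 1) The EgoView Corrector reduces sensing cost; 2) the Trajectory Predictor enables global trajectory-level planning; 3) the Imagination Predictor provides active imagination. Together they unify trajectory-level planning and active imagination in a zero-shot VLN method for the first time.

Method: 1) The EgoView Corrector aligns viewpoints; 2) the Trajectory Predictor produces trajectory-level actions; 3) the Imagination Predictor enables long-horizon planning. Only egocentric inputs are used.

Result: On VLN-CE and real-world tests, DreamNav achieves state-of-the-art zero-shot performance, improving SR and SPL by up to 7.49% and 18.15%, respectively, over the strongest egocentric baseline.

Insight: Global trajectory planning and active imagination are critical for zero-shot navigation and avoid the limitations of point-level actions.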

Abstract: Vision-and-Language Navigation in Continuous Environments (VLN-CE), which links language instructions to perception and control in the real world, is a core capability of embodied robots. Recently, large-scale pretrained foundation models have been leveraged as shared priors for perception, reasoning, and action, enabling zero-shot VLN without task-specific training. However, existing zero-shot VLN methods depend on costly perception and passive scene understanding, collapsing control to point-level choices. As a result, they are expensive to deploy, misaligned in action semantics, and short-sighted in planning. To address these issues, we present DreamNav that focuses on the following three aspects: (1) for reducing sensory cost, our EgoView Corrector aligns viewpoints and stabilizes egocentric perception; (2) instead of point-level actions, our Trajectory Predictor favors global trajectory-level planning to better align with instruction semantics; and (3) to enable anticipatory and long-horizon planning, we propose an Imagination Predictor to endow the agent with proactive thinking capability. On VLN-CE and real-world tests, DreamNav sets a new zero-shot state-of-the-art (SOTA), outperforming the strongest egocentric baseline with extra information by up to 7.49% and 18.15% in terms of SR and SPL metrics. To our knowledge, this is the first zero-shot VLN method to unify trajectory-level planning and active imagination while using only egocentric inputs.
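For readers unfamiliar with the two reported metrics, the sketch below computes success rate (SR) and Success weighted by Path Length (SPL) using the standard definition, SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where l_i is the shortest-path length and p_i the length of the path actually taken. The episode dictionary keys are hypothetical names chosen for this example.

```python
def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length.

    Each successful episode contributes shortest / max(taken, shortest);
    failed episodes contribute 0.
    """
    total = 0.0
    for e in episodes:
        if e["success"]:
            total += e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)


# Hypothetical episodes: one optimal success, one detour, one failure.
episodes = [
    {"success": True, "shortest": 10.0, "taken": 10.0},
    {"success": True, "shortest": 10.0, "taken": 20.0},
    {"success": False, "shortest": 10.0, "taken": 5.0},
]
```

SPL penalises successful but inefficient paths, so a gain in SPL that is larger than the gain in SR suggests that routes became shorter, not only that more episodes succeeded.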

[172] Nav-R1: Reasoning and Navigation in Embodied Scenes

Qingxiang Liu,Ting Huang,Zeyu Zhang,Hao Tang

Main category: cs.RO

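TL;DR: Nav-R1 is a unified reasoning-and-navigation foundation model for embodied environments; by building the large-scale Nav-CoT-110K dataset and a GRPO-based reinforcement learning framework, it addresses the difficulty existing methods have with coherent reasoning and with balancing reasoning against real-time navigation.

Details Motivation: Existing embodied navigation methods suffer from incoherent and unstable reasoning in complex 3D environments and struggle to balance semantic reasoning with real-time control.

Contribution: 1) Builds the Nav-CoT-110K dataset to support structured reasoning; 2) designs a GRPO-based reinforcement learning framework; 3) proposes the Fast-in-Slow reasoning paradigm, which decouples semantic reasoning from real-time control.

Method: Uses a GRPO-based framework with three rewards (format, understanding, and navigation), together with the Fast-in-Slow paradigm for efficient, coherent navigation.

Result: On embodied AI benchmarks, Nav-R1 improves average performance by more than 8%, and deployment on a mobile robot validates its robustness.

Insight: A structured reasoning dataset and an improved reinforcement learning framework can markedly improve the performance and robustness of embodied navigation.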

Abstract: Embodied navigation requires agents to integrate perception, reasoning, and action for robust interaction in complex 3D environments. Existing approaches often suffer from incoherent and unstable reasoning traces that hinder generalization across diverse environments, and difficulty balancing long-horizon semantic reasoning with low-latency control for real-time navigation. To address these challenges, we propose Nav-R1, an embodied foundation model that unifies reasoning in embodied environments. We first construct Nav-CoT-110K, a large-scale dataset of step-by-step Chains-of-Thought (CoT) for embodied tasks, which enables cold-start initialization with structured reasoning. Building on this foundation, we design a GRPO-based reinforcement learning framework with three complementary rewards: format, understanding, and navigation, to improve structural adherence, semantic grounding, and path fidelity. Furthermore, we introduce a Fast-in-Slow reasoning paradigm, decoupling deliberate semantic reasoning from low-latency reactive control for efficient yet coherent navigation. Extensive evaluations on embodied AI benchmarks demonstrate that Nav-R1 consistently outperforms strong baselines, with over 8% average improvement in reasoning and navigation performance. Real-world deployment on a mobile robot further validates its robustness under limited onboard resources. Code: https://github.com/AIGeeksGroup/Nav-R1. Website: https://aigeeksgroup.github.io/Nav-R1.
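The core of GRPO is a group-relative advantage: several responses are sampled for the same prompt, and each response's reward is normalised by the group mean and standard deviation. The sketch below shows that step with the three reward terms named in the abstract. The equal weights, the use of the population standard deviation, and the `eps` constant are assumptions for illustration; the paper's actual reward definitions are not given here.

```python
import statistics


def total_reward(r_format, r_understanding, r_navigation, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward terms (equal weights are an assumption)."""
    w_f, w_u, w_n = weights
    return w_f * r_format + w_u * r_understanding + w_n * r_navigation


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of three sampled responses to the same instruction.
group = [
    total_reward(1.0, 0.0, 0.0),
    total_reward(1.0, 0.5, 0.5),
    total_reward(1.0, 1.0, 1.0),
]
advantages = group_relative_advantages(group)
```

Because advantages are relative to the group, a reward term that is identical across all samples (here the format reward) does not change the advantages; only terms that differ between samples provide a learning signal.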

[173] ManiVID-3D: Generalizable View-Invariant Reinforcement Learning for Robotic Manipulation via Disentangled 3D Representations

Zheng Li,Pei Qu,Yufei Jia,Shihui Zhou,Haizhou Ge,Jiahang Cao,Jinni Zhou,Guyue Zhou,Jun Ma

Main category: cs.RO

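TL;DR: ManiVID-3D proposes a 3D reinforcement learning architecture that learns view-invariant representations through disentangled feature learning, improving the robustness of robotic manipulation.

Details Motivation: In the real world, camera viewpoint changes cause visual reinforcement learning policies to fail; existing methods rely on precise camera calibration or struggle with large viewpoint changes.

Contribution: Proposes ManiVID-3D, which includes ViewNet, a module that automatically aligns point clouds into a unified coordinate system without extrinsic calibration; develops an efficient GPU-accelerated batch rendering module that enables large-scale training.

Method: Learns view-invariant representations through self-supervised disentangled feature learning; the ViewNet module aligns point clouds, and GPU-accelerated rendering supports efficient training.

Result: Across 10 simulated and 5 real-world tasks, the success rate is 44.7% higher than state-of-the-art methods, with 80% fewer parameters.

Insight: Geometrically consistent representations are key to improving the robustness and scalability of robotic manipulation; disentangled learning and efficient rendering are the core innovations.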

Abstract: Deploying visual reinforcement learning (RL) policies in real-world manipulation is often hindered by camera viewpoint changes. A policy trained from a fixed front-facing camera may fail when the camera is shifted, an unavoidable situation in real-world settings where sensor placement is hard to manage appropriately. Existing methods often rely on precise camera calibration or struggle with large perspective changes. To address these limitations, we propose ManiVID-3D, a novel 3D RL architecture designed for robotic manipulation, which learns view-invariant representations through self-supervised disentangled feature learning. The framework incorporates ViewNet, a lightweight yet effective module that automatically aligns point cloud observations from arbitrary viewpoints into a unified spatial coordinate system without the need for extrinsic calibration. Additionally, we develop an efficient GPU-accelerated batch rendering module capable of processing over 5000 frames per second, enabling large-scale training for 3D visual RL at unprecedented speeds. Extensive evaluation across 10 simulated and 5 real-world tasks demonstrates that our approach achieves a 44.7% higher success rate than state-of-the-art methods under viewpoint variations while using 80% fewer parameters. The system’s robustness to severe perspective changes and strong sim-to-real performance highlight the effectiveness of learning geometrically consistent representations for scalable robotic manipulation in unstructured environments. Our project website can be found at https://zheng-joe-lee.github.io/manivid3d/.

[174] Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations

Shresth Grover,Akshay Gopalkrishnan,Bo Ai,Henrik I. Christensen,Hao Su,Xuanlin Li

Main category: cs.RO

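TL;DR: The paper proposes a framework that improves the generalization of vision-language-action (VLA) models by preserving pretrained features during fine-tuning.

Details Motivation: Fine-tuning existing VLA models often disrupts pretrained features, which limits their generalization.

Contribution: Proposes a dual-encoder design, a string-based action tokenizer, and a co-training strategy to better preserve pretrained features.

Method: Uses a frozen vision encoder to retain pretrained features while training a second encoder; converts continuous actions into character sequences; and co-trains on robot demonstrations and vision-language data.

Result: Experiments show that the method outperforms baselines in robustness to visual perturbations, generalization to novel instructions and environments, and task success rate.

Insight: Preserving pretrained features during fine-tuning is key to improving the generalization of VLA models.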

Abstract: Vision-language-action (VLA) models finetuned from vision-language models (VLMs) hold the promise of leveraging rich pretrained representations to build generalist robots across diverse tasks and environments. However, direct fine-tuning on robot data often disrupts these representations and limits generalization. We present a framework that better preserves pretrained features while adapting them for robot manipulation. Our approach introduces three components: (i) a dual-encoder design with one frozen vision encoder to retain pretrained features and another trainable for task adaptation, (ii) a string-based action tokenizer that casts continuous actions into character sequences aligned with the model’s pretraining domain, and (iii) a co-training strategy that combines robot demonstrations with vision-language datasets emphasizing spatial reasoning and affordances. Evaluations in simulation and on real robots show that our method improves robustness to visual perturbations, generalization to novel instructions and environments, and overall task success compared to baselines.
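One way to picture a string-based action tokenizer is to render each continuous action vector as ordinary decimal text, so the language model emits characters it already saw during pretraining instead of new action tokens. The sketch below is a guess at the general idea, not the authors' tokenizer: the two-decimal precision, the space separator, and both function names are assumptions.

```python
def action_to_string(action, decimals=2):
    """Render a continuous action vector as plain text, e.g. '0.10 -0.25 1.00'."""
    return " ".join(f"{a:.{decimals}f}" for a in action)


def string_to_action(text):
    """Parse the text form back into a list of floats."""
    return [float(token) for token in text.split()]


# Hypothetical 3-DoF action (dx, dy, gripper).
encoded = action_to_string([0.1, -0.25, 1.0])
decoded = string_to_action(encoded)
```

The fixed number of decimals sets the quantisation error of the round trip, which is the main design trade-off: more digits give finer control at the cost of longer output sequences.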

[175] ParaEQsA: Parallel and Asynchronous Embodied Questions Scheduling and Answering

Haisheng Wang,Weiming Zhi

Main category: cs.RO

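TL;DR: The paper formulates the parallel, asynchronous Embodied Questions Answering (EQsA) problem and proposes the ParaEQsA framework, which handles multiple questions more efficiently through a shared memory module and priority planning.

Details Motivation: Classical Embodied Question Answering (EQA) handles a single question, whereas real deployments must handle multiple questions that arrive asynchronously with different urgencies.

Contribution: 1. Defines the EQsA problem and introduces the PAEQs benchmark; 2. Proposes the ParaEQsA framework, which supports parallel and asynchronous question scheduling and answering; 3. Designs two new performance metrics, DAR and NUWL.

Method: ParaEQsA uses a shared memory module to reduce redundant exploration and a priority-planning module to schedule questions dynamically.

Result: ParaEQsA outperforms sequential baselines while reducing exploration and delay, and experiments examine the contribution of each module.

Insight: Modeling priority and urgency is key to handling multiple questions efficiently, and a shared memory mechanism effectively reduces redundant operations.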

Abstract: This paper formulates the Embodied Questions Answering (EQsA) problem, introduces a corresponding benchmark, and proposes a system to tackle the problem. Classical Embodied Question Answering (EQA) is typically formulated as answering one single question by actively exploring a 3D environment. Real deployments, however, often demand handling multiple questions that may arrive asynchronously and carry different urgencies. We formalize this setting as Embodied Questions Answering (EQsA) and present ParaEQsA, a framework for parallel, urgency-aware scheduling and answering. ParaEQsA leverages a group memory module shared among questions to reduce redundant exploration, and a priority-planning module to dynamically schedule questions. To evaluate this setting, we contribute the Parallel Asynchronous Embodied Questions (PAEQs) benchmark containing 40 indoor scenes and five questions per scene (200 in total), featuring asynchronous follow-up questions and urgency labels. We further propose metrics for EQsA performance: Direct Answer Rate (DAR), and Normalized Urgency-Weighted Latency (NUWL), which jointly measure efficiency and responsiveness of this system. ParaEQsA consistently outperforms strong sequential baselines adapted from recent EQA systems, while reducing exploration and delay. Empirical evaluations investigate the relative contributions of priority, urgency modeling, spatial scope, reward estimation, and dependency reasoning within our framework. Together, these results demonstrate that urgency-aware, parallel scheduling is key to making embodied agents responsive and efficient under realistic, multi-question workloads.
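A very small sketch of urgency-aware scheduling with asynchronous arrivals is given below. It is only a priority queue, not the ParaEQsA priority-planning module, which the abstract says also considers factors such as spatial scope, reward estimation, and dependencies. The class name, the numeric urgency scale, and the example questions are hypothetical.

```python
import heapq
import itertools


class UrgencyQueue:
    """Hypothetical scheduler: highest urgency first, ties broken by arrival order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # monotonically increasing arrival index

    def add(self, question, urgency):
        # heapq is a min-heap, so negate urgency to pop the most urgent first.
        heapq.heappush(self._heap, (-urgency, next(self._counter), question))

    def pop(self):
        _, _, question = heapq.heappop(self._heap)
        return question

    def __len__(self):
        return len(self._heap)


queue = UrgencyQueue()
queue.add("Where is the mug?", urgency=0.2)
queue.add("Is the stove on?", urgency=0.9)
first = queue.pop()                        # most urgent question so far
queue.add("Is the door locked?", urgency=0.2)  # arrives after scheduling began
remaining = [queue.pop() for _ in range(len(queue))]
```

Questions can be added at any time between pops, which mirrors the asynchronous follow-up setting; the arrival counter keeps the order deterministic when urgencies tie.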

[176] TrajBooster: Boosting Humanoid Whole-Body Manipulation via Trajectory-Centric Learning

Jiacheng Liu,Pengxiang Ding,Qihang Zhou,Yuxuan Wu,Da Huang,Zimian Peng,Wei Xiao,Weinan Zhang,Lixin Yang,Cewu Lu,Donglin Wang

Main category: cs.RO

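TL;DR: The paper proposes KORR, a framework that combines the Koopman operator with residual policy learning to improve global state understanding and performance on long-horizon robotic tasks.

Details Motivation: Imitation learning suffers from compounding errors in long-horizon tasks and high-precision control; existing residual policy methods lack a global understanding of state evolution, which limits robustness and generalization.

Contribution: Proposes the KORR framework, which uses Koopman operator theory to impose linear time-invariant structure in a latent space and combines it with residual learning for globally informed policy refinement.

Method: Models global dynamics with the Koopman operator, predicts state transitions in the latent space, and conditions residual policy updates on those predictions.

Result: On long-horizon, fine-grained robotic assembly tasks, KORR shows better performance, robustness, and generalization than baselines.

Insight: Koopman theory offers a bridge between modern learning methods and classical control theory and shows the potential of global dynamics modeling in reinforcement learning.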

Abstract: Imitation learning (IL) enables efficient skill acquisition from demonstrations but often struggles with long-horizon tasks and high-precision control due to compounding errors. Residual policy learning offers a promising, model-agnostic solution by refining a base policy through closed-loop corrections. However, existing approaches primarily focus on local corrections to the base policy, lacking a global understanding of state evolution, which limits robustness and generalization to unseen scenarios. To address this, we propose incorporating global dynamics modeling to guide residual policy updates. Specifically, we leverage Koopman operator theory to impose linear time-invariant structure in a learned latent space, enabling reliable state transitions and improved extrapolation for long-horizon prediction and unseen environments. We introduce KORR (Koopman-guided Online Residual Refinement), a simple yet effective framework that conditions residual corrections on Koopman-predicted latent states, enabling globally informed and stable action refinement. We evaluate KORR on long-horizon, fine-grained robotic furniture assembly tasks under various perturbations. Results demonstrate consistent gains in performance, robustness, and generalization over strong baselines. Our findings further highlight the potential of Koopman-based modeling to bridge modern learning methods with classical control theory. For more details, please refer to https://jiachengliu3.github.io/TrajBooster.
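The phrase "linear time-invariant structure in a learned latent space" means that latent states evolve as z[t+1] ≈ K z[t] for a fixed matrix K. The sketch below fits K by ordinary least squares on a 2-D trajectory and rolls it forward. This is a swapped-in textbook technique for illustration, not KORR itself: in the paper the latent space is learned and the predictions condition a residual policy, whereas here the latent states are given and the function names are hypothetical.

```python
def fit_linear_dynamics_2d(states):
    """Least-squares 2x2 matrix K such that z[t+1] is approximately K z[t].

    Solves K = A G^{-1} with A = sum(z_next z^T) and G = sum(z z^T).
    """
    a = [[0.0, 0.0], [0.0, 0.0]]
    g = [[0.0, 0.0], [0.0, 0.0]]
    for z, z_next in zip(states[:-1], states[1:]):
        for i in range(2):
            for j in range(2):
                a[i][j] += z_next[i] * z[j]
                g[i][j] += z[i] * z[j]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    if abs(det) < 1e-12:
        raise ValueError("trajectory does not span the latent space")
    g_inv = [[g[1][1] / det, -g[0][1] / det],
             [-g[1][0] / det, g[0][0] / det]]
    return [[sum(a[i][k] * g_inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def rollout(k, z0, steps):
    """Predict `steps` future latent states by repeatedly applying K."""
    z = list(z0)
    trajectory = [z]
    for _ in range(steps):
        z = [k[0][0] * z[0] + k[0][1] * z[1],
             k[1][0] * z[0] + k[1][1] * z[1]]
        trajectory.append(z)
    return trajectory
```

Because the model is linear, a long-horizon prediction is just a matrix power applied to the current latent state, which is what makes multi-step extrapolation cheap and stable compared with chaining a nonlinear one-step model.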

cs.CE [Back]

[177] FinGEAR: Financial Mapping-Guided Enhanced Answer Retrieval

Ying Li,Mengyu Wang,Miguel de Carvalho,Sotirios Sabanis,Tiejun Ma

Main category: cs.CE

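TL;DR: FinGEAR is a retrieval framework for financial documents that combines a finance lexicon, dual hierarchical indices, and a two-stage cross-encoder reranker to significantly improve retrieval over financial disclosures.

Details Motivation: Financial disclosures such as 10-K filings are hard for standard retrieval-augmented generation (RAG) models because of their length, regulatory section hierarchy, and domain-specific language, so a finer-grained retrieval approach is needed.

Contribution: Proposes the FinGEAR framework, which combines a finance lexicon (FLAM), dual hierarchical indices (Summary Tree and Question Tree), and a two-stage reranker to improve retrieval precision and downstream task performance.

Method: 1. Uses the finance lexicon (FLAM) for Item-level retrieval guidance; 2. Builds dual hierarchical indices (Summary Tree and Question Tree) to support fine-grained retrieval; 3. Applies a two-stage cross-encoder reranker to refine results.

Result: On queries aligned to the FinQA dataset, FinGEAR improves F1 by up to 56.7% over flat RAG, 12.5% over graph-based RAGs, and 217.6% over prior tree-based systems, while also increasing answer accuracy.

Insight: By jointly modeling section hierarchy and domain lexicon signals, FinGEAR improves retrieval fidelity and usefulness and provides a reliable basis for high-stakes financial analysis.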

Abstract: Financial disclosures such as 10-K filings present challenging retrieval problems due to their length, regulatory section hierarchy, and domain-specific language, which standard retrieval-augmented generation (RAG) models underuse. We introduce FinGEAR (Financial Mapping-Guided Enhanced Answer Retrieval), a retrieval framework tailored to financial documents. FinGEAR combines a finance lexicon for Item-level guidance (FLAM), dual hierarchical indices for within-Item search (Summary Tree and Question Tree), and a two-stage cross-encoder reranker. This design aligns retrieval with disclosure structure and terminology, enabling fine-grained, query-aware context selection. Evaluated on full 10-Ks with queries aligned to the FinQA dataset, FinGEAR delivers consistent gains in precision, recall, F1, and relevancy, improving F1 by up to 56.7% over flat RAG, 12.5% over graph-based RAGs, and 217.6% over prior tree-based systems, while also increasing downstream answer accuracy with a fixed reader. By jointly modeling section hierarchy and domain lexicon signals, FinGEAR improves retrieval fidelity and provides a practical foundation for high-stakes financial analysis.

cs.LG [Back]

[178] Agentic Username Suggestion and Multimodal Gender Detection in Online Platforms: Introducing the PNGT-26K Dataset

Farbod Bijary,Mohsen Ebadpour,Amirhosein Tajbakhsh

Main category: cs.LG

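TL;DR: The paper introduces the PNGT-26K dataset to address the challenges Persian names pose for gender detection and digital identity creation, and proposes two frameworks, Open Gender Detection and Nominalist.

Details Motivation: Persian names pose distinctive challenges for natural language processing, such as inconsistent transliteration and culture-specific naming patterns; existing tools perform poorly and datasets are scarce.

Contribution: 1. Introduces the PNGT-26K dataset; 2. Develops the Open Gender Detection framework for multimodal gender detection; 3. Proposes the Nominalist framework for agentic username suggestion.

Method: 1. Builds the PNGT-26K dataset of approximately 26,000 Persian names with their associated gender and English transliteration; 2. Develops a gender detection framework based on user data (such as profile photo and name); 3. Designs an AI-based username suggestion system.

Result: The PNGT-26K dataset and both frameworks are publicly available on GitHub for practical use.

Insight: The work fills a data gap in Persian name processing and provides practical tools relevant to multilingual NLP and user-experience optimization.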

Abstract: Persian names present unique challenges for natural language processing applications, particularly in gender detection and digital identity creation, due to transliteration inconsistencies and culture-specific naming patterns. Existing tools exhibit significant performance degradation on Persian names, while the scarcity of comprehensive datasets further compounds these limitations. To address these challenges, the present research introduces PNGT-26K, a comprehensive dataset of Persian names, their commonly associated gender, and their English transliteration, consisting of approximately 26,000 tuples. As a demonstration of how this resource can be utilized, we also introduce two frameworks, namely Open Gender Detection and Nominalist. Open Gender Detection is a production-grade, ready-to-use framework for using existing data from a user, such as profile photo and name, to give a probabilistic guess about the person’s gender. Nominalist, the second framework introduced by this paper, utilizes agentic AI to help users choose a username for their social media accounts on any platform. It can be easily integrated into any website to provide a better user experience. The PNGT-26K dataset, Nominalist and Open Gender Detection frameworks are publicly available on GitHub.

[179] Opal: An Operator Algebra View of RLHF

Madhava Gaikwad

Main category: cs.LG

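TL;DR: Opal presents an operator-algebra view of RLHF (reinforcement learning from human feedback), in which objectives are built from two primitives: additive penalties and multiplicative pairwise weights. It introduces the Generalized Kernel Preference Object (GKPO) as a canonical representation.

Details Motivation: Existing RLHF methods lack a unified mathematical framework and a canonical representation; Opal aims to provide a standardized view through operator algebra.

Contribution: 1. Proposes the Opal framework, which expresses RLHF objectives as compositions of additive-penalty and multiplicative-weight primitives. 2. Introduces GKPO as a unified representation that supports canonicalization, serialization, and cross-method conversion.

Method: 1. Defines the primitives (additive penalties and multiplicative weights) and the conditions under which they reduce. 2. Proposes the GKPO schema, providing a standardized representation together with test cases that demonstrate non-reducibility.

Result: GKPO is illustrated with examples for DPO, RRHF, and ORPO, and behavior is examined when the reduction conditions fail.

Insight: Through the operator view, RLHF methods can be represented canonically, and the non-reducibility conditions reveal the limitations of existing methods.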

Abstract: We present Opal, an operator view of reinforcement learning from human feedback (RLHF). Objectives are expressed as ladders of two primitives on a base utility: additive penalties and multiplicative pairwise weights. We describe a simple reduction law with if-and-only-if conditions: such ladders collapse to a normal form on pairwise margins when the reference is fixed, penalties are additive, and weights are independent of intermediate margins. When these assumptions do not hold (reference shift, non-additive gates, score-dependent weights), small examples demonstrate non-reducibility. Building on this view, we introduce GKPO (Generalized Kernel Preference Object), a canonical schema in which many RLHF methods can be represented and, when reducible, mapped back from. GKPO provides a standard JSON serialization, canonicalization and hashing rules, and explicit flags with finite witnesses when assumptions fail. We illustrate these ideas with GKPO examples for DPO, RRHF, and ORPO, along with cross-method conversions (where assumptions permit) and minimal stress tests (SHIFT/GATE/SCORE) that highlight non-reducibility. A lightweight Python reference library accompanies the schema, implementing canonical hashing and adapters for DPO and RRHF.
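
The canonicalization-and-hashing idea mentioned in the abstract can be illustrated with a minimal sketch. The field names below (`method`, `beta`, `penalties`, `weights`) are hypothetical and are not taken from the GKPO schema; the sketch only shows one common way to make a JSON hash independent of key order and whitespace, and may differ from the rules in the accompanying reference library.

```python
import hashlib
import json


def canonical_hash(obj: dict) -> str:
    """Hash a JSON-serializable object independently of key order and whitespace."""
    # Sorting keys and fixing separators yields one canonical byte string
    # for every logically equal object.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two hypothetical DPO-like records with identical content but different key order.
record_a = {"method": "DPO", "beta": 0.1, "penalties": [],
            "weights": {"kind": "constant", "value": 1.0}}
record_b = {"weights": {"value": 1.0, "kind": "constant"}, "penalties": [],
            "beta": 0.1, "method": "DPO"}

# A record that differs in one hyperparameter.
record_c = dict(record_a, beta=0.2)
```

Under this scheme `record_a` and `record_b` hash identically, while `record_c` does not, which is the property a canonical schema needs for deduplication and cross-method comparison.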

[180] Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting

Yining Lu,Zilong Wang,Shiyang Li,Xin Liu,Changlong Yu,Qingyu Yin,Zhan Shi,Zixuan Zhang,Meng Jiang

Main category: cs.LG

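TL;DR: The paper proposes dynamic reward weighting to address the inability of fixed-weight linear scalarization to capture non-convex Pareto fronts in multi-objective reinforcement learning, targeting online preference alignment of large language models.

Details Motivation: Conventional multi-objective RL uses linear scalarization with fixed weights, which cannot handle non-convex Pareto fronts and therefore yields suboptimal results. This is especially critical in online preference alignment.

Contribution: Introduces dynamic reward weighting, which adaptively adjusts reward weights to explore the Pareto front effectively, with two concrete realizations: hypervolume-guided weight adaptation and gradient-based weight optimization.

Method: 1. Hypervolume-guided weight adaptation; 2. Gradient-based optimization of reward weights. Both are compatible with common online RL algorithms.

Result: Experiments show that dynamic weighting outperforms fixed-weight baselines across multiple mathematical reasoning datasets and model families, reaching Pareto-dominant solutions in fewer training steps.

Insight: Dynamic weighting appears broadly applicable to multi-objective optimization and performs particularly well in highly non-linear, non-convex objective spaces.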

Abstract: Prior works in multi-objective reinforcement learning typically use linear reward scalarization with fixed weights, which provably fails to capture non-convex Pareto fronts and thus yields suboptimal results. This limitation becomes especially critical in online preference alignment for large language models. Here, stochastic trajectories generated by parameterized policies create highly non-linear and non-convex mappings from parameters to objectives, for which no single static weighting scheme can find optimal trade-offs. We address this limitation by introducing dynamic reward weighting, which adaptively adjusts reward weights during the online reinforcement learning process. Unlike existing approaches that rely on fixed-weight interpolation, our dynamic weighting continuously balances and prioritizes objectives in training, facilitating effective exploration of Pareto fronts in objective space. We introduce two approaches of increasing sophistication and generalizability: (1) hypervolume-guided weight adaptation and (2) gradient-based weight optimization, offering a versatile toolkit for online multi-objective alignment. Our extensive experiments demonstrate their compatibility with commonly used online reinforcement learning algorithms (including GRPO, REINFORCE, and RLOO), effectiveness across multiple mathematical reasoning datasets, and applicability to different model families, consistently achieving Pareto-dominant solutions with fewer training steps than fixed-weight linear scalarization baselines.
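
The limitation of fixed-weight linear scalarization can be seen on a toy example. The sketch below is not the paper's method; it only illustrates, with three made-up points, that a Pareto-optimal solution lying in a non-convex region of the front is never selected by any fixed weight `w`.

```python
def pareto_optimal(points):
    """Return the points not dominated by any other point (both objectives maximized)."""
    def dominates(p, q):
        return all(a >= b for a, b in zip(p, q)) and p != q
    return [p for p in points if not any(dominates(q, p) for q in points)]


def scalarization_winners(points, steps=101):
    """Collect the points that maximize w*f1 + (1-w)*f2 for some w on a grid over [0, 1]."""
    winners = set()
    for i in range(steps):
        w = i / (steps - 1)
        winners.add(max(points, key=lambda p: w * p[0] + (1 - w) * p[1]))
    return winners


# Hypothetical objective values; (0.4, 0.4) sits in a concave dent of the front.
points = [(1.0, 0.0), (0.4, 0.4), (0.0, 1.0)]
front = pareto_optimal(points)
winners = scalarization_winners(points)
```

All three points are Pareto-optimal, yet the balanced point `(0.4, 0.4)` scores 0.4 under every weight while one of the extremes always scores at least 0.5, so a static weighting cannot reach it. This is the kind of trade-off that motivates adjusting weights during training.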

[181] Measuring Visual Understanding in Telecom domain: Performance Metrics for Image-to-UML conversion using VLMs

HG Ranjani,Rutuja Prabhudesai

Main category: cs.LG

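TL;DR: The paper proposes performance metrics for measuring the accuracy of image-to-UML conversion in the telecom domain, reports how two VLMs (GPT-4V and Claude Sonnet) perform on different components, and points out weaknesses in converting complex structures.

Details Motivation: 3GPP documents in the telecom domain contain many sequence diagrams. Existing work on evaluating the conversion of such images to machine-readable PlantUML falls short, lacking metrics that compare individual components.

Contribution: Proposes performance metrics that quantify per-component accuracy in image-to-UML conversion (e.g., participant identification, message flow ordering), and analyzes VLM performance on complex structures.

Method: Compares PlantUML scripts generated by GPT-4V and Claude Sonnet against manually created ground truth, using version control tools and standard performance metrics to evaluate conversion quality.

Result: Experiments show that the VLMs perform well on basic components (nodes, edges, messages) but poorly on complex structures (e.g., notes, boxes, groups).

Insight: Complex structures need to be better represented in training data to further improve VLM performance on image-to-UML conversion.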

Abstract: Telecom domain 3GPP documents are replete with images containing sequence diagrams. Advances in Vision-Language Large Models (VLMs) have eased conversion of such images to machine-readable PlantUML (puml) formats. However, there is a gap in evaluation of such conversions - existing works do not compare puml scripts for various components. In this work, we propose performance metrics to measure the effectiveness of such conversions. A dataset of sequence diagrams from 3GPP documents is chosen to be representative of domain-specific actual scenarios. We compare puml outputs from two VLMs - Claude Sonnet and GPT-4V - against manually created ground truth representations. We use version control tools to capture differences and introduce standard performance metrics to measure accuracies along various components: participant identification, message flow accuracy, sequence ordering, and grouping construct preservation. We demonstrate effectiveness of proposed metrics in quantifying conversion errors across various components of puml scripts. The results show that nodes, edges and messages are accurately captured. However, we observe that VLMs do not necessarily perform well on complex structures such as notes, boxes, and groups. Our experiments and performance metrics indicate a need for better representation of these components in training data for fine-tuned VLMs.
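
A per-component comparison of this kind might look like the following sketch. It is an assumption-laden simplification: it uses plain set comparison rather than the version-control diffs described in the abstract, it handles only `participant`/`actor` declarations and simple arrow messages, and the two diagrams are invented for illustration.

```python
import re

# Simplified patterns for a small subset of PlantUML sequence-diagram syntax.
PARTICIPANT = re.compile(r"^\s*(?:participant|actor)\s+(\w+)")
MESSAGE = re.compile(r"^\s*(\w+)\s*-+>+\s*(\w+)\s*:\s*(.+?)\s*$")


def parse_puml(script):
    """Extract declared participants and (sender, receiver, text) messages."""
    participants, messages = set(), []
    for line in script.splitlines():
        match = PARTICIPANT.match(line)
        if match:
            participants.add(match.group(1))
            continue
        match = MESSAGE.match(line)
        if match:
            messages.append(match.groups())
    return participants, messages


def precision_recall_f1(predicted, truth):
    """Standard set-based precision, recall, and F1 for one component type."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


ground_truth = """@startuml
participant UE
participant AMF
participant SMF
UE -> AMF : Registration Request
AMF -> SMF : Create Session
AMF --> UE : Registration Accept
@enduml"""

vlm_output = """@startuml
participant UE
participant AMF
UE -> AMF : Registration Request
AMF --> UE : Registration Accept
@enduml"""

true_parts, true_msgs = parse_puml(ground_truth)
pred_parts, pred_msgs = parse_puml(vlm_output)
participant_scores = precision_recall_f1(pred_parts, true_parts)
message_scores = precision_recall_f1(pred_msgs, true_msgs)
```

Scoring participants and messages separately makes it visible which component a model drops; here the invented output misses one participant and one message, so both components show perfect precision but reduced recall.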

[182] Event2Vec: A Geometric Approach to Learning Composable Representations of Event Sequences

Antonin Sulc

Main category: cs.LG

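TL;DR: Event2Vec is a geometric framework for learning composable representations of discrete event sequences, using Euclidean space for linear data and hyperbolic space for hierarchical data.

Details Motivation: Inspired by the importance of geometric and topological structure in neural representations, the work aims to learn composable and interpretable embeddings of event sequences.

Contribution: 1. Proposes the Event2Vec framework and shows that its embeddings in Euclidean space satisfy the linear additive hypothesis; 2. Introduces a hyperbolic-space variant to better embed hierarchical data.

Method: 1. Learns Euclidean embeddings with an additive recurrent structure; 2. Extends the model to hyperbolic space for hierarchical data; 3. Provides a theoretical analysis of convergence and geometric suitability.

Result: Experiments validate the linear additive hypothesis; the hyperbolic model performs better on hierarchical event sequences.

Insight: The choice of geometry matters for event-sequence representations; hyperbolic space is better suited to hierarchical structure.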

Abstract: The study of neural representations, both in biological and artificial systems, is increasingly revealing the importance of geometric and topological structures. Inspired by this, we introduce Event2Vec, a novel framework for learning representations of discrete event sequences. Our model leverages a simple, additive recurrent structure to learn composable, interpretable embeddings. We provide a theoretical analysis demonstrating that, under specific training objectives, our model’s learned representations in a Euclidean space converge to an ideal additive structure. This ensures that the representation of a sequence is the vector sum of its constituent events, a property we term the linear additive hypothesis. To address the limitations of Euclidean geometry for hierarchical data, we also introduce a variant of our model in hyperbolic space, which is naturally suited to embedding tree-like structures with low distortion. We present experiments to validate our hypothesis and demonstrate the benefits of each geometry, highlighting the improved performance of the hyperbolic model on hierarchical event sequences.
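
The linear additive hypothesis can be made concrete with an idealized sketch. The event names and embedding values below are invented, and the recurrence is the ideal additive form the abstract says the learned model converges to, not the trained model itself.

```python
def encode(sequence, embeddings):
    """Idealized additive recurrence: h_t = h_{t-1} + e(x_t), starting from h_0 = 0."""
    dim = len(next(iter(embeddings.values())))
    h = [0.0] * dim
    for event in sequence:
        h = [hi + ei for hi, ei in zip(h, embeddings[event])]
    return h


def add(u, v):
    """Element-wise vector sum."""
    return [a + b for a, b in zip(u, v)]


# Hypothetical 2-D event embeddings, chosen only for illustration.
embeddings = {"login": [1.0, 0.0], "search": [0.0, 2.0], "buy": [3.0, 1.0]}

whole = encode(["login", "search", "buy"], embeddings)
prefix = encode(["login"], embeddings)
suffix = encode(["search", "buy"], embeddings)
```

Under this ideal structure the representation of a sequence is the vector sum of its events, so `whole` equals `add(prefix, suffix)`: sub-sequences compose by addition, which is what makes the embeddings interpretable.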

[183] Multimodal Deep Learning for ATCO Command Lifecycle Modeling and Workload Prediction

Kaizhen Tan

Main category: cs.LG

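TL;DR: The paper proposes a multimodal deep learning framework for predicting lifecycle parameters of air traffic controller (ATCO) commands, namely the time offset and the command duration, in support of workload assessment and scheduling.

Details Motivation: ATCOs issue high-intensity voice commands in dense airspace, so accurate workload modeling is critical for safety and efficiency, yet effective prediction methods are lacking.

Contribution: 1. Constructs a high-quality dataset; 2. Develops an ensemble model combining CNN and Transformer; 3. Links trajectories to voice commands, which the authors describe as the first model of its kind to support intelligent command generation.

Method: Detects maneuver points with sliding-window and histogram-based methods, designs a multimodal framework integrating structured data, trajectory sequences, and image features, and predicts with a CNN-Transformer model.

Result: The model predicts time offset and command duration in an accurate, generalizable, and interpretable way.

Insight: Through multimodal data fusion, the model has practical value for workload assessment and scheduling, and lays groundwork for future intelligent command generation.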

Abstract: Air traffic controllers (ATCOs) issue high-intensity voice commands in dense airspace, where accurate workload modeling is critical for safety and efficiency. This paper proposes a multimodal deep learning framework that integrates structured data, trajectory sequences, and image features to estimate two key parameters in the ATCO command lifecycle: the time offset between a command and the resulting aircraft maneuver, and the command duration. A high-quality dataset was constructed, with maneuver points detected using sliding window and histogram-based methods. A CNN-Transformer ensemble model was developed for accurate, generalizable, and interpretable predictions. By linking trajectories to voice commands, this work offers the first model of its kind to support intelligent command generation and provides practical value for workload assessment, staffing, and scheduling.

[184] CrunchLLM: Multitask LLMs for Structured Business Reasoning and Outcome Prediction

Rabeya Tus Sadia,Qiang Cheng

Main category: cs.LG

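TL;DR: The paper proposes CrunchLLM, an LLM framework combining structured and unstructured data to predict startup success, with accuracy above 80% and support for interpretable reasoning.

Details Motivation: Traditional machine learning for startup success prediction relies on structured data and achieves limited accuracy, while LLMs have rich reasoning ability but are hard to adapt directly to business-domain data. A method is needed that fuses heterogeneous data effectively and improves prediction.

Contribution: 1. Proposes CrunchLLM, which combines structured company attributes with unstructured text for domain-adapted LLM fine-tuning; 2. Improves prediction performance through prompt optimization and parameter-efficient fine-tuning; 3. Provides interpretable reasoning traces.

Method: 1. Integrates structured and unstructured Crunchbase data; 2. Adapts foundation models to the domain using parameter-efficient fine-tuning and prompt optimization; 3. Uses domain-aware fine-tuning to improve prediction and reasoning.

Result: Accuracy exceeds 80% on the Crunchbase dataset, significantly outperforming traditional classifiers and baseline LLMs, with interpretable justification for each prediction.

Insight: A domain-adapted LLM framework can fuse heterogeneous data and improve predictive performance while increasing model transparency, making it suitable for settings such as financial and policy decision-making.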

Abstract: Predicting the success of start-up companies, defined as achieving an exit through acquisition or IPO, is a critical problem in entrepreneurship and innovation research. Datasets such as Crunchbase provide both structured information (e.g., funding rounds, industries, investor networks) and unstructured text (e.g., company descriptions), but effectively leveraging this heterogeneous data for prediction remains challenging. Traditional machine learning approaches often rely only on structured features and achieve moderate accuracy, while large language models (LLMs) offer rich reasoning abilities but struggle to adapt directly to domain-specific business data. We present CrunchLLM, a domain-adapted LLM framework for startup success prediction. CrunchLLM integrates structured company attributes with unstructured textual narratives and applies parameter-efficient fine-tuning strategies alongside prompt optimization to specialize foundation models for entrepreneurship data. Our approach achieves accuracy exceeding 80% on Crunchbase startup success prediction, significantly outperforming traditional classifiers and baseline LLMs. Beyond predictive performance, CrunchLLM provides interpretable reasoning traces that justify its predictions, enhancing transparency and trustworthiness for financial and policy decision makers. This work demonstrates how adapting LLMs with domain-aware fine-tuning and structured–unstructured data fusion can advance predictive modeling of entrepreneurial outcomes. CrunchLLM contributes a methodological framework and a practical tool for data-driven decision making in venture capital and innovation policy.

[185] SelectMix: Enhancing Label Noise Robustness through Targeted Sample Mixing

Qiuhao Liu,Ling Li,Yao Lu,Qi Xuan,Zhaowei Zhu,Jiaheng Wei

Main category: cs.LG

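TL;DR: SelectMix is a confidence-guided sample-mixing framework designed for noisy labels. It identifies noisy samples via K-fold cross-validation and selectively mixes uncertain samples with high-confidence ones, improving robustness under label noise.

Details Motivation: Deep neural networks readily memorize noisy labels, which degrades generalization. Existing Mixup methods lack guidance on sample selection and mixing strategy and may propagate noisy supervision. SelectMix addresses this with a confidence-guided mixing strategy.

Contribution: Proposes SelectMix, which reduces the impact of noisy labels through confidence analysis and selective mixing, and introduces soft labels that more accurately reflect the composition of mixed samples, improving the supervision signal.

Method: Identifies potentially noisy or ambiguous samples with K-fold cross-validation; selectively mixes uncertain samples with confidently predicted ones; generates supervision from soft labels over all classes involved in the mix.

Result: On synthetic (MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100) and real-world (including CIFAR-N and Clothing1M) datasets, SelectMix consistently outperforms strong baselines, supporting its robustness and effectiveness.

Insight: Combining selective mixing with soft labels mitigates the noisy-label problem and offers a new direction for learning with noisy labels. Confidence analysis makes the mixing more targeted.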

Abstract: Deep neural networks tend to memorize noisy labels, severely degrading their generalization performance. Although Mixup has demonstrated effectiveness in improving generalization and robustness, existing Mixup-based methods typically perform indiscriminate mixing without principled guidance on sample selection and mixing strategy, inadvertently propagating noisy supervision. To overcome these limitations, we propose SelectMix, a confidence-guided mixing framework explicitly tailored for noisy labels. SelectMix first identifies potentially noisy or ambiguous samples through confidence based mismatch analysis using K-fold cross-validation, then selectively blends identified uncertain samples with confidently predicted peers from their potential classes. Furthermore, SelectMix employs soft labels derived from all classes involved in the mixing process, ensuring the labels accurately represent the composition of the mixed samples, thus aligning supervision signals closely with the actual mixed inputs. Through extensive theoretical analysis and empirical evaluations on multiple synthetic (MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100) and real-world benchmark datasets (CIFAR-N, MNIST and Clothing1M), we demonstrate that SelectMix consistently outperforms strong baseline methods, validating its effectiveness and robustness in learning with noisy labels.
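
The two steps described above, mismatch detection and mixing with soft labels, can be sketched as follows. This is one plausible reading rather than the authors' implementation: the out-of-fold probabilities are assumed to be given (no model is trained here), the soft label simply weights the two labels by the mixing ratio `lam`, and all values are invented.

```python
def flag_uncertain(given_labels, oof_probs):
    """Flag samples whose out-of-fold predicted class disagrees with the given label."""
    flagged = []
    for i, (label, probs) in enumerate(zip(given_labels, oof_probs)):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted != label:
            flagged.append(i)
    return flagged


def mix(x_uncertain, y_uncertain, x_peer, y_peer, lam, num_classes):
    """Blend two inputs and build a soft label that matches the blend proportions."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_uncertain, x_peer)]
    soft = [0.0] * num_classes
    soft[y_uncertain] += lam       # share contributed by the uncertain sample
    soft[y_peer] += 1 - lam        # share contributed by the confident peer
    return x, soft


# Hypothetical labels and out-of-fold class probabilities for three samples.
given_labels = [0, 1, 1]
oof_probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
flagged = flag_uncertain(given_labels, oof_probs)

# Mix the flagged sample (label 1) with a confident peer of its predicted class 0.
mixed_x, soft_label = mix([1.0, 1.0], 1, [0.0, 0.0], 0, lam=0.25, num_classes=2)
```

Only the third sample is flagged, and the resulting soft label sums to one and mirrors the input blend, so the supervision signal matches what the network actually sees.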

[186] PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits

Loka Li,Wong Yu Kang,Minghao Fu,Guangyi Chen,Zhenhao Chen,Gongxu Luo,Yuewen Sun,Salman Khan,Peter Spirtes,Kun Zhang

Main category: cs.LG

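TL;DR: PersonaX is a collection of multimodal datasets combining behavioral traits, facial images, and biographical information, intended to support cross-modal analysis of behavioral traits.

Details Motivation: Existing datasets rarely combine behavioral traits with complementary modalities such as facial attributes and biographies, which hinders a comprehensive understanding of human behavioral traits.

Contribution: 1) Releases PersonaX, comprising two multimodal datasets, CelebPersona and AthlePersona; 2) Proposes a novel causal representation learning (CRL) framework supporting multimodal data analysis and causal reasoning.

Method: 1) Infers behavioral traits with three high-performing large language models (LLMs); 2) Combines statistical independence tests with high-level trait analysis; 3) Proposes the CRL framework with theoretical identifiability guarantees.

Result: Experiments demonstrate the effectiveness of the CRL framework on synthetic and real-world data, providing a new tool for multimodal behavior analysis.

Insight: PersonaX provides a foundation for jointly analyzing LLM-inferred behavioral traits with visual and biographical information, advancing research on multimodal trait analysis and causal reasoning.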

Abstract: Understanding human behavior traits is central to applications in human-computer interaction, computational social science, and personalized AI systems. Such understanding often requires integrating multiple modalities to capture nuanced patterns and relationships. However, existing resources rarely provide datasets that combine behavioral descriptors with complementary modalities such as facial attributes and biographical information. To address this gap, we present PersonaX, a curated collection of multimodal datasets designed to enable comprehensive analysis of public traits across modalities. PersonaX consists of (1) CelebPersona, featuring 9444 public figures from diverse occupations, and (2) AthlePersona, covering 4181 professional athletes across 7 major sports leagues. Each dataset includes behavioral trait assessments inferred by three high-performing large language models, alongside facial imagery and structured biographical features. We analyze PersonaX at two complementary levels. First, we abstract high-level trait scores from text descriptions and apply five statistical independence tests to examine their relationships with other modalities. Second, we introduce a novel causal representation learning (CRL) framework tailored to multimodal and multi-measurement data, providing theoretical identifiability guarantees. Experiments on both synthetic and real-world data demonstrate the effectiveness of our approach. By unifying structured and unstructured analysis, PersonaX establishes a foundation for studying LLM-inferred behavioral traits in conjunction with visual and biographical attributes, advancing multimodal trait analysis and causal reasoning.

[187] SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching

Jiacheng Liu,Chang Zou,Yuanhuiyi Lyu,Fei Ren,Shaobo Wang,Kaixin Li,Linfeng Zhang

Main category: cs.LG

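TL;DR: SpeCa is a ‘forecast-then-verify’ acceleration framework that predicts intermediate features and dynamically verifies their reliability, substantially speeding up diffusion model inference while maintaining generation quality.

Details Motivation: Diffusion models excel at high-fidelity image and video synthesis, but strict temporal dependencies and heavy computation limit real-time use. The key challenge is accelerating inference without sacrificing quality.

Contribution: 1. Introduces the ‘Forecast-then-verify’ framework, bringing Speculative Sampling to diffusion models; 2. Proposes a parameter-free verification mechanism and a sample-adaptive computation allocation strategy.

Method: 1. Predicts intermediate features and dynamically verifies their reliability; 2. Adjusts computation according to sample complexity, reducing cost for simpler samples.

Result: 6.34x acceleration on FLUX (5.5% quality drop), 7.3x speedup on DiT while preserving fidelity, and a 79.84% VBench score at 6.1x acceleration on HunyuanVideo.

Insight: With its forecast-and-verify mechanism, SpeCa accelerates diffusion models without a significant quality drop, offering an efficient paradigm for real-time applications.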

Abstract: Diffusion models have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. These models face two fundamental challenges: strict temporal dependencies preventing parallelization, and computationally intensive forward passes required at each denoising step. Drawing inspiration from speculative decoding in large language models, we present SpeCa, a novel ‘Forecast-then-verify’ acceleration framework that effectively addresses both limitations. SpeCa’s core innovation lies in introducing Speculative Sampling to diffusion models, predicting intermediate features for subsequent timesteps based on fully computed reference timesteps. Our approach implements a parameter-free verification mechanism that efficiently evaluates prediction reliability, enabling real-time decisions to accept or reject each prediction while incurring negligible computational overhead. Furthermore, SpeCa introduces sample-adaptive computation allocation that dynamically modulates resources based on generation complexity, allocating reduced computation for simpler samples while preserving intensive processing for complex instances. Experiments demonstrate 6.34x acceleration on FLUX with minimal quality degradation (5.5% drop), 7.3x speedup on DiT while preserving generation fidelity, and 79.84% VBench score at 6.1x acceleration for HunyuanVideo. The verification mechanism incurs minimal overhead (1.67%-3.5% of full inference costs), establishing a new paradigm for efficient diffusion model inference while maintaining generation quality even at aggressive acceleration ratios. Our code has been released on GitHub: https://github.com/Shenyi-Z/Cache4Diffusion
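
The control flow of a forecast-then-verify loop can be sketched in a few lines. Everything here is a stand-in chosen for illustration: linear extrapolation is used as the forecaster, a relative L2 error against a threshold `tau` as the check, and `full_step` / `check_step` are hypothetical callables; the actual SpeCa predictor and verification rule may differ.

```python
import math


def forecast(prev2, prev1):
    """Linearly extrapolate the next feature from the two most recent ones."""
    return [2 * b - a for a, b in zip(prev2, prev1)]


def relative_error(predicted, reference):
    """Relative L2 error; needs no learned parameters."""
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)))
    den = math.sqrt(sum(r ** 2 for r in reference)) or 1.0
    return num / den


def run(full_step, check_step, num_steps, tau):
    """Accept a forecast when it passes the check; otherwise fall back to full computation.

    full_step(t): expensive exact feature at step t (hypothetical).
    check_step(t): cheap reference signal used only for verification (hypothetical).
    """
    feats = [full_step(0), full_step(1)]
    accepted = 0
    for t in range(2, num_steps):
        guess = forecast(feats[-2], feats[-1])
        if relative_error(guess, check_step(t)) <= tau:
            feats.append(guess)
            accepted += 1
        else:
            feats.append(full_step(t))
    return feats, accepted


# Toy trajectory that is exactly linear in t, so every forecast is accepted.
def linear_feature(t):
    return [float(t), 2.0 * t + 1.0]


feats, accepted = run(linear_feature, linear_feature, num_steps=6, tau=0.01)
```

In this toy run the same function serves as both the exact step and the check, which a real system would avoid; the point is only that accepted steps skip the expensive computation, while rejected ones recover by computing the feature in full.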

[188] DRAG: Data Reconstruction Attack using Guided Diffusion

Wa-Kin Lei,Jun-Cheng Chen,Shang-Tse Chen

Main category: cs.LG

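TL;DR: The paper proposes DRAG, a new data reconstruction attack based on guided diffusion, which reconstructs high-fidelity original data from deep-layer intermediate representations (IR) of vision foundation models in split inference (SI) settings.

Details Motivation: With the rise of large foundation models, split inference has become a popular computational paradigm, but its privacy risks are underexplored. Existing attacks mainly target small CNN classification models and have not examined foundation models.

Contribution: The main contribution is a data reconstruction attack based on guided diffusion that exploits the prior knowledge of a latent diffusion model (LDM) pre-trained on a large-scale dataset to reconstruct high-fidelity images from deep-layer IRs.

Method: Using a pre-trained latent diffusion model, the attack performs iterative reconstruction on the LDM's learned image prior, generating images that closely resemble the original data from its intermediate representations.

Result: Experiments show that the method significantly outperforms existing methods both qualitatively and quantitatively, underscoring the urgency of privacy protection for large models in SI scenarios.

Insight: Split inference with foundation models carries serious privacy risks, and more robust privacy protection mechanisms are needed.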

Abstract: With the rise of large foundation models, split inference (SI) has emerged as a popular computational paradigm for deploying models across lightweight edge devices and cloud servers, addressing data privacy and computational cost concerns. However, most existing data reconstruction attacks have focused on smaller CNN classification models, leaving the privacy risks of foundation models in SI settings largely unexplored. To address this gap, we propose a novel data reconstruction attack based on guided diffusion, which leverages the rich prior knowledge embedded in a latent diffusion model (LDM) pre-trained on a large-scale dataset. Our method performs iterative reconstruction on the LDM’s learned image prior, effectively generating high-fidelity images resembling the original data from their intermediate representations (IR). Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, both qualitatively and quantitatively, in reconstructing data from deep-layer IRs of the vision foundation model. The results highlight the urgent need for more robust privacy protection mechanisms for large models in SI scenarios. Code is available at: https://github.com/ntuaislab/DRAG.

cs.GR

[189] AD-GS: Alternating Densification for Sparse-Input 3D Gaussian Splatting

Gurutva Patle,Nilay Girgaonkar,Nagabhushan Somraj,Rajiv Soundararajan

Main category: cs.GR

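TL;DR: AD-GS is an alternating densification framework that interleaves controlled high- and low-densification phases to address overfitting and geometric inconsistency in 3D Gaussian Splatting under sparse views, improving rendering quality.

Details Motivation: Under sparse views, 3D Gaussian Splatting (3DGS) tends to produce floaters, inaccurate geometry, and overfitting. The authors find that uncontrolled densification is a key contributing factor.

Contribution: Proposes AD-GS, which alternates high- and low-densification phases to control model capacity growth and progressively refine the scene representation.

Method: The high-densification phase aggressively adds Gaussian primitives to capture detail; the low-densification phase prunes Gaussians by opacity and regularizes them through pseudo-view consistency and edge-aware depth smoothness.

Result: Experiments on multiple datasets show that AD-GS significantly outperforms existing methods in rendering quality and geometric consistency.

Insight: An alternating strategy that controls capacity growth reduces overfitting while progressively refining the scene representation.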

Abstract: 3D Gaussian Splatting (3DGS) has shown impressive results in real-time novel view synthesis. However, it often struggles under sparse-view settings, producing undesirable artifacts such as floaters, inaccurate geometry, and overfitting due to limited observations. We find that a key contributing factor is uncontrolled densification, where adding Gaussian primitives rapidly without guidance can harm geometry and cause artifacts. We propose AD-GS, a novel alternating densification framework that interleaves high and low densification phases. During high densification, the model densifies aggressively, followed by photometric loss based training to capture fine-grained scene details. Low densification then primarily involves aggressive opacity pruning of Gaussians followed by regularizing their geometry through pseudo-view consistency and edge-aware depth smoothness. This alternating approach helps reduce overfitting by carefully controlling model capacity growth while progressively refining the scene representation. Extensive experiments on challenging datasets demonstrate that AD-GS significantly improves rendering quality and geometric consistency compared to existing methods.

[190] SH-SAS: An Implicit Neural Representation for Complex Spherical-Harmonic Scattering Fields for 3D Synthetic Aperture Sonar

Omkar Shailendra Vengurlekar,Adithya Pediredla,Suren Jayasuriya

Main category: cs.GR

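TL;DR: SH-SAS is an implicit neural representation based on spherical harmonics for modeling complex scattering fields in 3D synthetic aperture sonar, outperforming conventional and isotropic neural volumetric methods.

Details Motivation: Conventional time-domain backprojection for 3D SAS reconstruction does not model directionality and suffers from sampling limitations and aliasing, while existing neural volumetric methods handle only isotropic scattering. SH-SAS aims to model direction-dependent scattering fields.

Contribution: Proposes SH-SAS, an implicit neural representation based on spherical harmonic coefficients for modeling complex scattering fields; a lightweight MLP outputs the SH coefficients, covering both isotropic and directional scattering, and the model is trained directly on raw signals.

Method: A multi-resolution hash encoder feeds a lightweight MLP that outputs complex spherical harmonic coefficients up to a specified degree L. The zeroth-order coefficient represents the isotropic scattering field, and higher orders capture directional scattering.

Result: On synthetic and real data (both in-air and underwater), SH-SAS outperforms previous methods in 3D reconstruction quality and geometric metrics.

Insight: Bringing spherical harmonics into implicit neural representations allows compact modeling of directional scattering, and training directly on raw signals without intermediate beamforming supervision is promising.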

Abstract: Synthetic aperture sonar (SAS) reconstruction requires recovering both the spatial distribution of acoustic scatterers and their direction-dependent response. Time-domain backprojection is the most common 3D SAS reconstruction algorithm, but it does not model directionality and can suffer from sampling limitations, aliasing, and occlusion. Prior neural volumetric methods applied to synthetic aperture sonar treat each voxel as an isotropic scattering density, not modeling anisotropic returns. We introduce SH-SAS, an implicit neural representation that expresses the complex acoustic scattering field as a set of spherical harmonic (SH) coefficients. A multi-resolution hash encoder feeds a lightweight MLP that outputs complex SH coefficients up to a specified degree L. The zeroth-order coefficient acts as an isotropic scattering field, which also serves as the density term, while higher orders compactly capture directional scattering with minimal parameter overhead. Because the model predicts the complex amplitude for any transmit-receive baseline, training is performed directly from 1-D time-of-flight signals without the need to beamform intermediate images for supervision. Across synthetic and real SAS (both in-air and underwater) benchmarks, results show that SH-SAS performs better in terms of 3D reconstruction quality and geometric metrics than previous methods.

cs.IR

[191] DSRAG: A Domain-Specific Retrieval Framework Based on Document-derived Multimodal Knowledge Graph

Mengzheng Yang,Yanfei Ren,David Osei Opoku,Ruochang Li,Peng Ren,Chunxiao Xing

Main category: cs.IR

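TL;DR: DSRAG is a domain-specific retrieval-augmented generation framework built on a multimodal knowledge graph. It constructs the graph from heterogeneous information such as text, images, and tables, and combines semantic pruning with structured subgraph retrieval to improve domain-specific question answering.

Details Motivation: General-purpose large language models (LLMs) suffer from knowledge hallucination and insufficient adaptability on domain-specific tasks, limiting their effectiveness in specialized question answering. Traditional retrieval-augmented generation (RAG) still has limitations in domain knowledge accuracy and context modeling, so a more effective domain-specific solution is needed.

Contribution: Proposes the DSRAG framework, which combines a multimodal knowledge graph with retrieval-augmented generation: it builds a knowledge graph covering conceptual and instance layers and designs semantic pruning and structured subgraph retrieval mechanisms, improving the accuracy and reliability of domain-specific question answering.

Method: 1. Builds a multimodal knowledge graph from domain-specific documents, integrating text, images, and tables. 2. Introduces semantic pruning and structured subgraph retrieval, combining knowledge graph context with vector retrieval results to guide language model generation.

Result: Evaluated with the Langfuse multidimensional scoring mechanism, DSRAG performs well on domain-specific question answering, supporting the effectiveness of combining multimodal knowledge graphs with RAG.

Insight: Knowledge graph quality and retrieval design are critical for RAG; integrating multimodal information enriches the representation of domain knowledge and improves the model's specialization and reliability.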

Abstract: Current general-purpose large language models (LLMs) commonly exhibit knowledge hallucination and insufficient adaptability in domain-specific tasks, limiting their effectiveness in specialized question answering scenarios. Retrieval-augmented generation (RAG) effectively tackles these challenges by integrating external knowledge to enhance accuracy and relevance. However, traditional RAG still faces limitations in domain knowledge accuracy and context modeling. To enhance domain-specific question answering performance, this work focuses on a graph-based RAG framework, emphasizing the critical role of knowledge graph quality during the generation process. We propose DSRAG (Domain-Specific RAG), a multimodal knowledge graph-driven retrieval-augmented generation framework designed for domain-specific applications. Our approach leverages domain-specific documents as the primary knowledge source, integrating heterogeneous information such as text, images, and tables to construct a multimodal knowledge graph covering both conceptual and instance layers. Building on this foundation, we introduce semantic pruning and structured subgraph retrieval mechanisms, combining knowledge graph context and vector retrieval results to guide the language model towards producing more reliable responses. Evaluations using the Langfuse multidimensional scoring mechanism show that our method excels in domain-specific question answering, validating the efficacy of integrating multimodal knowledge graphs with retrieval-augmented generation.

[192] Learning Decomposed Contextual Token Representations from Pretrained and Collaborative Signals for Generative Recommendation

Yifan Liu,Yaokun Liu,Zelin Li,Zhenrui Yue,Gyuseok Lee,Ruichen Yao,Yang Zhang,Dong Wang

Main category: cs.IR

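TL;DR: The paper proposes DECOR, a framework that learns decomposed contextual token representations to address the mismatch between tokenizer pretraining and recommender training objectives in generative recommendation, improving recommendation performance.

Details Motivation: In generative recommendation, the tokenizer is pretrained with a different objective from recommender training, leading to static tokenization and loss of pretrained semantics.

Contribution: Proposes DECOR, which preserves pretrained semantics and enhances the adaptability of token embeddings through contextualized token composition and decomposed embedding fusion.

Method: Introduces contextualized token composition to refine embeddings, and fuses pretrained codebook embeddings with collaborative embeddings.

Result: On three real-world datasets, DECOR consistently outperforms existing baselines.

Insight: Addressing the objective misalignment within a unified framework improves generative recommendation performance.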

Abstract: Recent advances in generative recommenders adopt a two-stage paradigm: items are first tokenized into semantic IDs using a pretrained tokenizer, and then large language models (LLMs) are trained to generate the next item via sequence-to-sequence modeling. However, these two stages are optimized for different objectives: semantic reconstruction during tokenizer pretraining versus user interaction modeling during recommender training. This objective misalignment leads to two key limitations: (i) suboptimal static tokenization, where fixed token assignments fail to reflect diverse usage contexts; and (ii) discarded pretrained semantics, where pretrained knowledge - typically from language model embeddings - is overwritten during recommender training on user interactions. To address these limitations, we propose to learn DEcomposed COntextual Token Representations (DECOR), a unified framework that preserves pretrained semantics while enhancing the adaptability of token embeddings. DECOR introduces contextualized token composition to refine token embeddings based on user interaction context, and decomposed embedding fusion that integrates pretrained codebook embeddings with newly learned collaborative embeddings. Experiments on three real-world datasets demonstrate that DECOR consistently outperforms state-of-the-art baselines in recommendation performance. Our code will be made available upon publication.

[193] ReFineG: Synergizing Small Supervised Models and LLMs for Low-Resource Grounded Multimodal NER

Jielong Tang,Shuang Wang,Zhenxing Wang,Jianxing Yu,Jian Yin

Main category: cs.IR

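TL;DR: ReFineG is a three-stage collaborative framework combining small supervised models with a frozen multimodal large language model (MLLM) to address domain knowledge conflict and annotation cost in low-resource grounded multimodal named entity recognition (GMNER). It ranked second in the CCKS2025 shared task.

Details Motivation: Existing supervised methods depend on costly multimodal annotations and underperform in low-resource domains, while MLLMs generalize well but are prone to errors caused by domain knowledge conflict. ReFineG aims to combine the strengths of both.

Contribution: 1. Proposes a three-stage collaborative framework (training, refinement, grounding); 2. Designs a domain-aware NER data synthesis strategy; 3. Introduces an uncertainty-based mechanism and a multimodal context selection algorithm.

Method: 1. Training stage: transfers LLM knowledge to small supervised models using synthesized data; 2. Refinement stage: an uncertainty-based mechanism keeps high-confidence predictions and delegates uncertain ones to the MLLM; 3. Grounding stage: enhances visual grounding through analogical reasoning.

Result: In the CCKS2025 GMNER shared task, ReFineG ranked second with an F1 score of 0.6461, demonstrating its effectiveness in low-resource settings.

Insight: By having supervised models and MLLMs work together, ReFineG shows how to retain domain knowledge while exploiting MLLM generalization, offering a new approach for low-resource multimodal tasks.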

Abstract: Grounded Multimodal Named Entity Recognition (GMNER) extends traditional NER by jointly detecting textual mentions and grounding them to visual regions. While existing supervised methods achieve strong performance, they rely on costly multimodal annotations and often underperform in low-resource domains. Multimodal Large Language Models (MLLMs) show strong generalization but suffer from Domain Knowledge Conflict, producing redundant or incorrect mentions for domain-specific entities. To address these challenges, we propose ReFineG, a three-stage collaborative framework that integrates small supervised models with frozen MLLMs for low-resource GMNER. In the Training Stage, a domain-aware NER data synthesis strategy transfers LLM knowledge to small models with supervised training while avoiding domain knowledge conflicts. In the Refinement Stage, an uncertainty-based mechanism retains confident predictions from supervised models and delegates uncertain ones to the MLLM. In the Grounding Stage, a multimodal context selection algorithm enhances visual grounding through analogical reasoning. In the CCKS2025 GMNER Shared Task, ReFineG ranked second with an F1 score of 0.6461 on the online leaderboard, demonstrating its effectiveness with limited annotations.
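
The Refinement Stage's keep-or-delegate decision can be illustrated with a small routing sketch. The uncertainty measure (Shannon entropy of the small model's label distribution), the threshold, and the `fallback` callable standing in for the MLLM are all assumptions made for illustration; the abstract does not specify how uncertainty is computed.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def route(predictions, threshold, fallback):
    """Keep confident small-model labels; delegate uncertain mentions to a fallback.

    predictions: list of (mention, label_probs, small_model_label).
    fallback: hypothetical callable standing in for a frozen MLLM.
    """
    results = []
    for mention, probs, label in predictions:
        if entropy(probs) <= threshold:
            results.append((mention, label, "small-model"))
        else:
            results.append((mention, fallback(mention), "mllm"))
    return results


# Invented predictions: one confident, one ambiguous.
predictions = [
    ("Paris", [0.98, 0.01, 0.01], "LOC"),
    ("Apple", [0.40, 0.35, 0.25], "ORG"),
]
routed = route(predictions, threshold=0.5, fallback=lambda mention: "ORG")
```

The confident mention stays with the supervised model while the ambiguous one is handed off, which limits MLLM calls to cases where the small model is likely to be wrong.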

q-bio.QM

[194] Introduction to a Low-Cost AI-Powered GUI for Unstained Cell Culture Analysis

Surajit Das,Pavel Zun

Main category: q-bio.QM

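TL;DR: The paper presents a low-cost, AI-based GUI tool for analyzing unstained cell cultures, aimed at labs with limited budgets. The tool uses computer vision and machine learning for cell analysis and requires no manually annotated data or training phase.

Details Motivation: To meet the needs of low-budget labs by providing a cell analysis solution that requires neither expensive equipment nor complex operation, while supporting analysis of unstained cells.

Contribution: Proposes an AI framework that needs no annotated data or training and performs semantic and instance segmentation, feature extraction, and automated report generation.

Method: A Python-based computer vision and machine learning pipeline with a modular architecture and a cross-platform GUI that can be operated without programming skills.

Result: Validated on the public livecells dataset, the tool shows better accuracy and reproducibility than tools such as Cellpose and StarDist, with competitive segmentation speed on a CPU-based platform.

Insight: The tool's light weight and ease of use give it broad potential in basic research and clinical applications such as cell transplantation and muscle regeneration therapies.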

Abstract: This article presents a novel microscopy image analysis framework designed for low-budget labs equipped with a standard CPU desktop. The Python-based program enables cytometric analysis of live, unstained cells in culture through an advanced computer vision and machine learning pipeline. Crucially, the framework operates on label-free data, requiring no manually annotated training data or training phase. It is accessible via a user-friendly, cross-platform GUI that requires no programming skills, while also providing a scripting interface for programmatic control and integration by developers. The end-to-end workflow performs semantic and instance segmentation, feature extraction, analysis, evaluation, and automated report generation. Its modular architecture supports easy maintenance and flexible integration while supporting both single-image and batch processing. Validated on several unstained cell types from the public dataset of livecells, the framework demonstrates superior accuracy and reproducibility compared to contemporary tools like Cellpose and StarDist. Its competitive segmentation speed on a CPU-based platform highlights its significant potential for basic research and clinical applications – particularly in cell transplantation for personalized medicine and muscle regeneration therapies.

eess.IV

[195] MIDOG 2025 Track 2: A Deep Learning Model for Classification of Atypical and Normal Mitotic Figures under Class and Hardness Imbalances

Sujatha Kotte,Vangala Govindakrishnan Saipradeep,Vidushi Walia,Dhandapani Nandagopal,Thomas Joseph,Naveen Sivadasan,Bhagat Singh Lali

Main category: eess.IV

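TL;DR: This paper proposes a novel ResNet-based deep learning model for classifying normal and atypical mitotic figures. It addresses class and hardness imbalance and demonstrates strong performance.

Details Motivation: Accurately classifying mitotic figures as normal or atypical is crucial for tumor prognostication in digital pathology, but subtle morphological differences and data imbalance make robust models hard to develop.

Contribution: Proposes a novel dual-task model that simultaneously models mitotic figure phenotype and instance difficulty, using focal loss and data augmentation to address class imbalance and improve generalization.

Method: Built on a ResNet architecture with specialized classification heads; focal loss handles class imbalance, and comprehensive data augmentation improves robustness.

Result: On the MIDOG 2025 Track 2 dataset, 5-fold cross-validation gave a balanced accuracy of 0.8744 and an ROC AUC of 0.9505, indicating stable performance that generalizes well.

Insight: By jointly modeling classification and difficulty and addressing data imbalance, the method offers a reliable tool for clinical pathological diagnosis and underscores the importance of tackling real-world data challenges.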

Abstract: Motivation: Accurate classification of mitotic figures into normal and atypical types is crucial for tumor prognostication in digital pathology. However, developing robust deep learning models for this task is challenging due to the subtle morphological differences, as well as significant class and hardness imbalances in real-world histopathology datasets. Methods: We propose a novel deep learning approach based on a ResNet backbone with specialized classification heads. Our architecture uniquely models both the mitotic figure phenotype and the instance difficulty simultaneously. This method is specifically designed to handle the challenges of diverse tissue types, scanner variability, and imbalanced data. We employed focal loss to effectively mitigate the pronounced class imbalance, and a comprehensive data augmentation pipeline was implemented to enhance the model’s robustness and generalizability. Results: Our approach demonstrated strong and consistent performance. In a 5-fold cross-validation on the MIDOG 2025 Track 2 dataset, it achieved a mean balanced accuracy of 0.8744 +/- 0.0093 and an ROC AUC of 0.9505 +/- 0.029. The model showed robust generalization across preliminary leaderboard evaluations, achieving an overall balanced accuracy of 0.8736 +/- 0.0204. Conclusion: The proposed method offers a reliable and generalizable solution for the classification of atypical and normal mitotic figures. By addressing the inherent challenges of real world data, our approach has the potential to support precise prognostic assessments in clinical practice and improve consistency in pathological diagnosis.
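
Illustration (not from the paper): the abstract names focal loss as the remedy for class imbalance but gives no hyperparameters. The sketch below shows the standard binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), in plain Python. The values gamma=2.0 and alpha=0.25 are common defaults assumed here for illustration, and the function names are hypothetical.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Plain binary cross-entropy for one sample.

    p: predicted probability of the positive class, y: label in {0, 1}.
    """
    p_t = p if y == 1 else 1.0 - p
    return -math.log(min(max(p_t, eps), 1.0))


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one sample.

    gamma and alpha are assumed defaults, not values from the paper.
    The factor (1 - p_t) ** gamma shrinks the loss of well-classified
    samples, so hard samples dominate the gradient.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    for p in (0.9, 0.6, 0.1):
        print(p, round(cross_entropy(p, 1), 4), round(binary_focal_loss(p, 1), 4))
```

With gamma=0 and alpha=0.5 the expression reduces to half the cross-entropy, which is a quick sanity check when wiring such a loss into a training loop.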

[196] FireGNN: Neuro-Symbolic Graph Neural Networks with Trainable Fuzzy Rules for Interpretable Medical Image Classification

Prajit Sengupta,Islem Rekik

Main category: eess.IV

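TL;DR: FireGNN is an interpretable graph neural network framework that embeds trainable fuzzy rules into GNNs for medical image classification, achieving both high performance and interpretability.

Details Motivation: Medical image classification needs both high predictive performance and interpretability to earn clinical trust, but conventional GNNs are black boxes that lack transparency.

Contribution: To the authors' knowledge, the first integration of trainable fuzzy rules into a GNN; the rules embed topological descriptors and enable symbolic reasoning, supporting interpretable classification.

Method: FireGNN embeds topological descriptors (such as node degree and clustering coefficient) using learnable thresholds and sharpness parameters, and uses auxiliary self-supervised tasks to evaluate the contribution of topological learning.

Result: Strong performance across several medical imaging benchmarks (MedMNIST, MorphoMNIST), together with rule-based explanations.

Insight: Combining fuzzy rules with GNNs not only improves performance but also enhances interpretability, which suits clinical settings.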

Abstract: Medical image classification requires not only high predictive performance but also interpretability to ensure clinical trust and adoption. Graph Neural Networks (GNNs) offer a powerful framework for modeling relational structures within datasets; however, standard GNNs often operate as black boxes, limiting transparency and usability, particularly in clinical settings. In this work, we present an interpretable graph-based learning framework named FireGNN that integrates trainable fuzzy rules into GNNs for medical image classification. These rules embed topological descriptors - node degree, clustering coefficient, and label agreement - using learnable thresholds and sharpness parameters to enable intrinsic symbolic reasoning. Additionally, we explore auxiliary self-supervised tasks (e.g., homophily prediction, similarity entropy) as a benchmark to evaluate the contribution of topological learning. Our fuzzy-rule-enhanced model achieves strong performance across five MedMNIST benchmarks and the synthetic dataset MorphoMNIST, while also generating interpretable rule-based explanations. To our knowledge, this is the first integration of trainable fuzzy rules within a GNN.
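
Illustration (not from the paper): the abstract says the fuzzy rules use learnable thresholds and sharpness parameters but does not give their functional form. One common way to realise this is a sigmoid membership function, sigmoid(sharpness * (x - threshold)), combined across descriptors with a product t-norm. The sketch below assumes that form; the function names and parameter values are hypothetical.

```python
import math


def soft_membership(x, threshold, sharpness):
    """Degree in (0, 1) to which descriptor value x is 'above threshold'.

    Assumed form: sigmoid(sharpness * (x - threshold)). A larger sharpness
    gives a crisper, more step-like rule. Both parameters would be trained
    by gradient descent in a real model.
    """
    z = sharpness * (x - threshold)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # numerically stable branch for negative z
    return e / (1.0 + e)


def rule_strength(descriptors, params):
    """Firing strength of a conjunctive rule (product t-norm, an assumption).

    descriptors: dict name -> value for one node
    params: dict name -> (threshold, sharpness)
    """
    strength = 1.0
    for name, (threshold, sharpness) in params.items():
        strength *= soft_membership(descriptors[name], threshold, sharpness)
    return strength


if __name__ == "__main__":
    # Hypothetical rule: "degree is high AND label agreement is high".
    params = {"degree": (5.0, 1.5), "label_agreement": (0.6, 10.0)}
    node = {"degree": 8.0, "label_agreement": 0.9}
    print(round(rule_strength(node, params), 3))
```

Because each membership is a smooth function of its threshold and sharpness, the rule stays differentiable, and the learned threshold can still be read off as a human-interpretable cut point.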

[197] Automated Cervical Os Segmentation for Camera-Guided, Speculum-Free Screening

Aoife McDonald-Bowyer,Anjana Wijekoon,Ryan Laurance Love,Katie Allan,Scott Colvin,Aleksandra Gentry-Maharaj,Adeola Olaitan,Danail Stoyanov,Agostino Stilli,Sophia Bano

Main category: eess.IV

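TL;DR: This paper studies automated cervical os segmentation for speculum-free cervical screening. Comparing deep learning models, it finds that the vision-transformer-based EndoViT/DPT performs best and supports real-time use.

Details Motivation: Cervical cancer is highly preventable, but persistent barriers to screening limit progress toward elimination. Speculum-free devices that combine imaging and sampling could improve screening access, but they need reliable visual guidance.

Contribution: Compares five encoder-decoder architectures on the cervical os segmentation task, identifies EndoViT/DPT as the best performer, and validates its feasibility for real-time use.

Method: Using 913 frames from the IARC Cervical Image Dataset, model performance is evaluated by ten-fold cross-validation with IoU, DICE, detection rate, and distance metrics.

Result: EndoViT/DPT performed best, with a DICE score of 0.50 ± 0.31 and a detection rate of 0.87 ± 0.33; it was robust in external validation and ran at 21.5 FPS.

Insight: The vision transformer outperformed CNN-based approaches on this segmentation task, providing a technical basis for automated visual guidance in speculum-free cervical screening devices.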

Abstract: Cervical cancer is highly preventable, yet persistent barriers to screening limit progress toward elimination goals. Speculum-free devices that integrate imaging and sampling could improve access, particularly in low-resource settings, but require reliable visual guidance. This study evaluates deep learning methods for real-time segmentation of the cervical os in transvaginal endoscopic images. Five encoder-decoder architectures were compared using 913 frames from 200 cases in the IARC Cervical Image Dataset, annotated by gynaecologists. Performance was assessed using IoU, DICE, detection rate, and distance metrics with ten-fold cross-validation. EndoViT/DPT, a vision transformer pre-trained on surgical video, achieved the highest DICE (0.50 ± 0.31) and detection rate (0.87 ± 0.33), outperforming CNN-based approaches. External validation with phantom data demonstrated robust segmentation under variable conditions at 21.5 FPS, supporting real-time feasibility. These results establish a foundation for integrating automated os recognition into speculum-free cervical screening devices to support non-expert use in both high- and low-resource contexts.
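
Illustration (not from the paper): the two overlap metrics named in the abstract, DICE and IoU, can be computed from binary masks as sketched below. Masks are assumed to be nested lists of 0/1 values with matching shapes, and the convention that two empty masks score 1.0 is an assumption; the paper's exact conventions, including its detection-rate definition, are not stated in the abstract.

```python
def dice_and_iou(pred, target):
    """Return (dice, iou) for two binary masks given as lists of rows of 0/1.

    Dice = 2|A∩B| / (|A| + |B|), IoU = |A∩B| / |A∪B|.
    Two empty masks are scored as a perfect match (assumed convention).
    """
    inter = pred_sum = target_sum = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += p * t
            pred_sum += p
            target_sum += t
    union = pred_sum + target_sum - inter
    if union == 0:
        return 1.0, 1.0
    dice = 2.0 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou


if __name__ == "__main__":
    pred = [[1, 1, 0],
            [0, 1, 0]]
    target = [[1, 0, 0],
              [0, 1, 1]]
    dice, iou = dice_and_iou(pred, target)
    print(round(dice, 4), round(iou, 4))
```

The two metrics are monotonically related by Dice = 2 * IoU / (1 + IoU), so they rank individual predictions identically; averages over many frames can still differ in ordering.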

[198] Adapting Medical Vision Foundation Models for Volumetric Medical Image Segmentation via Active Learning and Selective Semi-supervised Fine-tuning

Jin Yang,Daniel S. Marcus,Aristeidis Sotiras

Main category: eess.IV

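TL;DR: The paper proposes ASFDA (Active Source-Free Domain Adaptation), a new active learning and semi-supervised fine-tuning method for efficiently adapting medical vision foundation models (Med-VFMs) to target domains to improve volumetric medical image segmentation. It selects highly informative samples through active learning and combines this with selective semi-supervised fine-tuning, aiming to maximize performance under a limited annotation budget.

Details Motivation: Med-VFMs gain strong capabilities from self-supervised pre-training, but efficient sample selection and fine-tuning methods for target-domain adaptation are lacking. Existing approaches typically select samples at random and do not fully exploit the model's knowledge, so a method that efficiently selects informative samples for fine-tuning is needed.

Contribution: 1. Proposes ASFDA, which combines active learning with selective semi-supervised fine-tuning to adapt Med-VFMs to target domains efficiently. 2. Designs two active learning query metrics (DKD and ASD) for selecting informative samples. 3. Uses selective semi-supervised fine-tuning to further improve performance and efficiency.

Method: 1. Active learning stage: DKD (measuring the source-target knowledge gap and intra-domain diversity) and ASD (evaluating the segmentation difficulty of anatomical structures) are used to select informative samples. 2. Selective semi-supervised fine-tuning: high-reliability samples are identified among the unlabeled samples and used for semi-supervised fine-tuning.

Result: ASFDA performs well on volumetric medical image segmentation, maximizing performance with a minimal annotation budget.

Insight: 1. Combining active learning with semi-supervised fine-tuning effectively improves model adaptation. 2. The DKD and ASD metrics make full use of the pre-trained model's knowledge and help select high-quality samples.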

Abstract: Medical Vision Foundation Models (Med-VFMs) have superior capabilities for interpreting medical images due to the knowledge learned from self-supervised pre-training with extensive unannotated images. To improve their performance on adaptive downstream evaluations, especially segmentation, a few samples from target domains are selected randomly for fine-tuning them. However, few works explore how to adapt Med-VFMs efficiently to achieve optimal performance on target domains. Thus, an efficient way of fine-tuning Med-VFMs is needed, one that selects informative samples to maximize adaptation performance on target domains. To achieve this, we propose an Active Source-Free Domain Adaptation (ASFDA) method to efficiently adapt Med-VFMs to target domains for volumetric medical image segmentation. This ASFDA employs a novel Active Learning (AL) method to select the most informative samples from target domains for fine-tuning Med-VFMs without access to source pre-training samples, thus maximizing their performance with a minimal selection budget. In this AL method, we design an Active Test Time Sample Query strategy to select samples from the target domains via two query metrics, including Diversified Knowledge Divergence (DKD) and Anatomical Segmentation Difficulty (ASD). DKD is designed to measure the source-target knowledge gap and intra-domain diversity. It utilizes the knowledge of pre-training to guide the querying of source-dissimilar and semantic-diverse samples from the target domains. ASD is designed to evaluate the difficulty in segmentation of anatomical structures by measuring predictive entropy from foreground regions adaptively. Additionally, our ASFDA method employs a Selective Semi-supervised Fine-tuning to improve the performance and efficiency of fine-tuning by identifying samples with high reliability from unqueried ones.
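
Illustration (not from the paper): the abstract describes ASD as measuring predictive entropy over foreground regions but does not specify how the foreground is chosen or how the score is aggregated. The sketch below assumes the simplest reading: a voxel counts as foreground when its most probable class is not the background class, and a sample's score is the mean Shannon entropy over those voxels. All function names are hypothetical, and the DKD metric is not modelled here.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one voxel's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def foreground_entropy_score(voxel_probs, background=0):
    """Mean predictive entropy over voxels predicted as foreground.

    voxel_probs: list of per-voxel class-probability lists.
    Foreground selection by argmax is an assumption of this sketch.
    Returns 0.0 when no voxel is predicted as foreground.
    """
    fg = [p for p in voxel_probs
          if max(range(len(p)), key=p.__getitem__) != background]
    if not fg:
        return 0.0
    return sum(entropy(p) for p in fg) / len(fg)


def select_queries(samples, budget):
    """Pick the `budget` sample names with the highest difficulty score."""
    ranked = sorted(samples,
                    key=lambda name: foreground_entropy_score(samples[name]),
                    reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    samples = {
        "easy": [[0.9, 0.1], [0.05, 0.95]],   # confident foreground voxel
        "hard": [[0.4, 0.6], [0.45, 0.55]],   # uncertain foreground voxels
    }
    print(select_queries(samples, budget=1))
```

Restricting the entropy to foreground voxels keeps the typically large, confidently predicted background from diluting the score in volumetric scans.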

[199] Data-driven Smile Design: Personalized Dental Aesthetics Outcomes Using Deep Learning

Marcus Lin,Jennifer Lai

Main category: eess.IV

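TL;DR: This paper proposes a deep-learning-based smile design system that integrates AI and big data to generate personalized smile designs automatically, aiming to reduce the reliance of traditional methods on dentist expertise and the subjectivity of outcomes.

Details Motivation: Traditional smile design relies on the dentist's expertise and manual work, so results are subjective and the process is inefficient. Digital techniques have brought improvements but still suffer from data bias and other limitations, so a more automated and personalized solution is needed.

Contribution: Proposes a comprehensive system combining AI, big data, and recognition technologies to automate the smile design process, with a facial feature extraction module and an image generation module that serve diverse user needs.

Method: The system has two main parts: 1) a facial feature extraction module, which uses deep learning to analyze the patient's facial features; 2) an image generation module, which generates a personalized smile design from the extracted features.

Result: The system helps both experienced and inexperienced dentists generate aesthetically pleasing smile designs with ease, reducing the subjectivity and limitations of traditional methods.

Insight: Future directions include design optimization, testing virtual and augmented reality for real-time previewing, and using user data for aesthetic preference analysis to put dental smile design on a more scientific footing.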

Abstract: A healthy smile plays a significant role in functional as well as esthetic considerations, improving confidence. It is difficult for dental professionals to strike a balance between esthetic requirements and functional requirements. Traditional smile design has had heavy reliance on dentist expertise and used plaster models and hand drawings, raising questions about the outcome for patients. Digital technology, led by Dr. Christian Coachman in 2007, allows photographic and videographic assessments, enabling improved intercommunication among specialists and patients. Advances in artificial intelligence (AI) and big data have supported analysis of facial features and development of personalized smile designs in the last few years. Outputs are, however, susceptible to practitioner bias or limitations of training data, and may be suboptimal for individual users. The study presented here suggests a comprehensive system integrating AI, big data, and recognition technologies to automate the smile design process so that both experienced and inexperienced dentists can generate pleasing aesthetics with ease. The system has a Facial Feature Extraction Module and an Image Generation Module, serving diverse practitioner and patient needs. User data can be incorporated in future research for design optimization and testing of virtual and augmented reality for real-time previewing. Data gathered can also be employed in aesthetic preference analyses, which can enhance our knowledge of smile design in dental practice.