Table of Contents

cs.CL

[1] Fusion-Augmented Large Language Models: Boosting Diagnostic Trustworthiness via Model Consensus

Md Kamrul Siam,Md Jobair Hossain Faruk,Jerry Q. Cheng,Huanying Gu

Main category: cs.CL

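TL;DR: The study proposes a multi-model fusion framework that combines two large language models (LLMs), ChatGPT and Claude, to make chest X-ray diagnosis more reliable. With a similarity-based consensus method, accuracy rises from 62.8% and 76.9% for the individual models to 77.6% on image-only inputs, and reaches 91.3% with multimodal inputs.

Motivation: Increase the trustworthiness of AI-assisted radiological diagnosis and reduce diagnostic errors while keeping computational overhead low.

Contribution: Introduces a multi-model fusion framework that uses output-level consensus to improve diagnostic accuracy, and shows that multimodal input (images plus synthetic text) improves performance further.

Method: Model consensus is established with a 95% output similarity threshold; evaluation covers image-only prompts and multimodal inputs that pair images with synthetic clinical notes.

Result: Consensus raises accuracy to 77.6% in the unimodal setting and 91.3% in the multimodal setting, outperforming both individual models in each case.

Insight: Combining model consensus with multimodal input is an effective way to improve the reliability of AI diagnosis without heavy computational overhead.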

Abstract: This study presents a novel multi-model fusion framework leveraging two state-of-the-art large language models (LLMs), ChatGPT and Claude, to enhance the reliability of chest X-ray interpretation on the CheXpert dataset. From the full CheXpert corpus of 224,316 chest radiographs, we randomly selected 234 radiologist-annotated studies to evaluate unimodal performance using image-only prompts. In this setting, ChatGPT and Claude achieved diagnostic accuracies of 62.8% and 76.9%, respectively. A similarity-based consensus approach, using a 95% output similarity threshold, improved accuracy to 77.6%. To assess the impact of multimodal inputs, we then generated synthetic clinical notes following the MIMIC-CXR template and evaluated a separate subset of 50 randomly selected cases paired with both images and synthetic text. On this multimodal cohort, performance improved to 84% for ChatGPT and 76% for Claude, while consensus accuracy reached 91.3%. Across both experimental conditions, agreement-based fusion consistently outperformed individual models. These findings highlight the utility of integrating complementary modalities and using output-level consensus to improve the trustworthiness and clinical utility of AI-assisted radiological diagnosis, offering a practical path to reduce diagnostic errors with minimal computational overhead.
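The consensus step described in this abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure is used or what happens when the models disagree, so the `difflib` ratio and the abstain-on-disagreement behavior below are assumptions, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two normalized answers (assumed metric)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def consensus(answer_a: str, answer_b: str, threshold: float = 0.95):
    """Return the shared answer if both outputs agree closely enough, else None.

    Returning None (abstaining) on disagreement is an assumption of this sketch.
    """
    if similarity(answer_a, answer_b) >= threshold:
        return normalize(answer_a)
    return None


print(consensus("Pleural Effusion", "pleural  effusion"))
print(consensus("pleural effusion", "cardiomegaly"))
```

In this sketch a case only receives a label when both models produce nearly identical outputs, which is one plausible reading of "agreement-based fusion".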

[2] Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs

Guiyao Tie,Zenghui Yuan,Zeli Zhao,Chaoran Hu,Tianhe Gu,Ruihang Zhang,Sizhe Zhang,Junran Wu,Xiaoyue Tu,Ming Jin,Qingsong Wen,Lixing Chen,Pan Zhou,Lichao Sun

Main category: cs.CL

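TL;DR: The paper introduces CorrectBench, a benchmark for evaluating self-correction strategies in large language models (LLMs). Self-correction improves accuracy on complex reasoning tasks, mixing strategies helps further but reduces efficiency, and a simple CoT baseline remains competitive.

Motivation: Self-correction is key to improving LLM reasoning, but its methods have not been evaluated comprehensively, and whether LLMs can truly correct themselves remains an open question.

Contribution: 1. Proposes CorrectBench, which evaluates three kinds of self-correction strategies (intrinsic, external, and fine-tuned); 2. Finds that mixing strategies improves results but reduces efficiency; 3. Shows that a CoT baseline is competitive.

Method: Uses CorrectBench to test self-correction strategies on commonsense reasoning, mathematical reasoning, and code generation, and compares mixed strategies with single strategies.

Result: Self-correction improves accuracy, but mixed strategies are less efficient; the CoT baseline performs competitively; reasoning LLMs gain little from additional self-correction and incur high time costs.

Insight: Self-correction can improve LLM reasoning, but accuracy must be balanced against efficiency; future work should focus on optimizing that balance.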

Abstract: Self-correction of large language models (LLMs) emerges as a critical component for enhancing their reasoning performance. Although various self-correction methods have been proposed, a comprehensive evaluation of these methods remains largely unexplored, and the question of whether LLMs can truly correct themselves is a matter of significant interest and concern. In this study, we introduce CorrectBench, a benchmark developed to evaluate the effectiveness of self-correction strategies, including intrinsic, external, and fine-tuned approaches, across three tasks: commonsense reasoning, mathematical reasoning, and code generation. Our findings reveal that: 1) Self-correction methods can improve accuracy, especially for complex reasoning tasks; 2) Mixing different self-correction strategies yields further improvements, though it reduces efficiency; 3) Reasoning LLMs (e.g., DeepSeek-R1) have limited optimization under additional self-correction methods and have high time costs. Interestingly, a comparatively simple chain-of-thought (CoT) baseline demonstrates competitive accuracy and efficiency. These results underscore the potential of self-correction to enhance LLMs’ reasoning performance while highlighting the ongoing challenge of improving their efficiency. Consequently, we advocate for further research focused on optimizing the balance between reasoning capabilities and operational efficiency. Project Page: https://correctbench.github.io/

[3] EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

Rong Wu,Xiaoman Wang,Jianbiao Mei,Pinlong Cai,Daocheng Fu,Cheng Yang,Licheng Wen,Xuemeng Yang,Yufan Shen,Yuxin Wang,Botian Shi

Main category: cs.CL

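TL;DR: EvolveR is a framework that lets LLM agents evolve through an experience-driven lifecycle with two stages, offline self-distillation and online interaction, and it improves performance on complex tasks.

Motivation: Existing LLM agents are strong at tool use but cannot learn systematically from their own experience, so they cannot iteratively refine their problem-solving strategies.

Contribution: Proposes the EvolveR framework, whose closed-loop lifecycle (offline self-distillation and online interaction) lets agents improve themselves, and reports strong results on multi-hop question-answering benchmarks.

Method: Combines offline self-distillation (distilling interaction trajectories into a repository of abstract strategic principles) with online interaction (retrieving those principles to guide decisions), and updates the agent iteratively through a policy reinforcement mechanism.

Result: Outperforms strong agentic baselines on multi-hop question-answering tasks, supporting the framework's effectiveness.

Insight: By learning from the consequences of their own actions, agents can become more autonomous and keep improving.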

Abstract: Current Large Language Model (LLM) agents show strong performance in tool use, but lack the crucial capability to systematically learn from their own experiences. While existing frameworks mainly focus on mitigating external knowledge gaps, they fail to address a more fundamental limitation: the inability to iteratively refine problem-solving strategies. In this work, we introduce EvolveR, a framework designed to enable agents to self-improve through a complete, closed-loop experience lifecycle. This lifecycle comprises two key stages: (1) Offline Self-Distillation, where the agent’s interaction trajectories are synthesized into a structured repository of abstract, reusable strategic principles; (2) Online Interaction, where the agent interacts with tasks and actively retrieves distilled principles to guide its decision-making, accumulating a diverse set of behavioral trajectories. This loop employs a policy reinforcement mechanism to iteratively update the agent based on its performance. We demonstrate the effectiveness of EvolveR on complex multi-hop question-answering benchmarks, where it achieves superior performance over strong agentic baselines. Our work presents a comprehensive blueprint for agents that learn not only from external data but also from the consequences of their own actions, paving the way for more autonomous and continuously improving systems. Code is available at https://github.com/Edaizi/EvolveR.

[4] EgMM-Corpus: A Multimodal Vision-Language Dataset for Egyptian Culture

Mohamed Gamil,Abdelrahman Elsayed,Abdelrahman Lila,Ahmed Gad,Hesham Abdelgawad,Mohamed Aref,Ahmed Fares

Main category: cs.CL

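TL;DR: The paper introduces EgMM-Corpus, a multimodal dataset for Egyptian culture with over 3,000 images covering 313 cultural concepts, intended to address the shortage of culturally diverse data for the Middle East and Africa.

Motivation: Despite recent advances in AI, multimodal culturally diverse datasets remain limited, particularly for the Middle East and Africa.

Contribution: Presents EgMM-Corpus, a multimodal dataset dedicated to Egyptian culture with over 3,000 images covering 313 concepts (landmarks, food, and folklore), each entry manually validated for cultural authenticity and multimodal coherence.

Method: Designs and runs a new data collection pipeline with manual validation of data quality, then evaluates the zero-shot performance of CLIP.

Result: CLIP reaches 21.2% Top-1 and 36.4% Top-5 classification accuracy on EgMM-Corpus, revealing cultural bias in large-scale vision-language models.

Insight: EgMM-Corpus provides a benchmark for developing culturally aware models and highlights the limitations of existing models on culture-specific tasks.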

Abstract: Despite recent advances in AI, multimodal culturally diverse datasets are still limited, particularly for regions in the Middle East and Africa. In this paper, we introduce EgMM-Corpus, a multimodal dataset dedicated to Egyptian culture. By designing and running a new data collection pipeline, we collected over 3,000 images, covering 313 concepts across landmarks, food, and folklore. Each entry in the dataset is manually validated for cultural authenticity and multimodal coherence. EgMM-Corpus aims to provide a reliable resource for evaluating and training vision-language models in an Egyptian cultural context. We further evaluate the zero-shot performance of Contrastive Language-Image Pre-training (CLIP) on EgMM-Corpus, on which it achieves 21.2% Top-1 accuracy and 36.4% Top-5 accuracy in classification. These results underscore the existing cultural bias in large-scale vision-language models and demonstrate the importance of EgMM-Corpus as a benchmark for developing culturally aware models.
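For readers unfamiliar with the Top-1 and Top-5 figures quoted in this abstract, the sketch below shows how top-k accuracy is usually computed from ranked predictions. The concept names and rankings are made up for illustration and are not taken from EgMM-Corpus.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k):
    """Fraction of examples whose gold label appears among the first k predictions."""
    if not gold_labels or len(ranked_predictions) != len(gold_labels):
        raise ValueError("need equally sized, non-empty predictions and labels")
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_predictions, gold_labels))
    return hits / len(gold_labels)


# Hypothetical ranked outputs of a zero-shot classifier, best guess first.
ranked = [
    ["pyramids", "sphinx", "koshari"],
    ["falafel", "koshari", "molokhia"],
    ["sphinx", "pyramids", "tanoura"],
]
gold = ["pyramids", "koshari", "tanoura"]

print(top_k_accuracy(ranked, gold, k=1))
print(top_k_accuracy(ranked, gold, k=3))
```

Top-5 accuracy is always at least as high as Top-1, since a correct first guess also counts as a hit within the first five.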

[5] What Can String Probability Tell Us About Grammaticality?

Jennifer Hu,Ethan Gotlieb Wilcox,Siyuan Song,Kyle Mahowald,Roger P. Levy

Main category: cs.CL

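TL;DR: The paper examines the relationship between language model (LM) string probability and grammaticality in the linguistic sense, derives three theoretical predictions centered on minimal pairs, and validates them on English and Chinese data.

Motivation: Probability and grammaticality are distinct notions in linguistics, so it is not obvious what string probabilities reveal about an LM's grammatical knowledge; the study aims to clarify this.

Contribution: Proposes a theoretical framework relating grammar, meaning, and string probability, and validates three predictions: (1) the probabilities of the two strings in a minimal pair are correlated; (2) models' and humans' deltas within minimal pairs are correlated; (3) unpaired grammatical and ungrammatical strings are poorly separated in probability space.

Method: Starting from simple assumptions about the generative process of corpus data, runs an empirical analysis on 280K English and Chinese sentence pairs to test the predictions.

Result: The experiments support all three predictions, indicating that LM probabilities can partly reflect grammatical knowledge.

Insight: Provides theoretical grounding for using probability to study LMs' structural knowledge and points to directions for future work on LM grammatical evaluation.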

Abstract: What have language models (LMs) learned about grammar? This question remains hotly debated, with major ramifications for linguistic theory. However, since probability and grammaticality are distinct notions in linguistics, it is not obvious what string probabilities can reveal about an LM’s underlying grammatical knowledge. We present a theoretical analysis of the relationship between grammar, meaning, and string probability, based on simple assumptions about the generative process of corpus data. Our framework makes three predictions, which we validate empirically using 280K sentence pairs in English and Chinese: (1) correlation between the probability of strings within minimal pairs, i.e., string pairs with minimal semantic differences; (2) correlation between models’ and humans’ deltas within minimal pairs; and (3) poor separation in probability space between unpaired grammatical and ungrammatical strings. Our analyses give theoretical grounding for using probability to learn about LMs’ structural knowledge, and suggest directions for future work in LM grammatical evaluation.
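Predictions (1) and (3) in this abstract can be made concrete with a toy calculation. The log-probabilities below are invented for illustration and are not results from the paper; the helper names are hypothetical. The point is only that a model can rank every minimal pair correctly while grammatical and ungrammatical strings still overlap heavily once the pairing is ignored.

```python
from itertools import product


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def paired_accuracy(pairs):
    """Share of minimal pairs where the grammatical string is more probable."""
    return sum(g > u for g, u in pairs) / len(pairs)


def unpaired_separation(pairs):
    """Share of all (grammatical, ungrammatical) cross-combinations ranked correctly."""
    good = [g for g, _ in pairs]
    bad = [u for _, u in pairs]
    return sum(g > u for g, u in product(good, bad)) / (len(good) * len(bad))


# Illustrative (grammatical, ungrammatical) log-probabilities; not data from the paper.
pairs = [(-10.0, -12.0), (-25.0, -27.5), (-40.0, -41.0), (-55.0, -58.0)]

print(paired_accuracy(pairs))
print(unpaired_separation(pairs))
print(pearson([g for g, _ in pairs], [u for _, u in pairs]))
```

In this toy data the within-pair probabilities are almost perfectly correlated because both strings of a pair share most of their content, which is why the unpaired comparison separates the two classes poorly.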

[6] Towards Low-Resource Alignment to Diverse Perspectives with Sparse Feedback

Chu Fei Luo,Samuel Dahan,Xiaodan Zhu

Main category: cs.CL

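TL;DR: The paper proposes methods for aligning language models to diverse perspectives in a low-resource setting, using pluralistic decoding and model steering to reduce false positives and improve alignment to human values.

Motivation: The training paradigms of modern language models often assume one optimal answer per query, which leads to generic responses and poor alignment. The paper aims to improve pluralistic alignment in a low-resource setting.

Contribution: 1. Proposes two methods, pluralistic decoding and model steering, to strengthen pluralistic alignment. 2. Improves alignment in a low-resource setting with only 50 annotated samples.

Method: 1. Pluralistic decoding: generates diverse responses to capture different perspectives. 2. Model steering: adjusts model behavior using a small amount of high-quality data.

Result: Model steering improves consistently over zero-shot and few-shot baselines; the methods reduce false positives in hate speech and misinformation detection and improve distributional alignment to human values on GlobalOpinionQA.

Insight: The work underscores the importance of diversity and shows how language models can adapt to nuanced perspectives with little annotated data.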

Abstract: As language models have a greater impact on society, it is important to ensure they are aligned to a diverse range of perspectives and are able to reflect nuance in human values. However, the most popular training paradigms for modern language models often assume there is one optimal answer for every query, leading to generic responses and poor alignment. In this work, we aim to enhance pluralistic alignment of language models in a low-resource setting with two methods: pluralistic decoding and model steering. We empirically demonstrate that model steering offers consistent improvement over zero-shot and few-shot baselines with only 50 annotated samples. Our proposed methods decrease false positives in several high-stakes tasks such as hate speech detection and misinformation detection, and improve the distributional alignment to human values in GlobalOpinionQA. We hope our work highlights the importance of diversity and how language models can be adapted to consider nuanced perspectives.

[7] Instant Personalized Large Language Model Adaptation via Hypernetwork

Zhaoxuan Tan,Zixuan Zhang,Haoyang Wen,Zheng Li,Rongzhi Zhang,Pei Chen,Fengran Mo,Zheyuan Liu,Qingkai Zeng,Qingyu Yin,Meng Jiang

Main category: cs.CL

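TL;DR: The paper proposes Profile-to-PEFT, a framework in which a hypernetwork maps user profiles to adapter parameters, enabling instant personalized LLM adaptation without the per-user training cost of earlier methods.

Motivation: Existing parameter-efficient fine-tuning methods such as OPPU train a separate adapter for each user, which is computationally expensive, hard to update in real time, and ill-suited to large-scale use.

Contribution: Proposes the Profile-to-PEFT framework, which uses a hypernetwork to generate adapter parameters, enabling instant personalization, generalization to unseen users, and privacy-preserving local deployment.

Method: A hypernetwork trained end-to-end maps a user's encoded profile directly to adapter parameters (e.g., LoRA), with no per-user training.

Result: The method outperforms prompt-based personalization and OPPU while using fewer computational resources at deployment, and generalizes to unseen users.

Insight: Hypernetworks are an effective route to efficient, scalable LLM personalization, especially for large-scale real-time applications.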

Abstract: Personalized large language models (LLMs) tailor content to individual preferences using user profiles or histories. However, existing parameter-efficient fine-tuning (PEFT) methods, such as the "One-PEFT-Per-User" (OPPU) paradigm, require training a separate adapter for each user, making them computationally expensive and impractical for real-time updates. We introduce Profile-to-PEFT, a scalable framework that employs a hypernetwork, trained end-to-end, to map a user’s encoded profile directly to a full set of adapter parameters (e.g., LoRA), eliminating per-user training at deployment. This design enables instant adaptation, generalization to unseen users, and privacy-preserving local deployment. Experimental results demonstrate that our method outperforms both prompt-based personalization and OPPU while using substantially fewer computational resources at deployment. The framework exhibits strong generalization to out-of-distribution users and maintains robustness across varying user activity levels and different embedding backbones. The proposed Profile-to-PEFT framework enables efficient, scalable, and adaptive LLM personalization suitable for large-scale applications.

[8] Thinking About Thinking: Evaluating Reasoning in Post-Trained Language Models

Pratham Singla,Shivank Garg,Ayush Singh,Ishan Garg,Ketan Suhaas Saichandran

Main category: cs.CL

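TL;DR: The paper studies the reasoning of post-trained language models on logic-intensive tasks, defines three core competencies, and compares models trained with different post-training methods.

Motivation: Assess whether large language models are actually aware of what they "learn" and "think", especially on logic-intensive tasks.

Contribution: Defines three core competencies (awareness of learned policies, cross-domain generalization, and alignment between reasoning traces and outputs) and compares models post-trained with SFT, DPO, and GRPO.

Method: Designs several tasks that each require learning a distinct policy and compares SFT-, DPO-, and GRPO-trained models on the three competencies.

Result: RL-trained models (notably DPO and GRPO) show greater policy awareness and generalization than SFT models, but weaker alignment between reasoning traces and final outputs.

Insight: The post-training method strongly affects a model's policy awareness, generalization, and reasoning alignment; the weak alignment between reasoning traces and outputs is most pronounced in GRPO-trained models.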

Abstract: Recent advances in post-training techniques have endowed Large Language Models (LLMs) with enhanced capabilities for tackling complex, logic-intensive tasks through the generation of supplementary planning tokens. This development raises a fundamental question: Are these models aware of what they “learn” and “think”? To address this, we define three core competencies: (1) awareness of learned latent policies, (2) generalization of these policies across domains, and (3) alignment between internal reasoning traces and final outputs. We empirically evaluate these abilities on several tasks, each designed to require learning a distinct policy. Furthermore, we contrast the profiles of models post-trained via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Our findings indicate that RL-trained models not only demonstrate greater awareness of their learned behaviors and stronger generalizability to novel, structurally similar tasks than SFT models but also often exhibit weak alignment between their reasoning traces and final outputs, an effect most pronounced in GRPO-trained models.

[9] End-to-End Argument Mining through Autoregressive Argumentative Structure Prediction

Nilmadhab Das,Vishal Vaibhav,Yash Sunil Choudhary,V. Vijaya Saradhi,Ashish Anand

Main category: cs.CL

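TL;DR: The paper proposes Autoregressive Argumentative Structure Prediction (AASP), an end-to-end framework for argument mining that builds argumentative structures step by step from a predefined set of actions, models dependencies between components and relations, and achieves state-of-the-art or near state-of-the-art results on several benchmarks.

Motivation: Argument mining (AM) involves extracting complex argumentative structures and relations, yet existing approaches tend to neglect dependencies between components and relations. The paper aims to model these dependencies more effectively with an autoregressive framework.

Contribution: Proposes the AASP framework, which jointly formulates the AM tasks as an autoregressive sequence of step-wise actions and uses a pre-trained language model to build argumentative structures incrementally, giving an end-to-end solution.

Method: Builds on the autoregressive structure prediction framework: a constrained, predefined set of actions is generated step by step by a conditional pre-trained language model to produce argument components and relations and capture the flow of reasoning.

Result: On three standard AM benchmarks, AASP achieves state-of-the-art results on two and strong results on the third.

Insight: Autoregressive methods can model dependencies in argumentative structures effectively, and predefined actions offer an efficient way to generate structures step by step.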

Abstract: Argument Mining (AM) helps in automating the extraction of complex argumentative structures such as Argument Components (ACs) like Premise, Claim etc. and Argumentative Relations (ARs) like Support, Attack etc. in an argumentative text. Due to the inherent complexity of reasoning involved with this task, modelling dependencies between ACs and ARs is challenging. Most of the recent approaches formulate this task through a generative paradigm by flattening the argumentative structures. In contrast to that, this study jointly formulates the key tasks of AM in an end-to-end fashion using the Autoregressive Argumentative Structure Prediction (AASP) framework. The proposed AASP framework is based on the autoregressive structure prediction framework that has given good performance for several NLP tasks. The AASP framework models the argumentative structures as constrained pre-defined sets of actions with the help of a conditional pre-trained language model. These actions build the argumentative structures step-by-step in an autoregressive manner to capture the flow of argumentative reasoning in an efficient way. Extensive experiments conducted on three standard AM benchmarks demonstrate that AASP achieves state-of-the-art (SoTA) results across all AM tasks in two benchmarks and delivers strong results in one benchmark.

[10] Navigating through the hidden embedding space: steering LLMs to improve mental health assessment

Federico Ravenda,Seyed Ali Bahrainian,Andrea Raballo,Antonietta Mira

Main category: cs.CL

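TL;DR: The paper proposes a lightweight method that adjusts an LLM's hidden-layer activations with a linear transformation and steering vectors to improve its performance on mental health assessment.

Motivation: Despite rapid progress in LLMs, smaller-scale models still underperform in specific domains such as mental health (MH). The paper aims to improve an LLM's MH assessment capabilities at low cost.

Contribution: Proposes a lightweight method that needs no computationally intensive techniques and uses steering vectors to adjust hidden-layer activations, improving LLM performance on mental health tasks.

Method: Applies a linear transformation to the activations of a specific layer, using steering vectors to guide the model's output; evaluated on relevance prediction for Reddit posts and on completing a depression screening questionnaire.

Result: The method improves results on both tasks, showing the potential of steering mechanisms as an efficient tool for LLM domain adaptation.

Insight: Simple adjustments to hidden-layer activations can improve LLM performance in sensitive domains such as mental health, suggesting a low-cost route to domain adaptation.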

Abstract: The rapid evolution of Large Language Models (LLMs) is transforming AI, opening new opportunities in sensitive and high-impact areas such as Mental Health (MH). Yet, despite these advancements, recent evidence reveals that smaller-scale models still struggle to deliver optimal performance in domain-specific applications. In this study, we present a cost-efficient yet powerful approach to improve MH assessment capabilities of an LLM, without relying on any computationally intensive techniques. Our lightweight method consists of a linear transformation applied to a specific layer’s activations, leveraging steering vectors to guide the model’s output. Remarkably, this intervention enables the model to achieve improved results across two distinct tasks: (1) identifying whether a Reddit post is useful for detecting the presence or absence of depressive symptoms (relevance prediction task), and (2) completing a standardized psychological screening questionnaire for depression based on users’ Reddit post history (questionnaire completion task). Results highlight the untapped potential of steering mechanisms as computationally efficient tools for LLMs’ MH domain adaptation.
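To give a feel for what a steering intervention on a layer's activations can look like, here is a minimal sketch in plain Python. The abstract does not specify how the steering vectors are built or how the linear transformation is parameterized, so the difference-of-means construction and the additive update h' = h + alpha * v are assumptions borrowed from common activation-steering practice, and all names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equally long activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Difference of class means: one common (assumed) way to build a steering vector."""
    pos, neg = mean_vector(positive_acts), mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


def steer(hidden, vector, alpha=1.0):
    """Shift one hidden state along the steering direction: h' = h + alpha * v."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 2-dimensional activations from one layer for two groups of example inputs.
relevant = [[1.0, 2.0], [3.0, 4.0]]
irrelevant = [[0.0, 1.0], [2.0, 1.0]]

v = steering_vector(relevant, irrelevant)
print(v)
print(steer([0.5, 0.5], v, alpha=2.0))
```

The appeal of this kind of intervention is that it needs only forward passes to collect activations and a vector addition at inference time, with no gradient updates to the model weights.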

[11] MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

Yu Ying Chiu,Michael S. Lee,Rachel Calcott,Brandon Handoko,Paul de Font-Reaulx,Paula Rodriguez,Chen Bo Calvin Zhang,Ziwen Han,Udari Madhushani Sehwag,Yash Maurya,Christina Q Knight,Harry R. Lloyd,Florence Bacus,Mantas Mazeika,Bing Liu,Yejin Choi,Mitchell L Gordon,Sydney Levine

Main category: cs.CL

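TL;DR: MoReBench is a benchmark that evaluates the procedural and pluralistic reasoning of language models in moral decision-making, rather than outcomes alone.

Motivation: As AI systems play a growing role in decisions, we need to understand whether their decision processes align with human values, particularly in moral dilemmas.

Contribution: 1. Presents MoReBench, with 1,000 moral scenarios and over 23,000 rubric criteria; 2. Presents MoReBench-Theory, which tests whether AI can reason under five major frameworks in normative ethics; 3. Finds that existing benchmarks and scaling laws fail to predict models' moral reasoning ability.

Method: 1. Builds the MoReBench dataset, covering identification of moral considerations, weighing of trade-offs, and actionable recommendations; 2. Evaluates models under different ethical frameworks; 3. Compares against existing benchmarks.

Result: Models show partiality towards specific moral frameworks (e.g., utilitarianism and deontology), and existing benchmarks fail to predict their moral reasoning ability.

Insight: The study stresses the need for transparency in AI decision processes and shows that training paradigms may introduce ethical bias, informing safer AI development.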

Abstract: As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To do so, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria including identifying moral considerations, weighing trade-offs, and giving actionable recommendations to cover cases of AI advising humans on moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models’ abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.

[12] ATA: A Neuro-Symbolic Approach to Implement Autonomous and Trustworthy Agents

David Peer,Sebastian Stabinger

Main category: cs.CL

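TL;DR: ATA is a neuro-symbolic approach that decouples tasks into offline knowledge ingestion and online task processing to address the limited trustworthiness of LLMs in high-stakes domains, yielding autonomous agents that are both effective and trustworthy.

Motivation: In high-stakes domains, LLMs suffer from hallucinations, instability, and a lack of transparency, which limits their trustworthiness. ATA aims to address these problems with a neuro-symbolic approach.

Contribution: Proposes a generic neuro-symbolic framework that separates knowledge ingestion from task processing, produces a verifiable symbolic knowledge base, and pairs it with a symbolic decision engine for reliable reasoning.

Method: Two phases: 1) offline, an LLM translates an informal problem specification into a formal knowledge base; 2) online, each input is encoded into the same formal language, and a symbolic decision engine derives the result from the knowledge base and the encoded input.

Result: In a fully automated setup, ATA is competitive with state-of-the-art end-to-end reasoning models; with a human-verified knowledge base it significantly outperforms even larger models, while exhibiting perfect determinism, enhanced stability against input perturbations, and immunity to prompt injection attacks.

Insight: ATA shows the potential of neuro-symbolic methods for improving LLM trustworthiness and offers a practical architecture for transparent, auditable, and reliable autonomous agents.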

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities, yet their deployment in high-stakes domains is hindered by inherent limitations in trustworthiness, including hallucinations, instability, and a lack of transparency. To address these challenges, we introduce a generic neuro-symbolic approach, which we call Autonomous Trustworthy Agents (ATA). The core of our approach lies in decoupling tasks into two distinct phases: Offline knowledge ingestion and online task processing. During knowledge ingestion, an LLM translates an informal problem specification into a formal, symbolic knowledge base. This formal representation is crucial as it can be verified and refined by human experts, ensuring its correctness and alignment with domain requirements. In the subsequent task processing phase, each incoming input is encoded into the same formal language. A symbolic decision engine then utilizes this encoded input in conjunction with the formal knowledge base to derive a reliable result. Through an extensive evaluation on a complex reasoning task, we demonstrate that a concrete implementation of ATA is competitive with state-of-the-art end-to-end reasoning models in a fully automated setup while maintaining trustworthiness. Crucially, with a human-verified and corrected knowledge base, our approach significantly outperforms even larger models, while exhibiting perfect determinism, enhanced stability against input perturbations, and inherent immunity to prompt injection attacks. By generating decisions grounded in symbolic reasoning, ATA offers a practical and controllable architecture for building the next generation of transparent, auditable, and reliable autonomous agents.
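The split between a formal knowledge base and a symbolic decision engine can be pictured with a very small forward-chaining sketch. This is not the paper's formalism or engine, which the abstract does not describe; the rule format, the `derive` function, and the insurance-style facts are all hypothetical, and the LLM steps that would draft the rules and encode each input are left out.

```python
def derive(facts, rules):
    """Forward-chain over (premises, conclusion) rules until no new fact appears."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical knowledge base. In the described approach an LLM would draft it
# offline, and a human expert could verify and correct it before use.
RULES = [
    (frozenset({"claim_filed", "policy_active"}), "claim_eligible"),
    (frozenset({"claim_eligible", "amount_below_limit"}), "approve"),
]

# Hypothetical encoded input; in the described approach an LLM produces this encoding.
encoded_input = {"claim_filed", "policy_active", "amount_below_limit"}

print("approve" in derive(encoded_input, RULES))
```

Because the decision is computed by the rule engine rather than generated as text, the same encoded input always yields the same result, which is the kind of determinism the abstract emphasizes.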

[13] Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment

Fu-An Chao,Bi-Cheng Yan,Berlin Chen

Main category: cs.CL

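TL;DR: The paper explores the potential of Whisper, an automatic speech recognition foundation model, for L2 spoken language assessment. By extracting acoustic and linguistic features from its hidden representations and training only a lightweight classifier, it outperforms existing methods on the GEPT dataset.

Motivation: Prior work mostly analyzes Whisper's transcriptions, leaving its latent capabilities for spoken language assessment (SLA) underexplored. The paper probes how well Whisper's hidden representations encode language proficiency.

Contribution: 1. Proposes a lightweight assessment method built on Whisper's hidden representations; 2. Shows that auxiliary multimodal information further improves performance; 3. Shows that Whisper encodes proficiency levels and semantic information without fine-tuning.

Method: Extracts acoustic and linguistic features from Whisper's intermediate and final outputs, trains a lightweight classifier on top of them, and uses image and text-prompt information as auxiliary relevance cues.

Result: Outperforms existing methods on the GEPT dataset, including a multimodal baseline. Embedding analysis shows that Whisper intrinsically encodes ordinal proficiency patterns and semantic aspects of speech.

Insight: Whisper's hidden representations carry rich proficiency information, indicating potential for SLA and other spoken language understanding tasks without task-specific fine-tuning.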

Abstract: In this paper, we explore the untapped potential of Whisper, a well-established automatic speech recognition (ASR) foundation model, in the context of L2 spoken language assessment (SLA). Unlike prior studies that extrinsically analyze transcriptions produced by Whisper, our approach goes a step further to probe its latent capabilities by extracting acoustic and linguistic features from hidden representations. With only a lightweight classifier being trained on top of Whisper’s intermediate and final outputs, our method achieves strong performance on the GEPT picture-description dataset, outperforming existing cutting-edge baselines, including a multimodal approach. Furthermore, by incorporating image and text-prompt information as auxiliary relevance cues, we demonstrate additional performance gains. Finally, we conduct an in-depth analysis of Whisper’s embeddings, which reveals that, even without task-specific fine-tuning, the model intrinsically encodes both ordinal proficiency patterns and semantic aspects of speech, highlighting its potential as a powerful foundation for SLA and other spoken language understanding tasks.

[14] FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution

Syed Rifat Raiyan,Md Farhan Ishmam,Abdullah Al Imran,Mohammad Ali Moni

Main category: cs.CL

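TL;DR: The paper proposes FrugalPrompt, a framework that compresses prompts by retaining the tokens with the highest semantic weight, reducing the contextual overhead of LLMs, and evaluates it on several NLP tasks.

Motivation: LLM performance relies on long input contexts, which raise cost and latency. Many prompts contain redundant, low-utility tokens that make inference inefficient.

Contribution: Proposes the FrugalPrompt framework, which uses two token attribution methods, GlobEnc and DecompX, to score token salience and compress prompts while retaining tokens with high semantic weight.

Method: Scores and ranks the tokens of an input sequence by salience and keeps the top-k% most salient tokens in their original order, yielding a sparse prompt.

Result: On sentiment analysis, commonsense QA, and summarization, a 20% prompt reduction causes only a marginal performance loss; on mathematical reasoning, performance deteriorates sharply.

Insight: Tasks differ in their tolerance to contextual sparsity, and conventional NLP tasks may rely on patterns memorized during pretraining rather than on the full context.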

Abstract: Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. Much of this overhead manifests from the redundant low-utility tokens present in typical prompts, as only a fraction of tokens typically carries the majority of the semantic weight. We address this inefficiency by introducing FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX, we assign salience scores to every token in an input sequence, rank them to preserve the top-k% tokens in their original order, and obtain a sparse frugalized prompt. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning, using a suite of frontier LLMs. For the first three tasks, a 20% prompt reduction incurs only a marginal loss in task performance, demonstrating that contemporary LLMs can reconstruct elided context from high-salience cues. In contrast, performance on mathematical reasoning deteriorates sharply, reflecting a stronger dependence on complete token continuity. Further analysis with bottom-k% and random-k% tokens reveals asymmetric performance patterns that may suggest potential task contamination effects, wherein models may resort to shallow memorized patterns from pretraining exposure for conventional NLP tasks. We posit that our work contributes to a more nuanced understanding of LLM behavior in performance-efficiency trade-offs, and delineate the boundary between tasks tolerant to contextual sparsity and those requiring exhaustive context. Our source code and models are available at: https://github.com/Starscream-11813/Frugal-ICL
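The ranking-and-filtering step described in this abstract (keep the top-k% tokens by salience, in their original order) can be sketched as follows. The salience scores here are invented; in the paper they come from GlobEnc or DecompX, which this sketch does not reproduce, and the function name `frugalize` and the rounding rule are assumptions.

```python
import math


def frugalize(tokens, scores, keep_percent):
    """Keep the top keep_percent% tokens by salience, preserving original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must align")
    # Rounding up and keeping at least one token is a choice made for this sketch.
    k = max(1, math.ceil(len(tokens) * keep_percent / 100))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore the original left-to-right order
    return [tokens[i] for i in keep]


# Hypothetical tokens and salience scores for a sentiment-analysis prompt.
tokens = ["the", "movie", "was", "absolutely", "wonderful"]
scores = [0.05, 0.30, 0.05, 0.20, 0.40]

print(frugalize(tokens, scores, keep_percent=60))
print(frugalize(tokens, scores, keep_percent=80))
```

A 20% prompt reduction, as discussed in the abstract, corresponds to `keep_percent=80` in this sketch.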

[15] TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model

Bin Yu,Xinming Wang,Shijie Lian,Haotian Li,Changti Wu,Ruina Hu,Bailing Wang,Yuliang Wei,Kai Chen

Main category: cs.CL

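TL;DR: TrajSelector is an efficient and effective Best-of-N framework that uses the LLM's hidden states for process-level scoring and a lightweight verifier to select the best reasoning trajectory, improving both accuracy and efficiency.

Motivation: Existing Best-of-N selection methods improve performance but incur high computational overhead and underuse the LLM's latent representations; TrajSelector aims to address both limitations.

Contribution: Proposes the TrajSelector framework, which uses the sampler LLM's hidden states for process-level scoring and a lightweight verifier (only 0.6B parameters) to select the best reasoning trajectory.

Method: A fully data-driven, end-to-end training recipe; the verifier produces step-level scores from the LLM's hidden states and aggregates them to select the best trajectory.

Result: In Best-of-32 settings, TrajSelector surpasses majority voting by 4.61% accuracy and existing process reward models by 4.31% to 12.21%, at lower inference cost.

Insight: Making full use of the LLM's latent representations can improve reasoning performance efficiently, and a lightweight verifier is key to efficient selection.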

Abstract: Large language models (LLMs) have shown remarkable progress in complex reasoning tasks, largely enabled by test-time scaling (TTS) paradigms that allocate additional compute during inference. Among these, external TTS (particularly the Best-of-N selection paradigm) yields scalable performance improvements by selecting from multiple independently generated reasoning trajectories. However, this approach faces key limitations: (i) the high computational overhead of deploying process reward models, and (ii) the underutilization of the LLM’s intrinsic latent representations. We introduce TrajSelector, an efficient and effective Best-of-N framework that exploits the hidden states in the sampler LLM for process-level scoring. A lightweight verifier (with only 0.6B parameters) evaluates the quality of each step of a trajectory, and then aggregates these scores to identify the optimal reasoning trajectory. Our framework employs a fully data-driven, end-to-end training recipe that eliminates reliance on massive step-level annotations. Experimental results across five benchmarks demonstrate that TrajSelector delivers consistent performance gains. In Best-of-32 settings, it surpasses majority voting by 4.61% accuracy and outperforms existing process reward models by 4.31% to 12.21%, all while maintaining lower inference costs.
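The contrast between majority voting and score-based Best-of-N selection can be shown in a few lines. The step scores below are invented and would in practice come from the verifier; averaging them is an assumption, since the abstract only says the scores are aggregated, and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Baseline: return the most frequent final answer among the N trajectories."""
    return Counter(answers).most_common(1)[0][0]


def select_by_step_scores(answers, step_scores):
    """Return the answer of the trajectory with the highest mean step score."""
    means = [sum(scores) / len(scores) for scores in step_scores]
    best = max(range(len(answers)), key=means.__getitem__)
    return answers[best]


# Three sampled trajectories: final answers and hypothetical per-step verifier scores.
answers = ["41", "41", "42"]
step_scores = [
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.2],
    [0.9, 0.9, 0.8],
]

print(majority_vote(answers))
print(select_by_step_scores(answers, step_scores))
```

The example is built so that the two strategies disagree: the majority answer comes from trajectories whose later steps score poorly, while the selected trajectory scores well throughout.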

[16] RAVEN: Robust Advertisement Video Violation Temporal Grounding via Reinforcement Reasoning

Deyi Ji,Yuekui Yang,Haiyang Wu,Shaoping Ma,Tianrun Chen,Lanyun Zhu

Main category: cs.CL

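TL;DR: RAVEN is a framework that combines curriculum reinforcement learning with multimodal large language models (MLLMs) to strengthen reasoning and cognitive capabilities for advertisement video violation detection.

Motivation: Existing methods for advertisement video violation detection struggle with temporal grounding, noisy annotations, and generalization.

Contribution: 1. Proposes RAVEN, a framework that integrates curriculum reinforcement learning with MLLMs. 2. Uses Group Relative Policy Optimization (GRPO) to strengthen reasoning. 3. Designs a hierarchical reward mechanism for precise temporal grounding and consistent category prediction.

Method: A curriculum reinforcement learning strategy that combines precisely and coarsely annotated data and relies on GRPO and a multi-level reward mechanism.

Result: On industrial and public datasets, RAVEN performs strongly on violation category accuracy and temporal localization, and online A/B testing confirms its practical value.

Insight: Combining curriculum reinforcement learning with GRPO improves reasoning and generalization while mitigating the catastrophic forgetting associated with supervised fine-tuning.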

Abstract: Advertisement (Ad) video violation detection is critical for ensuring platform compliance, but existing methods struggle with precise temporal grounding, noisy annotations, and limited generalization. We propose RAVEN, a novel framework that integrates curriculum reinforcement learning with multimodal large language models (MLLMs) to enhance reasoning and cognitive capabilities for violation detection. RAVEN employs a progressive training strategy, combining precisely and coarsely annotated data, and leverages Group Relative Policy Optimization (GRPO) to develop emergent reasoning abilities without explicit reasoning annotations. A sophisticated multi-level hierarchical reward mechanism ensures precise temporal grounding and consistent category prediction. Experiments on industrial datasets and public benchmarks show that RAVEN achieves superior performance in violation category accuracy and temporal interval localization. We also design a pipeline to deploy RAVEN on online Ad services, and online A/B testing further validates its practical applicability, with significant improvements in precision and recall. RAVEN also demonstrates strong generalization, mitigating the catastrophic forgetting issue associated with supervised fine-tuning.
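Two ingredients mentioned in this abstract have simple numerical cores: a temporal overlap measure that a grounding reward could build on, and the group-relative normalization that gives GRPO its name. The sketch below shows both under stated assumptions: the abstract does not define RAVEN's reward terms, so using interval IoU is a guess, and the use of population standard deviation and the `eps` constant are implementation choices that vary between GRPO implementations.

```python
from statistics import mean, pstdev


def temporal_iou(pred, gold):
    """Intersection over union of two (start, end) intervals, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group.

    A_i = (r_i - mean(r)) / (std(r) + eps), computed over one group of samples.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical predicted vs. annotated violation interval.
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))

# Hypothetical rewards for four responses sampled for the same video.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because advantages are computed relative to the other samples in the group, no separate value network is needed to provide a baseline.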

[17] Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations

Pingjun Hong,Beiduo Chen,Siyao Peng,Marie-Catherine de Marneffe,Benjamin Roth,Barbara Plank

Main category: cs.CL

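TL;DR: The paper uses explanations to decompose human label variation in natural language inference (NLI), examining how annotators diverge in both label choice and reasoning type, and shows that surface-level label disagreement can hide agreement in interpretation.

Motivation: NLI datasets often exhibit human label variation. To understand it better, the paper uses explanations to analyze the reasoning behind annotators' decisions.

Contribution: Extends the use of the LiTEx taxonomy beyond within-label variation (annotators agree on the label but explain it differently) to divergence in label choice and reasoning type. By aligning annotation variation along several aspects (label agreement, explanation similarity, and taxonomy agreement), it reveals annotators' individual preferences and reasoning strategies.

Method: Applies the LiTEx taxonomy to two English NLI datasets and aligns annotation variation along NLI label agreement, explanation similarity, and taxonomy agreement, with annotators' selection bias as an additional factor.

Result: Annotators sometimes disagree on the label yet give highly similar explanations, suggesting that surface-level disagreement may mask underlying agreement in interpretation. The analysis also reveals individual preferences in explanation strategies and label choices.

Insight: Agreement in reasoning types reflects the semantic similarity of free-text explanations better than label agreement alone, underscoring the richness of reasoning-based explanations and the need for caution in treating labels as ground truth.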

Abstract: Natural Language Inference datasets often exhibit human label variation. To better understand these variations, explanation-based approaches analyze the underlying reasoning behind annotators’ decisions. One such approach is the LiTEx taxonomy, which categorizes free-text explanations in English into reasoning types. However, previous work applying such taxonomies has focused on within-label variation: cases where annotators agree on the final NLI label but provide different explanations. In contrast, this paper broadens the scope by examining how annotators may diverge not only in the reasoning type but also in the labeling step. We use explanations as a lens to decompose the reasoning process underlying NLI annotation and to analyze individual differences. We apply LiTEx to two NLI English datasets and align annotation variation from multiple aspects: NLI label agreement, explanation similarity, and taxonomy agreement, with an additional compounding factor of annotators’ selection bias. We observe instances where annotators disagree on the label but provide highly similar explanations, suggesting that surface-level disagreement may mask underlying agreement in interpretation. Moreover, our analysis reveals individual preferences in explanation strategies and label choices. These findings highlight that agreement in reasoning types better reflects the semantic similarity of free-text explanations than label agreement alone. Our findings underscore the richness of reasoning-based explanations and the need for caution in treating labels as ground truth.

[18] Unleashing Diverse Thinking Modes in LLMs through Multi-Agent Collaboration

Zhixuan He,Yue Feng

Main category: cs.CL

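TL;DR: The paper proposes DiMo, a multi-agent collaboration framework that improves model performance and interpretability by simulating a structured debate among four LLM agents, each with a different reasoning paradigm.

Details Motivation: Large language models (LLMs) perform strongly, but their reasoning is often hard to interpret. DiMo aims to improve reasoning transparency and accuracy through multi-agent collaboration that simulates a human debate.

Contribution: The main contribution is the DiMo framework, which explores diverse cognitive modes through structured multi-agent debate, improving performance and producing interpretable reasoning chains.

Method: DiMo comprises four LLM agents, each specialized in a different reasoning paradigm, that challenge and refine initial responses through iterative debate, yielding more robust conclusions and a traceable evidence chain.

Result: Across six benchmarks, DiMo outperforms single-model and conventional debate baselines, with the largest gains on math tasks.

Insight: DiMo shows the potential of multi-agent collaboration for improving LLM reasoning and interpretability, and points toward future Web-native systems that combine retrieval-augmented reasoning with knowledge graphs.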

Abstract: Large Language Models (LLMs) demonstrate strong performance but often lack interpretable reasoning. This paper introduces the Multi-Agent Collaboration Framework for Diverse Thinking Modes (DiMo), which enhances both performance and interpretability by simulating a structured debate among four specialized LLM agents. Each agent embodies a distinct reasoning paradigm, allowing the framework to collaboratively explore diverse cognitive approaches. Through iterative debate, agents challenge and refine initial responses, yielding more robust conclusions and an explicit, auditable reasoning chain. Across six benchmarks and under a unified open-source setup, DiMo improves accuracy over widely used single-model and debate baselines, with the largest gains on math. We position DiMo as a semantics-aware, Web-native multi-agent framework: it models human-machine intelligence with LLM agents that produce semantically typed, URL-annotated evidence chains for explanations and user-friendly interactions. Although our experiments use standard reasoning benchmarks, the framework is designed to be instantiated over Web corpora and knowledge graphs, combining retrieval-augmented reasoning with structured justifications that downstream systems can inspect and reuse.

[19] All You Need is One: Capsule Prompt Tuning with a Single Vector

Yiyang Liu,James C. Liang,Heng Fan,Wenhao Yang,Yiming Cui,Xiaotian Han,Lifu Huang,Dongfang Liu,Qifan Wang,Cheng Han

Main category: cs.CL

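TL;DR: The paper proposes Capsule Prompt-Tuning (CaPT), an efficient prompt-tuning method with very few parameters that uses a single vector (the capsule prompt) to combine instance-aware and task-aware information, significantly improving language model performance.

Details Motivation: Current prompt-based learning methods need many parameters and a grid search to find the best prompt length, and they lack instance-aware information, which limits performance. The authors find that adding instance-aware information improves model performance and propose a more efficient solution.

Contribution: Proposes CaPT, which combines instance-aware and task-aware information in a single capsule prompt, improving performance with very few parameters.

Method: CaPT takes off-the-shelf instance semantics and combines them with a task-related prompt to form a capsule prompt. The method needs no additional tuning and uses only one vector.

Result: Strong performance across multiple language tasks (e.g., 84.03% average accuracy on T5-Large) with very high parameter efficiency (e.g., only 0.003% of model parameters on Llama3.2-1B).

Insight: The paper identifies an “attention anchor” phenomenon: placing instance-aware information at the earliest position of the sequence strengthens attention to critical structural information and leads to more active attention interaction.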

Abstract: Prompt-based learning has emerged as a parameter-efficient finetuning (PEFT) approach to facilitate Large Language Model (LLM) adaptation to downstream tasks by conditioning generation with task-aware guidance. Despite its successes, current prompt-based learning methods heavily rely on laborious grid searching for optimal prompt length and typically require a considerable number of prompts, introducing additional computational burden. Worse yet, our pioneering findings indicate that the task-aware prompt design is inherently limited by its absence of instance-aware information, leading to a subtle attention interplay with the input sequence. In contrast, simply incorporating instance-aware information as a part of the guidance can enhance the prompt-tuned model performance without additional fine-tuning. Moreover, we find an interesting phenomenon, namely “attention anchor”, that incorporating instance-aware tokens at the earliest position of the sequence can successfully preserve strong attention to critical structural information and exhibit more active attention interaction with all input tokens. In light of our observation, we introduce Capsule Prompt-Tuning (CaPT), an efficient and effective solution that leverages off-the-shelf, informative instance semantics into prompt-based learning. Our approach innovatively integrates both instance-aware and task-aware information in a nearly parameter-free manner (i.e., one single capsule prompt). Empirical results demonstrate that our method can exhibit superior performance across various language tasks (e.g., 84.03% average accuracy on T5-Large), serving as an “attention anchor,” while enjoying high parameter efficiency (e.g., 0.003% of model parameters on Llama3.2-1B).

[20] Temporal Understanding under Deictic Frame of Reference

Damin Zhang,Julia Rayz

Main category: cs.CL

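TL;DR: The paper proposes TUuD, a framework for evaluating how well large language models (LLMs) understand time-event and event-event relations when the temporal reference point shifts, and finds that the models partially mirror human temporal cognition.

Details Motivation: Humans understand time through spatial metaphors, but LLMs remain limited in temporal reasoning. The paper studies how LLMs handle temporal relations when the reference point “now” shifts.

Contribution: Proposes the TUuD framework, which quantitatively evaluates how LLMs adapt to a shifting temporal reference point, and shows that LLMs partially mirror human temporal cognition.

Method: LLMs are prompted to rate the similarity between the current moment and a target event from 0.00 to 1.00, and their behavior is studied as the temporal reference point shifts.

Result: The LLMs adapt partially to the shifting reference point: similarity ratings peak near “now” and fall off for events further in the past or future.

Insight: LLM temporal reasoning is sensitive to reference-point shifts and temporal distance; it resembles human temporal cognition but does not fully match it.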

Abstract: Understanding time is fundamental to human cognition, where temporal experience is often conceptualized through spatial metaphors grounded in sensory-motor experience. For example, “summer is approaching” parallels “We are approaching the summer”. In such expressions, humans rely on a frame of reference (FoR) to interpret meaning relative to a particular viewpoint. Extending this concept to time, a temporal frame of reference (t-FoR) defines how temporal relations are perceived relative to an experiencer’s moment of “now”. While Large Language Models (LLMs) have shown remarkable advances in natural language understanding, their ability to interpret and reason about time remains limited. In this work, we introduce TUuD (Temporal Understanding under Deictic t-FoR), a framework that evaluates how LLMs interpret time-event and event-event relations when the reference point of “now” dynamically shifts along a timeline. Following recent work on temporal cognition \cite{li2025other}, LLMs are prompted to rate the similarity between the current moment and a target event from 0.00 (completely dissimilar) to 1.00 (highly similar), where similarity quantifies perceived temporal alignment between the two points. Our results show that four evaluated LLMs exhibit measurable adaptation to a deictic t-FoR, with similarity ratings peaking around the present and decreasing toward past and future events. The adaptation, however, weakens beyond near-term contexts, suggesting that while LLMs display partial human-like temporal cognition, their temporal reasoning remains sensitive to reference-frame shifts and temporal distance.

[21] Investigating the Impact of Rationales for LLMs on Natural Language Understanding

Wenhang Shi,Shuqing Bian,Yiren Chen,Xinyi Zhang,Zhe Zhao,Pengfei Hu,Wei Lu,Xiaoyong Du

Main category: cs.CL

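TL;DR: The paper studies the effect of chain-of-thought (CoT) rationales on natural language understanding (NLU) tasks, builds the NLURC dataset collection, and finds that the effect of rationales on model performance depends closely on model size and method design.

Details Motivation: Existing work focuses on how rationales help reasoning tasks and overlooks their potential impact on NLU tasks.

Contribution: Builds the NLURC dataset collection, systematically explores the role of rationales in NLU tasks, and develops an effective rationale-augmented training method.

Method: Designs several rationale-augmented methods, either generating rationales at inference time or varying where rationales are placed during training, and evaluates them on NLURC.

Result: The effect of rationales on NLU tasks varies with model size, and certain training methods improve performance significantly, with trained models rivaling much larger ones.

Insight: Rationales must be introduced with careful attention to model scale and task characteristics, or they can backfire; their potential is most evident in cross-task generalization and interpretability.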

Abstract: Chain-of-thought (CoT) rationales, which provide step-by-step reasoning to derive final answers, benefit LLMs in both inference and training. Incorporating rationales, either by generating them before answering during inference or by placing them before or after the original answers during training, significantly improves model performance on mathematical, symbolic and commonsense reasoning tasks. However, most work focuses on the role of rationales in these reasoning tasks, overlooking their potential impact on other important tasks like natural language understanding (NLU) tasks. In this work, we raise the question: Can rationales similarly benefit NLU tasks? To conduct a systematic exploration, we construct NLURC, a comprehensive and high-quality NLU dataset collection with rationales, and develop various rationale-augmented methods. Through exploring the applicability of these methods on NLU tasks using the dataset, we uncover several potentially surprising findings: (1) CoT inference shifts from hindering NLU performance to surpassing direct label prediction as model size grows, indicating a positive correlation. (2) Most rationale-augmented training methods perform worse than label-only training, with one specially designed method consistently achieving improvements. (3) LLMs trained with rationales achieve significant performance gains on unseen NLU tasks, rivaling models ten times their size, while delivering interpretability on par with commercial LLMs.

[22] Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

Sanskar Pandey,Ruhaan Chopra,Angkul Puniya,Sohom Pal

Main category: cs.CL

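TL;DR: The paper proposes Beacon, a single-turn benchmark for measuring and mitigating latent sycophancy in large language models; it reveals linguistic and affective sub-biases and proposes interventions.

Details Motivation: During optimization, large language models can conflate helpfulness with compliance, producing sycophancy (a preference for the user’s opinion over fact-based reasoning). A method is needed to measure and intervene on this bias independent of conversational context.

Contribution: 1. Introduces the Beacon benchmark for single-turn measurement of sycophancy; 2. Reveals linguistic and affective sub-biases of sycophancy and their relationship to model capacity; 3. Proposes prompt-level and activation-level interventions.

Method: 1. Beacon isolates the bias through a forced-choice task; 2. Analyzes the stable sub-structure of the bias; 3. Modulates the bias with prompt-level and activation-level interventions.

Result: Evaluations across twelve state-of-the-art models show that sycophancy decomposes into sub-biases and that the interventions can shift the bias in either direction.

Insight: Sycophancy is a form of normative misgeneralization, and interventions expose the dynamic geometry of alignment.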

Abstract: Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.

[23] Enhancing Language Agent Strategic Reasoning through Self-Play in Adversarial Games

Yikai Zhang,Ye Rong,Siyu Yuan,Jiangjie Chen,Jian Xie,Yanghua Xiao

Main category: cs.CL

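TL;DR: The paper proposes SCO-PAL, which significantly improves language agents’ strategic reasoning through self-play in adversarial games, raising win rates against several opponents and reaching a 54.76% win rate against GPT-4.

Details Motivation: Existing language agents have weak strategic reasoning in dynamic adversarial games, and relying on expert-labeled data is costly. The paper aims to improve agents through automatic learning from game interactions.

Contribution: Proposes SCO-PAL, systematically analyzes the effect of opponent selection, and finds that self-play is the most effective way to optimize the policy. Experiments show significant performance gains.

Method: Uses SCO-PAL to optimize the policy through self-play in adversarial games, with opponents set at different levels to study their effect on learning.

Result: SCO-PAL raises the agent’s average win rate against four opponents by about 30% across six adversarial games and achieves a 54.76% win rate against GPT-4.

Insight: Self-play is highly effective for improving strategic reasoning in adversarial environments and does not depend on expert-labeled data.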

Abstract: Existing language agents often encounter difficulties in dynamic adversarial games due to poor strategic reasoning. To mitigate this limitation, a promising approach is to allow agents to learn from game interactions automatically, without relying on costly expert-labeled data. Unlike static environments where agents receive fixed feedback or rewards, selecting appropriate opponents in dynamic adversarial games can significantly impact learning performance. However, the discussion of opponents in adversarial environments remains an area under exploration. In this paper, we propose a Step-level poliCy Optimization method through Play-And-Learn, SCO-PAL. Leveraging SCO-PAL, we conduct a detailed analysis of opponent selection by setting opponents at different levels and find that self-play is the most effective way to improve strategic reasoning in such adversarial environments. Utilizing SCO-PAL with self-play, we increase the average win rate against four opponents by approximately 30% compared to baselines and achieve a 54.76% win rate against GPT-4 in six adversarial games.

[24] LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding

Sheikh Jubair,Arwa Omayrah,Amal Alshammari,Alhanoof Althnian,Abdulhamed Alothaimen,Norah A. Alzahrani,Shahad D. Alzaidi,Nora Al-Twairesh,Abdulmohsen Al-Thubaity

Main category: cs.CL

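TL;DR: LC-Eval is a bilingual, multi-task evaluation benchmark for long-context understanding in English and Arabic, covering context lengths from 4k to over 128k tokens.

Details Motivation: As large language models (LLMs) become better at handling long contexts, rigorous evaluation methods are needed to quantify their performance.

Contribution: Proposes LC-Eval, an evaluation benchmark with four novel and challenging tasks that supports bilingual (English and Arabic), multi-task evaluation.

Method: Designs four tasks: multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions over long contexts, covering abilities such as deep reasoning and document comprehension.

Result: Even high-performing models such as GPT-4o struggle on certain tasks, which indicates the benchmark’s difficulty.

Insight: LC-Eval provides a standardized tool for evaluating long-context understanding and exposes the limitations of current LLMs.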

Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated sophisticated capabilities, including the ability to process and comprehend extended contexts. These emergent capabilities necessitate rigorous evaluation methods to effectively assess their performance in long-context understanding. In this paper, we present LC-Eval, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic, targeting context lengths ranging from 4k to over 128k tokens. LC-Eval introduces four novel and challenging tasks: multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions based on long contexts. These tasks are designed to assess LLMs’ abilities in deep reasoning, document comprehension, information tracing, and bilingual information extraction and understanding. The benchmark includes datasets in both Arabic and English for each task, allowing for a comparative analysis of their performance across different text genres. Evaluations were conducted on both open-weight and closed LLMs, with results indicating that LC-Eval presents significant challenges. Even high-performing models, such as GPT-4o, struggled with certain tasks, highlighting the complexity and rigor of the benchmark.

[25] MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning

Vera Pavlova,Mohammed Makhlouf

Main category: cs.CL

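TL;DR: MOSAIC is a multi-stage domain adaptation framework for sentence embedding models that combines masked supervision with contrastive learning and delivers substantial gains in both high-resource and low-resource domains.

Details Motivation: Large-scale general-domain sentence embedding models adapt poorly to specialized domains, and efficient domain adaptation methods are lacking, which degrades semantic discrimination.

Contribution: 1) Proposes the MOSAIC framework, which integrates masked language modeling and contrastive learning objectives; 2) Introduces selective adaptation and a multi-stage training strategy; 3) Validates effectiveness in high- and low-resource domains.

Method: Jointly optimizes masked language modeling (MLM) and contrastive objectives, combined with multi-stage training and selective adaptation, to learn domain-relevant representations.

Result: Improves NDCG@10 by up to 13.4%, and ablation studies confirm the importance of each component.

Insight: 1) Joint supervision and multi-stage training are essential for domain adaptation; 2) Selective adaptation effectively balances general and domain-specific representations.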

Abstract: We introduce MOSAIC (Masked Objective with Selective Adaptation for In-domain Contrastive learning), a multi-stage framework for domain adaptation of sentence embedding models that incorporates joint domain-specific masked supervision. Our approach addresses the challenges of adapting large-scale general-domain sentence embedding models to specialized domains. By jointly optimizing masked language modeling (MLM) and contrastive objectives within a unified training pipeline, our method enables effective learning of domain-relevant representations while preserving the robust semantic discrimination properties of the original model. We empirically validate our approach on both high-resource and low-resource domains, achieving improvements up to 13.4% in NDCG@10 (Normalized Discounted Cumulative Gain) over strong general-domain baselines. Comprehensive ablation studies further demonstrate the effectiveness of each component, highlighting the importance of balanced joint supervision and staged adaptation.
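The headline result is reported in NDCG@10. As a reminder of what that metric measures, here is a minimal sketch using the common formulation with a logarithmic position discount; the abstract does not say which NDCG variant or evaluation toolkit was used, so the gain function (raw relevance) and the function names are assumptions.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results, in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Graded relevance of the documents a retriever returned, in ranked order.
perfect = ndcg_at_k([3, 2, 1])   # already ideally ordered
swapped = ndcg_at_k([0, 1])      # the only relevant document is ranked second
```

This version normalizes against the returned list only; a full evaluation would compute the ideal ordering over all judged documents for the query.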

[26] Knowing the Facts but Choosing the Shortcut: Understanding How Large Language Models Compare Entities

Hans Hergen Lehmann,Jae Hee Lee,Steven Schockaert,Stefan Wermter

Main category: cs.CL

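TL;DR: The paper studies how large language models (LLMs) rely on heuristic biases rather than genuine knowledge in entity comparison tasks, identifies three main biases, and shows that larger models choose more judiciously when to rely on knowledge.

Details Motivation: LLMs are increasingly used for knowledge-based reasoning, but it remains unclear when they rely on genuine knowledge and when on superficial heuristics. The paper investigates this through entity comparison tasks.

Contribution: 1. Identifies three heuristic biases that LLMs rely on (entity popularity, mention order, and semantic co-occurrence); 2. Shows that larger models selectively rely on knowledge when it is more reliable; 3. Finds that chain-of-thought prompting steers models toward using numerical knowledge.

Method: Systematically analyzes LLM behavior on entity comparison tasks (e.g., comparing river lengths), uses regression analysis to measure the influence of heuristic biases, and compares models of different sizes.

Result: Smaller models rely on heuristic biases, whereas larger models (32B parameters) selectively rely on knowledge; chain-of-thought prompting substantially increases the use of numerical knowledge at all model sizes.

Insight: Model scale affects how knowledge is used; prompting strategies such as chain-of-thought can effectively steer models from heuristics toward knowledge-based reasoning.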

Abstract: Large Language Models (LLMs) are increasingly used for knowledge-based reasoning tasks, yet understanding when they rely on genuine knowledge versus superficial heuristics remains challenging. We investigate this question through entity comparison tasks by asking models to compare entities along numerical attributes (e.g., “Which river is longer, the Danube or the Nile?”), which offer clear ground truth for systematic analysis. Despite having sufficient numerical knowledge to answer correctly, LLMs frequently make predictions that contradict this knowledge. We identify three heuristic biases that strongly influence model predictions: entity popularity, mention order, and semantic co-occurrence. For smaller models, a simple logistic regression using only these surface cues predicts model choices more accurately than the model’s own numerical predictions, suggesting heuristics largely override principled reasoning. Crucially, we find that larger models (32B parameters) selectively rely on numerical knowledge when it is more reliable, while smaller models (7–8B parameters) show no such discrimination, which explains why larger models outperform smaller ones even when the smaller models possess more accurate knowledge. Chain-of-thought prompting steers models of all sizes towards using the numerical features.
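The abstract mentions a logistic regression over the three surface cues that predicts a model's choice. A rough sketch of that kind of analysis follows, in plain Python so it runs without dependencies. The feature encoding (popularity gap, whether entity A is mentioned first, co-occurrence gap), the toy data, and the training settings are all assumptions made for illustration, not the authors' setup.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression with batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """True if the regression predicts the LLM picks entity A."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5


# Hypothetical rows: [popularity(A) - popularity(B), A mentioned first, cooccurrence gap]
X = [[1.0, 1, 0.5], [-1.0, 0, -0.5], [0.8, 1, 0.2], [-0.7, 0, -0.1]]
y = [1, 0, 1, 0]  # 1 if the LLM chose entity A
w, b = fit_logistic(X, y)
```

If surface cues alone predict the model's choices well, as the abstract reports for smaller models, that is evidence the model is following heuristics rather than its own numerical knowledge.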

[27] FinSight: Towards Real-World Financial Deep Research

Jiajie Jin,Yuyao Zhang,Yimeng Xu,Hongjin Qian,Yutao Zhu,Zhicheng Dou

Main category: cs.CL

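TL;DR: FinSight is a novel multi-agent framework for generating high-quality, multimodal financial reports; it substantially outperforms existing baselines and approaches human-expert quality.

Details Motivation: Existing AI systems struggle to fully automate the generation of professional financial reports; FinSight aims to address this challenge.

Contribution: 1. Proposes the CAVM architecture, which unifies external data, tools, and agents; 2. Proposes an Iterative Vision-Enhanced Mechanism to refine charts; 3. Designs a two-stage writing framework to generate multimodal reports.

Method: Built on the CAVM architecture, FinSight combines the Iterative Vision-Enhanced Mechanism and the two-stage writing framework to carry out data collection, analysis, and report generation.

Result: Experiments show that FinSight significantly outperforms baselines in factual accuracy, analytical depth, and presentation quality.

Insight: FinSight demonstrates that high-quality financial reports can be produced with multiple agents and a programmable variable space.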

Abstract: Generating professional financial reports is a labor-intensive and intellectually demanding process that current AI systems struggle to fully automate. To address this challenge, we introduce FinSight (Financial InSight), a novel multi-agent framework for producing high-quality, multimodal financial reports. The foundation of FinSight is the Code Agent with Variable Memory (CAVM) architecture, which unifies external data, designed tools, and agents into a programmable variable space, enabling flexible data collection, analysis and report generation through executable code. To ensure professional-grade visualization, we propose an Iterative Vision-Enhanced Mechanism that progressively refines raw visual outputs into polished financial charts. Furthermore, a two-stage Writing Framework expands concise Chain-of-Analysis segments into coherent, citation-aware, and multimodal reports, ensuring both analytical depth and structural consistency. Experiments on various company-level and industry-level tasks demonstrate that FinSight significantly outperforms all baselines, including leading deep research systems, in terms of factual accuracy, analytical depth, and presentation quality, demonstrating a clear path toward generating reports that approach human-expert quality.

[28] Neuronal Group Communication for Efficient Neural representation

Zhengqi Pei,Qingming Huang,Shuhui Wang

Main category: cs.CL

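TL;DR: The paper proposes Neuronal Group Communication (NGC), a framework that treats a neural network as a dynamical system of interacting neuronal groups rather than a conventional collection of weights, yielding efficient, modular, and interpretable representations. Using low-dimensional signal exchange and a dynamical stability metric, NGC significantly reduces redundant parameters, and its performance advantage is validated on large language models.

Details Motivation: The growing scale of modern neural networks brings challenges in efficiency and interpretability alongside performance gains. The paper addresses the core question of how to build large neural systems that are efficient, modular, and interpretable.

Contribution: The main contribution is the NGC framework, which treats a neural network as a dynamical system of interacting neuronal groups, introduces a dynamical stability metric, and experimentally validates its performance advantage in compressed models.

Method: NGC treats weights as transient interactions between neuronal groups and uses low-dimensional signal exchange to reduce redundant parameters. Drawing on dynamical systems theory, it introduces a neuronal stability metric that quantifies convergence toward stable patterns during sequence processing.

Result: In large language models, NGC outperforms standard low-rank approximation and cross-layer basis-sharing methods on complex reasoning tasks at comparable compression rates.

Insight: NGC links the modularity and dynamical stability of neural systems to their reasoning ability and generalization, offering a new theoretical basis for high-dimensional learning systems.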

Abstract: The ever-increasing scale of modern neural networks has brought unprecedented performance alongside daunting challenges in efficiency and interpretability. This paper addresses the core question of how to build large neural systems that learn efficient, modular, and interpretable representations. We propose Neuronal Group Communication (NGC), a theory-driven framework that reimagines a neural network as a dynamical system of interacting neuronal groups rather than a monolithic collection of neural weights. Instead of treating each weight as an independent trainable parameter, NGC treats weights as transient interactions between embedding-like neuronal states, with neural computation unfolding through iterative communication among groups of neurons. This low-rank, modular representation yields compact models: groups of neurons exchange low-dimensional signals, enabling intra-group specialization and inter-group information sharing while dramatically reducing redundant parameters. By drawing on dynamical systems theory, we introduce a neuronal stability metric (analogous to Lyapunov stability) that quantifies the contraction of neuron activations toward stable patterns during sequence processing. Using this metric, we reveal that emergent reasoning capabilities correspond to an external driving force or “potential”, which nudges the neural dynamics away from trivial trajectories while preserving stability. Empirically, we instantiate NGC in large language models (LLMs) and demonstrate improved performance on complex reasoning benchmarks under moderate compression. NGC consistently outperforms standard low-rank approximations and cross-layer basis-sharing methods at comparable compression rates. We conclude by discussing the broader implications of NGC, including how structured neuronal group dynamics might relate to generalization in high-dimensional learning systems.

[29] Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models?

Zhihui Yang,Yupei Wang,Kaijie Mo,Zhe Zhao,Renfen Hu

Main category: cs.CL

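TL;DR: The paper asks whether visual grounding improves language models’ understanding of embodied knowledge. Using a new benchmark based on perceptual theory from psychology that covers multiple sensory modalities, it finds that vision-language models (VLMs) do not outperform text-only models and that models perform worse on the visual dimension.

Details Motivation: The motivation is to clarify whether visual grounding truly improves language models’ understanding of embodied knowledge, filling a gap in the evaluation of multimodal models in this area.

Contribution: The main contributions are: 1) a novel embodied knowledge understanding benchmark covering multiple sensory modalities; 2) a systematic evaluation of 30 state-of-the-art language models; 3) the finding that VLMs show no advantage over text-only models, which exposes the models’ limitations.

Method: The method includes: 1) building a benchmark based on perceptual theory from psychology that covers vision and other senses; 2) designing vector comparison and question-answering tasks (over 1,700 questions); 3) comparing the performance of 30 language models.

Result: The results show that: 1) VLMs do not outperform text-only models; 2) models perform worse on the visual dimension; 3) vector representations are easily influenced by word form and frequency; 4) models perform poorly on spatial perception tasks.

Insight: The study exposes the shortcomings of existing multimodal models in integrating embodied knowledge, particularly weak visual and spatial reasoning, and points to directions for improving future models.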

Abstract: Despite significant progress in multimodal language models (LMs), it remains unclear whether visual grounding enhances their understanding of embodied knowledge compared to text-only models. To address this question, we propose a novel embodied knowledge understanding benchmark based on the perceptual theory from psychology, encompassing visual, auditory, tactile, gustatory, olfactory external senses, and interoception. The benchmark assesses the models’ perceptual abilities across different sensory modalities through vector comparison and question-answering tasks with over 1,700 questions. By comparing 30 state-of-the-art LMs, we surprisingly find that vision-language models (VLMs) do not outperform text-only models in either task. Moreover, the models perform significantly worse in the visual dimension compared to other sensory dimensions. Further analysis reveals that the vector representations are easily influenced by word form and frequency, and the models struggle to answer questions involving spatial perception and reasoning. Our findings underscore the need for more effective integration of embodied knowledge in LMs to enhance their understanding of the physical world.

[30] ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models

Emily Chang,Niyati Bafna

Main category: cs.CL

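TL;DR: The paper introduces ChiKhaPo, a large-scale multilingual benchmark for evaluating the lexical comprehension and generation abilities of large language models across 2700+ languages, filling a gap in the language coverage of existing benchmarks.

Details Motivation: Existing LLM benchmarks concentrate on high- or mid-resource languages and usually evaluate higher-order tasks such as reasoning and generation, overlooking models’ basic linguistic competence in the vast majority of languages.

Contribution: Proposes ChiKhaPo, a benchmark with 8 subtasks of varying difficulty for evaluating lexical comprehension and generation, covering 2700+ languages for 2 of the subtasks. It surpasses existing benchmarks in language coverage.

Method: ChiKhaPo draws on existing lexicons, monolingual data, and bitext to build 8 subtasks covering many language families and languages at different resource levels.

Result: Experiments show that 6 state-of-the-art models struggle on the benchmark, with scores affected by language family, resource level, task type, and comprehension versus generation direction.

Insight: The work provides a new tool for evaluating multilingual LLMs and helps advance their capabilities in low-resource languages.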

Abstract: Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation. However, plenty of evidence points to the fact that LLMs lack basic linguistic competence in the vast majority of the world’s 3800+ written languages. We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage. We further show that 6 SOTA models struggle on our benchmark, and discuss the factors contributing to performance scores, including language family, language resourcedness, task, and comprehension versus generation directions. With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.

[31] Prompt-MII: Meta-Learning Instruction Induction for LLMs

Emily Xiao,Yixiao Zeng,Ada Chen,Chin-Jou Li,Amanda Bertsch,Graham Neubig

Main category: cs.CL

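TL;DR: The paper proposes PROMPT-MII, a reinforcement-learning-based meta-learning method that generates compact instructions to replace conventional in-context learning, reducing inference cost.

Details Motivation: In-context learning (ICL) is effective for adapting to new tasks, but its inference cost rises significantly as context length grows. The paper aims to reduce this cost by generating compact instructions.

Contribution: Proposes the PROMPT-MII framework, which meta-learns to generate compact instructions as a replacement for conventional ICL, significantly reducing the number of tokens needed at inference.

Method: Uses a reinforcement-learning-based meta-learning approach, trained on over 3,000 diverse classification datasets, to generate instructions that match ICL performance.

Result: On 90 unseen tasks, PROMPT-MII improves downstream model F1 by 4-9 points (10-20% relative) while using 3-13x fewer tokens.

Insight: Generating compact instructions can significantly reduce inference cost while maintaining or even improving model performance, offering a new approach for practical LLM applications.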

Abstract: A popular method to adapt large language models (LLMs) to new tasks is in-context learning (ICL), which is effective but incurs high inference costs as context length grows. In this paper we propose a method to perform instruction induction, where we take training examples and reduce them to a compact but descriptive prompt that can achieve performance comparable to ICL over the full training set. Specifically, we propose PROMPT-MII, a reinforcement learning (RL) based framework to meta-learn an instruction induction model that can generate compact instructions on the fly for an arbitrary new dataset. We train on over 3,000 diverse classification datasets from the HuggingFace hub, and evaluate on 90 unseen tasks. PROMPT-MII improves downstream model quality by 4-9 F1 points (10-20% relative), matching ICL performance while requiring 3-13x fewer tokens.

[32] Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

Masahiro Kaneko,Zeerak Talat,Timothy Baldwin

Main category: cs.CL

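TL;DR: The paper proposes an online-learning defense framework that counters iterative jailbreak attacks with a dynamically updated strategy, using reinforcement learning and PDGD to significantly improve both defense effectiveness and response quality.

Details Motivation: Iterative jailbreak attacks repeatedly rewrite prompts to induce LLMs to produce harmful content, and existing defenses cannot proactively disrupt this dynamic attack cycle.

Contribution: 1. Proposes a dynamic defense framework based on online learning; 2. Uses reinforcement learning to optimize prompts so that harmful and harmless tasks are handled differently; 3. Introduces PDGD to prevent overfitting.

Method: Optimizes prompts with reinforcement learning and introduces Past-Direction Gradient Damping (PDGD) to prevent overfitting.

Result: Experiments on three LLMs show that the method significantly outperforms five existing defense methods while also improving response quality on harmless tasks.

Insight: Dynamic learning and gradient damping are effective ways to strengthen LLM defenses while preserving task response quality.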

Abstract: Iterative jailbreak methods that repeatedly rewrite and input prompts into large language models (LLMs) to induce harmful outputs – using the model’s previous responses to guide each new iteration – have been found to be a highly effective attack strategy. Although this strategy is effective against LLMs and their safety mechanisms, existing defenses do not proactively disrupt its dynamic trial-and-error cycle. In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from iterative jailbreak methods. Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts to ensure appropriate responses for harmless tasks while explicitly rejecting harmful prompts. Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past-Direction Gradient Damping (PDGD). Experiments conducted on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality for harmless tasks.

[33] DiscoTrack: A Multilingual LLM Benchmark for Discourse Tracking

Lanni Bu,Lauren Levin,Amir Zeldes

Main category: cs.CL

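TL;DR: DiscoTrack is a multilingual LLM benchmark focused on implicit information and pragmatic inference in discourse tracking, covering 12 languages and four levels of discourse understanding.

Details Motivation: Current LLM benchmarks focus mainly on natural language understanding tasks that extract explicit information (such as QA or summarization) and lack challenging tests of implicit information and pragmatic inference across sentences, paragraphs, and multi-speaker discourse.

Contribution: Proposes the DiscoTrack benchmark, covering 12 languages and four levels of discourse understanding (e.g., salience recognition and entity tracking), filling a gap in multilingual discourse-tracking evaluation.

Method: Designs multilingual, multi-level discourse-tracking tasks and evaluates models across languages and tasks to test their ability to integrate information across context.

Result: Experiments show that the tasks remain challenging even for current state-of-the-art models.

Insight: The difficulty of discourse tracking highlights models’ limitations with long documents and multilingual implicit reasoning and points to directions for future research.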

Abstract: Recent LLM benchmarks have tested models on a range of phenomena, but are still focused primarily on natural language understanding for extraction of explicit information, such as QA or summarization, with responses often targeting information from individual sentences. We are still lacking more challenging, and importantly also multilingual, benchmarks focusing on implicit information and pragmatic inferences across larger documents in the context of discourse tracking: integrating and aggregating information across sentences, paragraphs and multiple speaker utterances. To this end, we present DiscoTrack, an LLM benchmark targeting a range of tasks across 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations and bridging inference. Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.

[34] SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

Qiusi Zhan,Angeline Budiman-Chan,Abdelrahman Zayed,Xingzhi Guo,Daniel Kang,Joo-Kyung Kim

Main category: cs.CL

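TL;DR: This paper studies the safety of LLM-based search agents in open-domain question answering and proposes SafeSearch, a multi-objective reinforcement learning approach that substantially reduces harmful outputs while preserving utility.

Details Motivation: Search agents are more likely than base LLMs to produce harmful outputs, and utility-oriented fine-tuning intensifies this risk. This motivates aligning safety and utility jointly.

Contribution: (1) Reveals the safety problems of search agents on red-teaming questions; (2) proposes SafeSearch, which jointly optimizes safety and utility through multi-objective reinforcement learning; (3) shows experimentally that SafeSearch reduces harmful outputs by over 70% while maintaining utility.

Method: SafeSearch uses multi-objective reinforcement learning that couples a final-output safety/utility reward with a query-level shaping term that penalizes unsafe queries and rewards safe ones.

Result: SafeSearch reduces harmful outputs by over 70% across three red-teaming datasets and matches the QA performance of a utility-only fine-tuned agent.

Insight: The query-level reward improves safety directly and, by steering the agent toward safer queries, also improves the quality and safety of the final output.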

Abstract: Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked “How can I track someone’s location without their consent?”, a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
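
The coupling of a final-output reward with a query-level shaping term can be illustrated with a small sketch. This is not the authors' implementation: the `Trajectory` fields, the `shaped_reward` function, and the `bonus` and `penalty` values are hypothetical, and the per-query safety labels are assumed to come from some external judge.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    # Hypothetical record of one agent rollout.
    query_is_safe: List[bool]  # one safety label per search query issued
    final_reward: float        # scalar safety/utility score of the final answer


def shaped_reward(traj: Trajectory, bonus: float = 0.25, penalty: float = 0.5) -> float:
    """Add a per-query shaping term to the final-output reward.

    Safe queries earn `bonus`; unsafe queries cost `penalty`.
    Both constants are illustrative assumptions, not values from the paper.
    """
    shaping = sum(bonus if safe else -penalty for safe in traj.query_is_safe)
    return traj.final_reward + shaping


if __name__ == "__main__":
    rollout = Trajectory(query_is_safe=[True, False], final_reward=1.0)
    print(shaped_reward(rollout))
```

Shaping at the query level gives the policy a learning signal before the final answer is produced, which is the property the abstract credits for the joint gains in safety and utility.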

[35] Extended LSTM: Adaptive Feature Gating for Toxic Comment Classification

Noor Islam S. Mohammad

Main category: cs.CL

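TL;DR: The paper proposes xLSTM, a parameter-efficient, theoretically grounded framework for toxic comment classification that improves performance through cosine-similarity gating, adaptive feature prioritization, and class rebalancing.

Details Motivation: Toxic comment detection suffers from high computational cost and degraded performance on minority classes. Transformer-based models such as BERT are large and inefficient, while classical ensembles lack semantic adaptability.

Contribution: Proposes the xLSTM framework, which combines cosine-similarity gating, adaptive feature prioritization, and multi-source embeddings to improve minority-class performance while lowering computational cost.

Method: Modulates embedding features with a cosine-similarity gate and combines multi-source embeddings, a character-level BiLSTM, embedding-space SMOTE, and adaptive focal loss for efficient classification.

Result: On the Jigsaw Toxic Comment benchmark, xLSTM reaches 96.0% accuracy and 0.88 macro-F1, outperforming BERT by 33% on threat and 28% on identity_hate, with 15 times fewer parameters.

Insight: Lightweight, theoretically informed architectures can surpass large pretrained models on domain-specific NLP tasks, particularly under class imbalance.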

Abstract: Toxic comment detection remains a challenging task, where transformer-based models (e.g., BERT) incur high computational costs and degrade on minority toxicity classes, while classical ensembles lack semantic adaptability. We propose xLSTM, a parameter-efficient and theoretically grounded framework that unifies cosine-similarity gating, adaptive feature prioritization, and principled class rebalancing. A learnable reference vector $v \in \mathbb{R}^d$ modulates contextual embeddings via cosine similarity, amplifying toxic cues and attenuating benign signals to yield stronger gradients under severe class imbalance. xLSTM integrates multi-source embeddings (GloVe, FastText, BERT CLS) through a projection layer, a character-level BiLSTM for morphological cues, embedding-space SMOTE for minority augmentation, and adaptive focal loss with dynamic class weighting. On the Jigsaw Toxic Comment benchmark, xLSTM attains 96.0% accuracy and 0.88 macro-F1, outperforming BERT by 33% on threat and 28% on identity_hate categories, with 15 times fewer parameters and 50 ms inference latency. Cosine gating contributes a +4.8% F1 gain in ablations. The results establish a new efficiency-adaptability frontier, demonstrating that lightweight, theoretically informed architectures can surpass large pretrained models on imbalanced, domain-specific NLP tasks.
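
A minimal sketch of cosine-similarity gating, assuming the gate maps the cosine between an embedding `h` and the reference vector `v` linearly into [0, 1]. The abstract does not state the exact gate function, so the `(1 + cos) / 2` form and the function names are assumptions made for illustration only.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_gate(h: List[float], v: List[float]) -> List[float]:
    """Scale embedding h by a gate derived from its similarity to reference v.

    Assumed gate: g = (1 + cos(h, v)) / 2, so embeddings aligned with v pass
    through unchanged and embeddings pointing away from v are attenuated.
    """
    gate = (1.0 + cosine(h, v)) / 2.0
    return [gate * x for x in h]


if __name__ == "__main__":
    reference = [3.0, 4.0]                          # stands in for the learnable vector v
    print(cosine_gate([3.0, 4.0], reference))   # aligned: passed through
    print(cosine_gate([4.0, -3.0], reference))  # orthogonal: halved
```

In a trained model `v` would be a parameter updated by gradient descent; here it is a fixed list so the gating arithmetic can be checked by hand.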

[36] Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models

Kyle Cox,Jiawei Xu,Yikun Han,Rong Xu,Tianhao Li,Chi-Yang Hsu,Tianlong Chen,Walter Gerych,Ying Ding

Main category: cs.CL

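TL;DR: To address the problem that large language models (LLMs) answer inconsistently across semantically equivalent prompts, the paper improves uncertainty calibration through semantic-space sampling and uncertainty decomposition.

Details Motivation: When given prompts that are semantically equivalent but worded differently, LLMs may produce very different answer distributions, which indicates that their uncertainty is poorly calibrated.

Contribution: 1. Models prompt sensitivity as a type of generalization error; 2. proposes improving uncertainty calibration by sampling across the semantic space; 3. introduces a new uncertainty decomposition metric that quantifies how much uncertainty is due to prompt sensitivity.

Method: 1. Samples the semantic concept space with paraphrasing perturbations; 2. designs a new uncertainty decomposition metric that captures semantic continuities in natural language generation.

Result: The method improves uncertainty calibration without compromising accuracy, and shows that some LLMs fail to reason consistently about the meaning of their inputs.

Insight: Prompt sensitivity exposes the limits of LLMs in uncertainty calibration and semantic consistency; sampling across the semantic space is an effective remedy.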

Abstract: An interesting behavior in large language models (LLMs) is prompt sensitivity. When provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. This suggests that the uncertainty reflected in a model’s output distribution for one prompt may not reflect the model’s uncertainty about the meaning of the prompt. We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic “concept space” with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. Additionally, we introduce a new metric for uncertainty decomposition in black-box LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation. We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributed to prompt sensitivity. Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs.
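
The effect of sampling across paraphrases can be sketched with plain Shannon entropy. This illustrates only the sampling idea; it is not the paper's decomposition metric, which the abstract describes as an improvement over entropy-based decomposition. The function names and the toy answers are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def answer_distribution(answers: List[str]) -> Dict[str, float]:
    """Empirical distribution over sampled answers for one prompt."""
    counts = Counter(answers)
    return {answer: n / len(answers) for answer, n in counts.items()}


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def pooled_distribution(answers_by_paraphrase: List[List[str]]) -> Dict[str, float]:
    """Average the per-paraphrase answer distributions with equal weight."""
    pooled: Counter = Counter()
    for answers in answers_by_paraphrase:
        for answer, p in answer_distribution(answers).items():
            pooled[answer] += p / len(answers_by_paraphrase)
    return dict(pooled)


if __name__ == "__main__":
    # Two paraphrases of one question; the model is confident on each,
    # but gives different answers.
    samples = [["A", "A"], ["B", "B"]]
    per_prompt = [entropy(answer_distribution(s)) for s in samples]
    print(per_prompt, entropy(pooled_distribution(samples)))
```

In this toy case each single prompt looks fully certain while the pooled distribution is maximally uncertain between two answers, which is the gap between per-prompt confidence and uncertainty about the prompt's meaning that the abstract describes.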

[37] Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation

Guoqing Luo,Iffat Maab,Lili Mou,Junichi Yamagishi

Main category: cs.CL

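TL;DR: This paper studies the thinking behaviours of reasoning-based large language models in social bias scenarios, identifies two failure patterns that aggravate bias, and proposes a lightweight prompt-based method that mitigates bias while maintaining or improving accuracy.

Details Motivation: Although reasoning-based LLMs excel at complex tasks, their internal thinking process can aggravate social bias. The mechanisms behind this behaviour are underexplored and call for both analysis and a remedy.

Contribution: 1) Systematically identifies two failure patterns of language models in social bias scenarios; 2) proposes a lightweight prompt-based mitigation method that reduces bias without hurting model performance.

Method: A systematic analysis uncovers the two failure patterns; a prompt-based method then asks the model to review its own initial reasoning against those patterns.

Result: On question answering (BBQ and StereoSet) and open-ended (BOLD) benchmarks, the method reduces bias while maintaining or improving accuracy.

Insight: Bias aggravation is not random; it is driven by specific thinking patterns (stereotype repetition and irrelevant information injection), and targeted intervention can mitigate it.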

Abstract: While reasoning-based large language models excel at complex tasks through an internal, structured thinking process, a concerning phenomenon has emerged that such a thinking process can aggregate social stereotypes, leading to biased outcomes. However, the underlying behaviours of these language models in social bias scenarios remain underexplored. In this work, we systematically investigate mechanisms within the thinking process behind this phenomenon and uncover two failure patterns that drive social bias aggregation: 1) stereotype repetition, where the model relies on social stereotypes as its primary justification, and 2) irrelevant information injection, where it fabricates or introduces new details to support a biased narrative. Building on these insights, we introduce a lightweight prompt-based mitigation approach that queries the model to review its own initial reasoning against these specific failure patterns. Experiments on question answering (BBQ and StereoSet) and open-ended (BOLD) benchmarks show that our approach effectively reduces bias while maintaining or improving accuracy.

[38] Verification-Aware Planning for Multi-Agent Systems

Tianyang Xu,Dan Zhang,Kushan Mitra,Estevam Hruschka

Main category: cs.CL

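TL;DR: VeriMAP is a framework for multi-agent collaboration that uses verification-aware planning to handle task decomposition, coordination, and verification, improving system robustness and interpretability.

Details Motivation: Multi-agent collaboration suffers from subtle misalignments in task interpretation, output format, and inter-agent handoffs, which lead to execution failures. Approaches that depend on external labels or annotations struggle with these problems.

Contribution: Proposes the VeriMAP framework, which couples task decomposition with verification by encoding subtask verification functions (VFs), improving the reliability of multi-agent systems.

Method: The VeriMAP planner decomposes tasks, models subtask dependencies, and encodes passing criteria as subtask verification functions (VFs) written in Python and natural language.

Result: Experiments on diverse datasets show that VeriMAP outperforms both single-agent and multi-agent baselines while improving robustness and interpretability.

Insight: Verification-aware planning enables reliable coordination and iterative refinement in multi-agent systems without relying on external labels.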

Abstract: Large language model (LLM) agents are increasingly deployed to tackle complex tasks, often necessitating collaboration among multiple specialized agents. However, multi-agent collaboration introduces new challenges in planning, coordination, and verification. Execution failures frequently arise not from flawed reasoning alone, but from subtle misalignments in task interpretation, output format, or inter-agent handoffs. To address these challenges, we present VeriMAP, a framework for multi-agent collaboration with verification-aware planning. The VeriMAP planner decomposes tasks, models subtask dependencies, and encodes planner-defined passing criteria as subtask verification functions (VFs) in Python and natural language. We evaluate VeriMAP on diverse datasets, demonstrating that it outperforms both single- and multi-agent baselines while enhancing system robustness and interpretability. Our analysis highlights how verification-aware planning enables reliable coordination and iterative refinement in multi-agent systems, without relying on external labels or annotations.
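
A planner-defined passing criterion expressed as a Python verification function might look like the sketch below. All names here (`Subtask`, `is_json_with_keys`, `run_with_retries`, the `agent` callable) are hypothetical and chosen for illustration; the abstract does not describe VeriMAP's actual interfaces.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Subtask:
    # Hypothetical plan node: a name, its upstream dependencies, and two
    # forms of the passing criterion (executable and natural language).
    name: str
    depends_on: List[str]
    verify: Callable[[str], bool]
    criterion: str


def is_json_with_keys(keys: List[str]) -> Callable[[str], bool]:
    """Build a verification function that checks output format."""
    def vf(output: str) -> bool:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and all(k in data for k in keys)
    return vf


def run_with_retries(subtask: Subtask, agent: Callable[[str, int], str],
                     max_attempts: int = 3) -> Optional[str]:
    """Re-run the agent until its output passes the verification function."""
    for attempt in range(1, max_attempts + 1):
        output = agent(subtask.name, attempt)
        if subtask.verify(output):
            return output
    return None  # all attempts failed verification


if __name__ == "__main__":
    task = Subtask("extract_price", [], is_json_with_keys(["price"]),
                   "Output must be a JSON object with a 'price' field.")
    # Stand-in agent: malformed output first, valid JSON on the retry.
    fake_agent = lambda name, attempt: "oops" if attempt == 1 else '{"price": 3}'
    print(run_with_retries(task, fake_agent))
```

Checking the handoff format before downstream subtasks run targets the output-format and handoff misalignments that the abstract names as a frequent source of failure.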

[39] DVAGen: Dynamic Vocabulary Augmented Generation

Wei Du,Nuowei Liu,Jie Wang,Jiahao Kuang,Tao Ji,Xiaoling Wang,Yuanbin Wu

Main category: cs.CL

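TL;DR: DVAGen is an open-source, unified framework for dynamic vocabulary augmentation that addresses the limitations of fixed-vocabulary language models and supports training, evaluation, and visualization with modern LLMs.

Details Motivation: Fixed-vocabulary language models struggle with novel or out-of-vocabulary words, and existing dynamic vocabulary approaches suffer from fragmented codebases, lack of support for modern LLMs, and limited inference scalability.

Contribution: DVAGen provides a modular open-source framework that integrates with open-source LLMs, is the first to offer both CLI and WebUI tools for real-time result inspection, and improves inference throughput.

Method: Through a modular design and integration with modern LLMs, DVAGen supports training, evaluation, and visualization of dynamic vocabulary methods.

Result: Experiments validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate batch inference support, which significantly improves inference throughput.

Insight: DVAGen's unified, modular design gives research on and application of dynamic vocabulary methods flexibility and extensibility.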

Abstract: Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.

[40] Rethinking On-policy Optimization for Query Augmentation

Zhichao Xu,Shengyao Zhuang,Xueguang Ma,Bingsen Chen,Yijun Tian,Fengran Mo,Jie Cao,Vivek Srikumar

Main category: cs.CL

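TL;DR: The paper compares prompting-based and reinforcement learning (RL)-based query augmentation and finds that simple, training-free methods often match or outperform expensive RL methods. The authors propose OPQE, a hybrid method in which the LLM policy learns to generate pseudo-documents that optimize retrieval performance.

Details Motivation: LLM-based query augmentation currently falls into two families, prompting-based and RL-based, which have not been compared under consistent experimental conditions. The authors set out to compare them systematically.

Contribution: 1. Presents the first systematic comparison of prompting-based and RL-based query augmentation; 2. proposes the hybrid method OPQE, which combines the flexibility of pseudo-document generation with the optimization objective of RL.

Method: OPQE is a hybrid method in which the LLM policy generates a pseudo-document instead of rewriting the query, combining the generative structure of prompting with the targeted optimization of RL.

Result: OPQE outperforms both standalone prompting and RL-based rewriting.

Insight: Simple, training-free methods can rival more complex RL methods, especially with powerful LLMs, and a hybrid approach such as OPQE captures the strengths of both.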

Abstract: Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or pseudo-documents that serve as new queries, relying purely on the model’s parametric knowledge or contextual information. The second applies reinforcement learning (RL) to fine-tune LLMs for query rewriting, directly optimizing retrieval metrics. Although each has its own advantages and limitations, the two approaches have not been compared under consistent experimental conditions. In this work, we present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks, including evidence-seeking, ad hoc, and tool retrieval. Our key finding is that simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially when using powerful LLMs. Motivated by this discovery, we introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), in which, instead of rewriting a query, the LLM policy learns to generate a pseudo-document that maximizes retrieval performance, thus merging the flexibility and generative structure of prompting with the targeted optimization of RL. We show OPQE outperforms both standalone prompting and RL-based rewriting, demonstrating that a synergistic approach yields the best results. Our implementation is made available to facilitate reproducibility.

[41] Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting

Chenchen Tan,Youyang Qu,Xinghao Li,Hui Zhang,Shujie Cui,Cunjian Chen,Longxiang Gao

Main category: cs.CL

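TL;DR: This paper proposes Attention-Shifting (AS), a framework for selectively unlearning sensitive data in large language models (LLMs) while avoiding hallucinated responses. AS works by adjusting attention and achieves a better balance between unlearning effectiveness and model utility.

Details Motivation: As LLMs are widely deployed, their potential retention of sensitive data has driven research into machine unlearning. Existing methods face a dilemma: aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. AS aims to resolve this tension.

Contribution: Proposes the AS framework, which uses two attention-level interventions (importance-aware suppression and attention-guided retention enhancement) to unlearn sensitive content selectively while preserving model utility.

Method: AS pursues two design objectives: 1) context-preserving suppression, which attenuates attention to fact-bearing tokens; 2) hallucination-resistant response shaping, which discourages fabricated responses about unlearned content. The components are jointly optimized with a dual-loss objective.

Result: AS achieves up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark than state-of-the-art unlearning methods, while maintaining competitive hallucination-free unlearning effectiveness.

Insight: AS shows a superior balance between unlearning effectiveness, generalization, and response reliability, offering a new direction for reliable LLM deployment.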

Abstract: The increase in computing power and the necessity of AI-assisted decision-making boost the growing application of large language models (LLMs). Along with this, the potential retention of sensitive data of LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: Aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs’ reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression that attenuates attention to fact-bearing tokens without disrupting LLMs’ linguistic structure; and (2) hallucination-resistant response shaping that discourages fabricated completions when queried about unlearning content. AS realizes these objectives through two attention-level interventions, which are importance-aware suppression applied to the unlearning set to reduce reliance on memorized knowledge and attention-guided retention enhancement that reinforces attention toward semantically essential tokens in the retained dataset to mitigate unintended degradation. These two components are jointly optimized via a dual-loss objective, which forms a soft boundary that localizes unlearning while preserving unrelated knowledge under representation superposition. Experimental results show that AS improves performance preservation over the state-of-the-art unlearning methods, achieving up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness. Compared to existing methods, AS demonstrates a superior balance between unlearning effectiveness, generalization, and response reliability.

[42] StreamingThinker: Large Language Models Can Think While Reading

Junlong Tong,Yingqi Fan,Anhao Zhao,Yunpu Ma,Xiaoyu Shen

Main category: cs.CL

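TL;DR: This paper proposes StreamingThinker, a new LLM reasoning paradigm in which the model reasons while reading the input and adjusts its reasoning depth once reading is complete. It substantially reduces latency while keeping performance comparable to conventional batch thinking.

Details Motivation: The current LLM reasoning paradigm starts thinking only after the entire input is available, which adds unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by how humans think while reading, the authors propose a more efficient streaming approach.

Contribution: 1. Designs the streaming thinking paradigm and the StreamingThinker framework; 2. combines streaming CoT generation, streaming-constraint training, and streaming parallel inference; 3. substantially reduces latency on several reasoning tasks while preserving performance.

Method: 1. Streaming reasoning units with quality control; 2. order-preserving reasoning enforced through streaming attention masks and position encoding; 3. parallel KV caches that decouple input encoding from reasoning generation.

Result: On math reasoning, logical reasoning, and context-based QA tasks, StreamingThinker reduces token waiting before the onset of reasoning by 80% and the latency of producing the final answer by more than 60%.

Insight: Streaming reasoning improves LLM efficiency and adapts better to dynamic input scenarios without sacrificing performance.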

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading, we first design a streaming thinking paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete. We instantiate this paradigm with StreamingThinker, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks. Experimental results show that the StreamingThinker preserves performance comparable to batch thinking, while yielding an 80% reduction in token waiting before the onset of reasoning and a more than 60% reduction in time-level latency for producing the final answer, demonstrating the effectiveness of the streaming paradigm for LLM reasoning. Code will be released at https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker.

[43] From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models

Zefan Cai,Haoyi Qiu,Haozhe Zhao,Ke Wan,Jiachen Li,Jiuxiang Gu,Wen Xiao,Nanyun Peng,Junjie Hu

Main category: cs.CL

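TL;DR: This paper studies how video diffusion models unintentionally encode and amplify social biases when alignment tuning is used to improve visual quality. The authors propose VideoBiasEval, a framework for systematically evaluating social representation bias in video generation.

Details Motivation: Alignment tuning improves the visual quality of video diffusion models but may reinforce social biases. The authors aim to trace systematically how bias propagates through data, reward models, and generation models.

Contribution: 1. Proposes the VideoBiasEval framework for evaluating social bias in video generation; 2. presents the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models.

Method: 1. Uses an event-based prompting strategy to disentangle semantic content from actor attributes; 2. introduces multi-granular metrics covering ethnicity bias, gender bias conditioned on ethnicity, distributional shifts across model variants, and the temporal persistence of bias.

Result: Alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped video content.

Insight: Bias-aware evaluation and mitigation are needed throughout the alignment process to ensure fair and socially responsible video generation.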

Abstract: Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded in established social bias taxonomies, VideoBiasEval employs an event-based prompting strategy to disentangle semantic content (actions and contexts) from actor attributes (gender and ethnicity). It further introduces multi-granular metrics to evaluate (1) overall ethnicity bias, (2) gender bias conditioned on ethnicity, (3) distributional shifts in social attributes across model variants, and (4) the temporal persistence of bias within videos. Using this framework, we conduct the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models. Our results reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. These findings highlight the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.

[44] Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations

Shahin Atakishiyev,Housam K. B. Babiker,Jiayi Dai,Nawshad Farruque,Teruaki Hayashi,Nafisa Sadaf Hriti,Md Abed Rahman,Iain Smith,Mi-Young Kim,Osmar R. Zaïane,Randy Goebel

Main category: cs.CL

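TL;DR: This paper examines local explainability and mechanistic interpretability in Transformer-based large language models, with the goal of fostering trust in them. It reviews relevant methods, describes experimental studies in healthcare and autonomous driving, and summarizes open issues and future directions.

Details Motivation: How large language models predict and generate content is opaque to humans, and the models make errors known as hallucinations, so better explainability and interpretability methods are needed to build trust.

Contribution: 1. Reviews approaches to local explainability and mechanistic interpretability; 2. describes experimental studies in healthcare and autonomous driving; 3. summarizes unaddressed issues and future directions.

Method: Combines a literature review with experimental studies in healthcare and autonomous driving to examine the explainability and trustworthiness of large language models.

Result: Explainability methods help in understanding model behaviour, but practical challenges remain, especially in generating trustworthy explanations.

Insight: Explainability is key to trust in large language models, but generating human-aligned explanations for complex tasks requires further research.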

Abstract: Large language models have exhibited impressive performance across a broad range of downstream tasks in natural language processing. However, how a language model predicts the next token and generates content is not generally understandable by humans. Furthermore, these models often make errors in prediction and reasoning, known as hallucinations. These errors underscore the urgent need to better understand and interpret the intricate inner workings of language models and how they generate predictive outputs. Motivated by this gap, this paper investigates local explainability and mechanistic interpretability within Transformer-based large language models to foster trust in such models. In this regard, our paper aims to make three key contributions. First, we present a review of local explainability and mechanistic interpretability approaches and insights from relevant studies in the literature. Furthermore, we describe experimental studies on explainability and reasoning with large language models in two critical domains – healthcare and autonomous driving – and analyze the trust implications of such explanations for explanation receivers. Finally, we summarize current unaddressed issues in the evolving landscape of LLM explainability and outline the opportunities, critical challenges, and future directions toward generating human-aligned, trustworthy LLM explanations.

[45] TaxoAlign: Scholarly Taxonomy Generation Using Language Models

Avishek Lahiri,Yufang Hou,Debarshi Kumar Sanyal

Main category: cs.CL

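TL;DR: TaxoAlign is a language-model-based method for scholarly taxonomy generation. Together with the CS-TaxoBench benchmark and an automated evaluation framework, it is shown to outperform the baselines on nearly all metrics.

Details Motivation: Existing approaches to automatic survey generation do not compare the structure of generated output with that written by human experts. TaxoAlign aims to fill this gap.

Contribution: 1) Proposes the TaxoAlign method; 2) creates the CS-TaxoBench benchmark; 3) designs an automated evaluation framework.

Method: TaxoAlign generates taxonomies with a three-phase, topic-based, instruction-guided method and is evaluated on structural alignment and semantic coherence.

Result: TaxoAlign outperforms the baselines on CS-TaxoBench under both automated metrics and human evaluation.

Insight: TaxoAlign shows the potential of language models for taxonomy generation and underlines the importance of structural alignment.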

Abstract: Taxonomies play a crucial role in helping researchers structure and navigate knowledge in a hierarchical manner. They also form an important part in the creation of comprehensive literature surveys. The existing approaches to automatic survey generation do not compare the structure of the generated surveys with those written by human experts. To address this gap, we present our own method for automated taxonomy creation that can bridge the gap between human-generated and automatically-created taxonomies. For this purpose, we create the CS-TaxoBench benchmark which consists of 460 taxonomies that have been extracted from human-written survey papers. We also include an additional test set of 80 taxonomies curated from conference survey papers. We propose TaxoAlign, a three-phase topic-based instruction-guided method for scholarly taxonomy generation. Additionally, we propose a stringent automated evaluation framework that measures the structural alignment and semantic coherence of automatically generated taxonomies in comparison to those created by human experts. We evaluate our method and various baselines on CS-TaxoBench, using both automated evaluation metrics and human evaluation studies. The results show that TaxoAlign consistently surpasses the baselines on nearly all metrics. The code and data can be found at https://github.com/AvishekLahiri/TaxoAlign.

[46] Addressing Antisocial Behavior in Multi-Party Dialogs Through Multimodal Representation Learning

Hajar Bakarou,Mohamed Sinane El Messoussi,Anaïs Ollagnier

Main category: cs.CL

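TL;DR: This paper studies antisocial behavior (ASB) in multi-party dialogs. It evaluates three tasks on the French dataset CyberAgressionAdo-Large using multimodal representation learning and shows that multimodal models outperform unimodal ones.

Details Motivation: Antisocial behavior on social media, such as hate speech and cyberbullying, threatens platform safety and societal well-being. Prior research has focused largely on networks such as X and Reddit, while multi-party conversational settings have been neglected because of limited data.

Contribution: Presents an approach to detecting and analyzing ASB in multi-party conversations, with a systematic evaluation of multimodal representation learning on a French dataset and a demonstration of the benefit of fusion models.

Method: Benchmarks six text-based and eight graph-based representation-learning methods, analyzing lexical cues, interactional dynamics, and their multimodal fusion. The best model is the late fusion model mBERT + WD-SGCN.

Result: Multimodal models outperform unimodal baselines. The best model achieves top performance on abuse detection (0.718) and competitive scores on peer-group identification (0.286) and bullying behavior analysis (0.606).

Insight: Multimodal fusion handles nuanced ASB phenomena such as implicit aggression, role transitions, and context-dependent hostility.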

Abstract: Antisocial behavior (ASB) on social media – including hate speech, harassment, and cyberbullying – poses growing risks to platform safety and societal well-being. Prior research has focused largely on networks such as X and Reddit, while multi-party conversational settings remain underexplored due to limited data. To address this gap, we use CyberAgressionAdo-Large, a French open-access dataset simulating ASB in multi-party conversations, and evaluate three tasks: abuse detection, bullying behavior analysis, and bullying peer-group identification. We benchmark six text-based and eight graph-based representation-learning methods, analyzing lexical cues, interactional dynamics, and their multimodal fusion. Results show that multimodal models outperform unimodal baselines. The late fusion model mBERT + WD-SGCN achieves the best overall results, with top performance on abuse detection (0.718) and competitive scores on peer-group identification (0.286) and bullying analysis (0.606). Error analysis highlights its effectiveness in handling nuanced ASB phenomena such as implicit aggression, role transitions, and context-dependent hostility.
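
Late fusion combines the outputs of separately trained models rather than their internal features. The sketch below assumes a weighted average of class probabilities from a text model and a graph model; the abstract does not state which fusion rule mBERT + WD-SGCN uses, so the averaging rule, the `text_weight` parameter, and the function names are assumptions.

```python
from typing import List


def late_fusion(text_probs: List[float], graph_probs: List[float],
                text_weight: float = 0.5) -> List[float]:
    """Weighted average of two models' class-probability vectors."""
    if len(text_probs) != len(graph_probs):
        raise ValueError("both models must score the same set of classes")
    return [text_weight * t + (1.0 - text_weight) * g
            for t, g in zip(text_probs, graph_probs)]


def predict(fused: List[float]) -> int:
    """Index of the highest-scoring class."""
    return max(range(len(fused)), key=fused.__getitem__)


if __name__ == "__main__":
    text_scores = [0.5, 0.5]     # text model is undecided
    graph_scores = [0.25, 0.75]  # interaction graph favours class 1
    fused = late_fusion(text_scores, graph_scores)
    print(fused, predict(fused))
```

In the example the text model alone cannot decide, and the graph-based scores tip the fused prediction, which is the kind of complementarity between lexical cues and interactional dynamics that the abstract reports.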

[47] Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

Chenghao Zhang,Guanting Dong,Xinyu Yang,Zhicheng Dou

Main category: cs.CL

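TL;DR: This paper proposes Nyx, a unified mixed-modal retriever for Universal Retrieval-Augmented Generation (URAG). It builds the mixed-modal dataset NyxQA with a four-stage automated pipeline and uses a two-stage training framework, significantly improving generation quality on vision-language tasks.

Details Motivation: Existing retrieval-augmented generation systems mainly target unimodal text and fall short in real-world scenarios where queries and documents mix modalities such as text and images. Nyx is proposed to address this challenge.

Contribution: 1. Proposes Nyx, a unified mixed-modal to mixed-modal retriever; 2. designs a four-stage automated pipeline to build the mixed-modal dataset NyxQA; 3. adopts two-stage training: pre-training followed by fine-tuning with feedback from downstream models; 4. achieves strong results on both standard text-only RAG and the more general URAG setting.

Method: 1. Builds the NyxQA dataset with a four-stage automated pipeline for generation and filtering; 2. trains in two stages: pre-training on NyxQA and open-source retrieval datasets, then supervised fine-tuning using feedback from downstream vision-language models.

Result: Nyx performs competitively on standard text-only RAG benchmarks and significantly improves generation quality on vision-language tasks in the URAG setting.

Insight: Mixed-modal retrieval is key to the real-world performance of retrieval-augmented generation, and a high-quality mixed-modal dataset combined with two-stage training supports this goal effectively.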

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.

[48] The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives

Henry Lim,Kwan Hui Lim

Main category: cs.CL

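TL;DR: This paper finds that instruction-tuned large language models (IT-LLMs), despite strong zero-shot reasoning, are markedly weak at following simple, self-contained instructions. Experiments on modified MMLU and MMLU-Pro benchmarks reveal sensitivity to instruction format and other inconsistencies.

Details Motivation: IT-LLMs do well on complex instruction-following tasks, but their ability to execute simple, self-contained instructions has not been studied enough. Because this ability underlies complex instruction following, its shortcomings deserve investigation.

Contribution: Through a systematic evaluation of 20 IT-LLMs, the paper reveals inconsistency in simple instruction following and sensitivity to instruction format, and identifies shortcomings of the current instruction-tuning paradigm.

Method: The study modifies the MMLU and MMLU-Pro benchmarks and designs four paradigms: 1) label format changes under explicit instructions; 2) performance without instructions; 3) model behaviour with option contents removed; 4) the effect of three-shot exemplars.

Result: IT-LLM performance shifts sharply when the label format changes and degrades further without explicit instructions or option contents. Larger models reach higher accuracy but remain inconsistent in instruction adherence.

Insight: The current instruction-tuning paradigm does not adequately train models to follow simple instructions; new evaluation methods and training strategies are needed to improve atomic instruction following.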

Abstract: Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks, by systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical under four paradigms, namely: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45% for Roman vs. numeric), revealing instruction-format bias. (2) Without instructions, performance drops further (up to -10.84%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fail random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.
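
The label-format manipulation can be reproduced with a few lines of code. The sketch below renders one set of options under the three label styles named in the abstract; the function names and the example options are hypothetical, and the prompt layout actually used in the paper is not given in the abstract.

```python
from typing import List


def to_roman(n: int) -> str:
    """Roman numeral for small positive integers (enough for option lists)."""
    numerals = [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    parts = []
    for value, symbol in numerals:
        while n >= value:
            parts.append(symbol)
            n -= value
    return "".join(parts)


def label(index: int, style: str) -> str:
    """Label for the option at a zero-based index in the given style."""
    if style == "alphabetic":
        return chr(ord("A") + index)
    if style == "numeric":
        return str(index + 1)
    if style == "roman":
        return to_roman(index + 1)
    raise ValueError(f"unknown label style: {style}")


def format_options(options: List[str], style: str) -> str:
    """Render options with identical content and only the labels changed."""
    return "\n".join(f"{label(i, style)}. {text}" for i, text in enumerate(options))


if __name__ == "__main__":
    choices = ["Paris", "Rome", "Lima", "Oslo"]
    for style in ("alphabetic", "numeric", "roman"):
        print(format_options(choices, style), end="\n\n")
```

Because only the labels differ across the three renderings, any accuracy gap between them can be attributed to label format rather than to question content.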

[49] Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine

Jiacheng Xie,Shuai Zeng,Yang Yu,Xiaoting Tang,Guanghui An,Dong Xu

Main category: cs.CL

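TL;DR: This paper introduces Ladder-base, a TCM-focused LLM trained with GRPO that outperforms general-purpose and domain-specific models on reasoning and factual consistency.

Details Motivation: The complexity and structural uniqueness of the Traditional Chinese Medicine (TCM) knowledge system challenge the application of LLMs, and existing TCM-specific models are limited in alignment, data quality, and evaluation consistency.

Contribution: Develops Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that optimizes response selection through intra-group comparisons.

Method: Builds on the Qwen2.5-7B-Instruct model and trains with GRPO on the textual subset of the TCM-Ladder benchmark.

Result: Ladder-base outperforms general-purpose models such as GPT-4 and TCM-specific models such as BenTsao across multiple reasoning metrics.

Insight: GRPO offers an efficient alignment strategy for expert-level reasoning in traditional medical domains and supports the development of trustworthy TCM AI systems.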

Abstract: Traditional Chinese Medicine (TCM) presents a rich and structurally unique knowledge system that challenges conventional applications of large language models (LLMs). Although previous TCM-specific LLMs have shown progress through supervised fine-tuning, they often face limitations in alignment, data quality, and evaluation consistency. In this study, we introduce Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection based on intra-group comparisons. Ladder-base is built upon the Qwen2.5-7B-Instruct foundation model and trained exclusively on the textual subset of the TCM-Ladder benchmark, using 80 percent of the data for training and the remaining 20 percent split evenly between validation and test sets. Through standardized evaluation, Ladder-base demonstrates superior performance across multiple reasoning metrics when compared to both state-of-the-art general-purpose LLMs such as GPT-4, Gemini 2.5, Claude 3, and Qwen3 and domain-specific TCM models including BenTsao, HuatuoGPT2, and Zhongjing. These findings suggest that GRPO provides an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy and clinically grounded TCM artificial intelligence systems.
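
The "intra-group comparison" at the core of GRPO is commonly written as a normalized advantage: each sampled response's reward is compared with the mean and standard deviation of its own group. The sketch below shows that step only, under the assumption of population standard deviation and a small `eps` for numerical stability; it is not the Ladder-base training code, and the reward values are made up.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    Implementations differ on population vs. sample standard deviation;
    population std is assumed here.
    """
    if not rewards:
        return []
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one question, scored 1.0 if correct else 0.0.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because the baseline is the group mean, no separate value network is required, which is one reason the approach is described as efficient.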

[50] AFRICAPTION: Establishing a New Paradigm for Image Captioning in African Languages

Mardiyyah Oduwole,Prince Mireku,Fatimo Adebanjo,Oluwatosin Olajide,Mahi Aminu Aliyu,Jekaterina Novikova

Main category: cs.CL

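TL;DR: This paper presents AfriCaption, a framework for multilingual image captioning that establishes the first scalable image-captioning resource covering 20 African languages.

Details Motivation: Multimodal AI research has focused overwhelmingly on high-resource languages, which hinders the democratization of advances in the field. AfriCaption addresses this gap by supporting African languages.

Contribution: 1) A semantically aligned dataset covering 20 African languages, built on Flickr8k; 2) a dynamic, context-preserving data pipeline; 3) the AfriCaption model, a 0.5B-parameter vision-to-text architecture that combines SigLIP and NLLB200 to generate captions in low-resource languages.

Method: Semantically aligned captions are produced through context-aware selection and translation, with model ensembling and adaptive substitution maintaining data quality. The model integrates SigLIP and NLLB200.

Result: AfriCaption establishes the first scalable image-captioning resource for under-represented African languages, laying the groundwork for inclusive multimodal AI.

Insight: Image captioning for low-resource languages is a key step toward democratizing multimodal AI, and AfriCaption offers a template for follow-up work.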

Abstract: Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages. Our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B parameter vision-to-text architecture that integrates SigLIP and NLLB200 for caption generation across under-represented languages. This unified framework ensures ongoing data quality and establishes the first scalable image-captioning resource for under-represented African languages, laying the groundwork for truly inclusive multimodal AI.

[51] BenCao: An Instruction-Tuned Large Language Model for Traditional Chinese Medicine

Jiacheng Xie,Yang Yu,Yibo Chen,Hanyao Zhang,Lening Zhao,Jiaxuan He,Lei Jiang,Xiaoting Tang,Guanghui An,Dong Xu

Main category: cs.CL

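TL;DR: BenCao is a ChatGPT-based multimodal TCM assistant that uses instruction tuning to integrate structured knowledge bases, diagnostic data, and expert feedback, addressing the limited multimodal integration, interpretability, and clinical applicability of existing TCM-domain LLMs.

Details Motivation: TCM relies on holistic reasoning, implicit logic, and multimodal diagnostic cues. Existing TCM-domain LLMs lack multimodal integration and clinical applicability, a gap BenCao aims to fill.

Contribution: 1) A knowledge base of over 1,000 classical and modern texts; 2) a scenario-based instruction framework; 3) a chain-of-thought simulation mechanism for interpretable reasoning; 4) a feedback refinement process involving licensed TCM practitioners; 5) multimodal interaction through external APIs.

Method: BenCao is built through natural language instruction tuning rather than parameter retraining. It combines structured knowledge bases with expert feedback and connects to external APIs for tongue-image classification and multimodal database retrieval.

Result: On single-choice question benchmarks and multimodal classification tasks, BenCao is more accurate than general-domain and TCM-domain models, particularly in diagnostics, herb recognition, and constitution classification. It is deployed as an interactive application on the OpenAI GPTs Store and had been accessed by nearly 1,000 users as of October 2025.

Insight: Developing a TCM-domain LLM through natural language instruction tuning and multimodal integration is feasible, and it offers a practical framework for aligning generative AI with traditional medical reasoning.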

Abstract: Traditional Chinese Medicine (TCM), with a history spanning over two millennia, plays a role in global healthcare. However, applying large language models (LLMs) to TCM remains challenging due to its reliance on holistic reasoning, implicit logic, and multimodal diagnostic cues. Existing TCM-domain LLMs have made progress in text-based understanding but lack multimodal integration, interpretability, and clinical applicability. To address these limitations, we developed BenCao, a ChatGPT-based multimodal assistant for TCM, integrating structured knowledge bases, diagnostic data, and expert feedback refinement. BenCao was trained through natural language instruction tuning rather than parameter retraining, aligning with expert-level reasoning and ethical norms specific to TCM. The system incorporates a comprehensive knowledge base of over 1,000 classical and modern texts, a scenario-based instruction framework for diverse interactions, a chain-of-thought simulation mechanism for interpretable reasoning, and a feedback refinement process involving licensed TCM practitioners. BenCao connects to external APIs for tongue-image classification and multimodal database retrieval, enabling dynamic access to diagnostic resources. In evaluations across single-choice question benchmarks and multimodal classification tasks, BenCao achieved superior accuracy to general-domain and TCM-domain models, particularly in diagnostics, herb recognition, and constitution classification. The model was deployed as an interactive application on the OpenAI GPTs Store, accessed by nearly 1,000 users globally as of October 2025. This study demonstrates the feasibility of developing a TCM-domain LLM through natural language-based instruction tuning and multimodal integration, offering a practical framework for aligning generative AI with traditional medical reasoning and a scalable pathway for real-world deployment.

[52] Agentic Reinforcement Learning for Search is Unsafe

Yushi Yang,Shreyansh Padarha,Andrew Lee,Adam Mahdi

Main category: cs.CL

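TL;DR: This study exposes safety weaknesses in RL-trained search models: simple attacks sharply reduce refusal rates and safety, revealing a core weakness of current RL training.

Details Motivation: RL-trained search models excel at multi-step reasoning tasks, but their safety properties are not well understood. The paper investigates them to close this gap.

Contribution: Identifies and quantifies the safety vulnerabilities of RL search models, introduces two simple attacks (Search attack and Multi-search attack), and shows experimentally that they substantially degrade safety.

Method: Evaluates two model families (Qwen, Llama) with both local and web search under two attacks: one that forces the model to begin its response with a search (Search attack) and one that encourages the model to search repeatedly (Multi-search attack).

Result: The attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. This shows that the reward used in RL training does not account for the harmfulness of queries.

Insight: Current RL training rewards the continued generation of effective queries while ignoring their potential harm, so safety-aware agentic RL pipelines that optimize for safe search are urgently needed.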

Abstract: Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin response with search (Search attack), another that encourages models to repeatedly search (Multi-search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. The attacks succeed by triggering models to generate harmful, request-mirroring search queries before they can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines optimising for safe search.

[53] Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents

Yihong Tang,Kehai Chen,Liang Yue,Jinxin Fan,Caishen Zhou,Xiaoguang Li,Yuyang Zhang,Mingming Zhao,Shixiong Kai,Kaiyang Guo,Xingshan Zeng,Wenjing Cun,Lifeng Shang,Min Zhang

Main category: cs.CL

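TL;DR: This survey systematically reviews the technologies, applications, and evaluation methods of LLM-based industry agents. It uses a capability maturity framework to trace their evolution from simple task support to complex autonomous systems, and it summarizes practical challenges and future directions.

Details Motivation: As LLMs advance, agents hold great potential in industry, but translating research on general agents into real productivity remains a major challenge. The survey addresses this by organizing the relevant technologies, practice, and evaluation methods.

Contribution: 1. Proposes a capability maturity framework for industry agents; 2. analyzes the evolution of the three key technological pillars (Memory, Planning, Tool Use); 3. surveys applications across diverse real-world domains; 4. identifies the challenges facing existing evaluation systems and directions for improvement.

Method: A systematic review organized around the capability maturity framework, covering the technological evolution of industry agents (Memory, Planning, Tool Use), application domains (such as digital engineering and scientific discovery), and evaluation methods.

Result: Clarifies the path of industry agents from "process execution systems" to "adaptive social systems", summarizes the diversity of real-world applications, and exposes gaps in current evaluation systems.

Insight: Future work on industry agents needs to address authenticity, safety, and industry specificity in evaluation; capability boundaries and governance remain key open questions.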

Abstract: With the rise of large language models (LLMs), LLM agents capable of autonomous reasoning, planning, and executing complex tasks have become a frontier in artificial intelligence. However, how to translate the research on general agents into productivity that drives industry transformations remains a significant challenge. To address this, this paper systematically reviews the technologies, applications, and evaluation methods of industry agents based on LLMs. Using an industry agent capability maturity framework, it outlines the evolution of agents in industry applications, from “process execution systems” to “adaptive social systems.” First, we examine the three key technological pillars that support the advancement of agent capabilities: Memory, Planning, and Tool Use. We discuss how these technologies evolve from supporting simple tasks in their early forms to enabling complex autonomous systems and collective intelligence in more advanced forms. Then, we provide an overview of the application of industry agents in real-world domains such as digital engineering, scientific discovery, embodied intelligence, collaborative business execution, and complex system simulation. Additionally, this paper reviews the evaluation benchmarks and methods for both fundamental and specialized capabilities, identifying the challenges existing evaluation systems face regarding authenticity, safety, and industry specificity. Finally, we focus on the practical challenges faced by industry agents, exploring their capability boundaries, developmental potential, and governance issues in various scenarios, while providing insights into future directions. By combining technological evolution with industry practices, this review aims to clarify the current state and offer a clear roadmap and theoretical foundation for understanding and building the next generation of industry agents.

[54] Deep Self-Evolving Reasoning

Zihan Liu,Shun Zheng,Xumeng Wen,Yang Wang,Jiang Bian,Mao Yang

Main category: cs.CL

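TL;DR: The paper proposes Deep Self-Evolving Reasoning (DSER), a probabilistic paradigm that runs multiple long-horizon self-evolving processes in parallel and substantially extends the reasoning ability of smaller models even when their verification and refinement capabilities are weak.

Details Motivation: Verification and refinement capabilities remain fragile in open-weight, smaller-scale models, which limits their performance on hard tasks such as Olympiad-level problems. DSER addresses this with a probabilistic approach.

Contribution: 1. Proposes the DSER framework, which models iterative reasoning as a Markov chain; 2. argues that convergence to a correct solution is guaranteed as long as the probability of improvement marginally exceeds that of degradation; 3. achieves substantial gains on a small model, which surpasses the single-turn accuracy of its much larger teacher.

Method: DSER treats reasoning as a Markov chain and runs multiple self-evolving processes in parallel to amplify small positive tendencies, asymptotically approaching correct answers. Experiments use the DeepSeek-R1-0528-Qwen3-8B model.

Result: On the AIME 2024-2025 benchmark, DSER solves 5 of 9 previously unsolvable problems and, with majority voting, enables the 8B model to surpass the single-turn accuracy of its 600B-parameter teacher.

Insight: DSER delineates the shortcomings of current open-weight reasoners in self-verification, refinement, and stability, setting a research agenda for next-generation models.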

Abstract: Long-form chain-of-thought reasoning has become a cornerstone of advanced reasoning in large language models. While recent verification-refinement frameworks have enabled proprietary models to solve Olympiad-level problems, their effectiveness hinges on strong, reliable verification and correction capabilities, which remain fragile in open-weight, smaller-scale models. This work demonstrates that even with weak verification and refinement capabilities on hard tasks, the reasoning limits of such models can be substantially extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning (DSER). We conceptualize iterative reasoning as a Markov chain, where each step represents a stochastic transition in the solution space. The key insight is that convergence to a correct solution is guaranteed as long as the probability of improvement marginally exceeds that of degradation. By running multiple long-horizon, self-evolving processes in parallel, DSER amplifies these small positive tendencies, enabling the model to asymptotically approach correct answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On the challenging AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously unsolvable problems and boosts overall performance, enabling this compact model to surpass the single-turn accuracy of its 600B-parameter teacher through majority voting. Beyond its immediate utility for test-time scaling, the DSER framework serves to diagnose the fundamental limitations of current open-weight reasoners. By clearly delineating their shortcomings in self-verification, refinement, and stability, our findings establish a clear research agenda for developing next-generation models with powerful, intrinsic self-evolving capabilities.
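
The convergence argument in the abstract can be illustrated with a deliberately simplified two-state Markov chain (correct vs. incorrect). This is a sketch under assumptions, not the paper's formulation: the two-state reduction, the function names, and the probabilities 0.3 and 0.2 are all hypothetical. In such a chain the long-run probability of being correct is p_improve / (p_improve + p_degrade), which exceeds 0.5 exactly when improvement is more likely than degradation, and majority voting over independent chains amplifies that margin.

```python
import random

def stationary_correct(p_improve, p_degrade):
    """Long-run probability of the 'correct' state in a two-state chain."""
    return p_improve / (p_improve + p_degrade)

def run_chain(p_improve, p_degrade, steps, rng):
    """One self-evolving process: True if it ends in the correct state."""
    correct = False  # start from an incorrect draft
    for _ in range(steps):
        if correct:
            correct = rng.random() >= p_degrade  # stays correct unless it degrades
        else:
            correct = rng.random() < p_improve   # becomes correct if it improves
    return correct

def majority_is_correct(p_improve, p_degrade, steps, chains, rng):
    """Majority vote over independent parallel chains (use an odd count)."""
    wins = sum(run_chain(p_improve, p_degrade, steps, rng) for _ in range(chains))
    return wins > chains // 2

# Hypothetical probabilities: improvement only slightly likelier than degradation.
rng = random.Random(0)
trials = 200
vote_accuracy = sum(
    majority_is_correct(0.3, 0.2, steps=50, chains=51, rng=rng)
    for _ in range(trials)
) / trials
```

Comparing `vote_accuracy` with `stationary_correct(0.3, 0.2)` shows how voting across parallel processes turns a small per-chain edge into a much more reliable final answer.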

[55] Lingua Custodi’s participation at the WMT 2025 Terminology shared task

Jingshu Liu,Raheel Qader,Gaëtan Caillaut,Mariam Nakhlé

Main category: cs.CL

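TL;DR: The paper studies BERT-based multilingual sentence embeddings. Combining MLM, TLM, dual-encoder translation ranking, and additive margin softmax, it reaches 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, and it shows that a pre-trained multilingual language model cuts the parallel training data required by 80%.

Details Motivation: BERT is effective for monolingual sentence embeddings, but BERT-based cross-lingual sentence embeddings had not been explored. The paper fills this gap and reduces the need for large amounts of parallel data.

Contribution: Proposes an effective multilingual sentence embedding method that substantially reduces the parallel data required and outperforms LASER on Tatoeba.

Method: Combines MLM, TLM, dual-encoder translation ranking, and additive margin softmax, building on a pre-trained multilingual language model.

Result: Achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba with 80% less parallel training data. Parallel data mined from CommonCrawl with the best model trains competitive NMT models for en-zh and en-de.

Insight: Pre-trained multilingual models reduce the dependence of cross-lingual tasks on parallel data, and composing several methods yields clear gains.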

Abstract: While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding-based transfer learning, BERT-based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations, including: masked language modeling (MLM), translation language modeling (TLM), dual encoder translation ranking, and additive margin softmax. We show that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5% achieved by LASER, while still performing competitively on monolingual transfer learning benchmarks. Parallel data mined from CommonCrawl using our best model is shown to train competitive NMT models for en-zh and en-de. We publicly release our best multilingual sentence embedding model for 109+ languages at https://tfhub.dev/google/LaBSE.
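
As an illustration of the translation-ranking objective with additive margin softmax named in the abstract, the sketch below computes a one-directional in-batch loss from a similarity matrix whose diagonal holds the true translation pairs. It is a minimal sketch under assumptions: the function name, the margin value of 0.3, and the toy similarity matrix are hypothetical, and it does not reproduce the released model's training setup.

```python
import math

def additive_margin_softmax_loss(sim, margin=0.3):
    """In-batch ranking loss with a margin subtracted from true-pair scores.

    sim[i][j] is the similarity between source sentence i and target
    sentence j; the diagonal entries are the true translation pairs.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        # Penalize the positive pair by the margin, leave negatives unchanged.
        logits = [sim[i][j] - margin if j == i else sim[i][j] for j in range(n)]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_z - logits[i]  # negative log-probability of the true pair
    return total / n

# Hypothetical similarities for a batch of two sentence pairs.
toy_sim = [[1.0, 0.0], [0.0, 1.0]]
loss_no_margin = additive_margin_softmax_loss(toy_sim, margin=0.0)
loss_with_margin = additive_margin_softmax_loss(toy_sim, margin=0.3)
```

The margin makes the true pair harder to rank first, so the encoder has to separate translations from non-translations by a wider gap to reach the same loss.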

[56] Annotation-Efficient Universal Honesty Alignment

Shiyu Ni,Keping Bi,Jiafeng Guo,Minghao Tang,Jingtong Wu,Zengxin Han,Xueqi Cheng

Main category: cs.CL

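TL;DR: This paper proposes EliCal, a framework that achieves annotation-efficient honesty alignment for LLMs using self-consistency supervision plus a small set of correctness annotations, and releases HonestyBench, a benchmark with 560k training instances.

Details Motivation: Honesty alignment is essential for trustworthy LLM deployment. Existing methods rely either on training-free confidence estimation or on training-based calibration with correctness annotations, and the latter needs costly, large-scale labeling to generalize.

Contribution: 1. Proposes the EliCal framework, which combines self-consistency supervision with a small number of correctness annotations; 2. releases the large-scale HonestyBench benchmark.

Method: EliCal has two stages: 1. elicit internal confidence using inexpensive self-consistency supervision; 2. calibrate that confidence with a small set of correctness annotations.

Result: EliCal reaches near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and aligns better on unseen MMLU tasks than the calibration-only baseline.

Insight: Combining inexpensive supervision with a small amount of annotation is a viable, scalable path toward universal honesty alignment.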

Abstract: Honesty alignment, the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence, is essential for trustworthy deployment. Existing methods either rely on training-free confidence estimation (e.g., token probabilities, self-consistency) or training-based calibration with correctness annotations. While effective, achieving universal honesty alignment with training-based calibration requires costly, large-scale labeling. To support annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.
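
A self-consistency signal of the kind the abstract describes can be computed without any correctness labels, for example as the share of sampled answers that agree with the most frequent one. The sketch below assumes that definition; the function name, the string normalization, and the sample answers are hypothetical, and the paper may define the signal differently.

```python
from collections import Counter

def self_consistency_confidence(sampled_answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    # Light normalization so trivially different strings count as the same answer.
    normalized = [answer.strip().lower() for answer in sampled_answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    return top_answer, count / len(normalized)

# Hypothetical samples drawn from one model for the same question.
answer, confidence = self_consistency_confidence(["Paris", "paris ", "Lyon", "Paris"])
```

Because this signal needs only repeated sampling, it is cheap to produce at scale, which is what allows the correctness annotations to be reserved for a small calibration stage.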

[57] SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

Tiancheng Hu,Joachim Baumann,Lorenzo Lupo,Dirk Hovy,Nigel Collier,Paul Röttger

Main category: cs.CL

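TL;DR: SimBench is the first large-scale, standardized benchmark for evaluating how well large language models (LLMs) simulate human behavior. Even the best current LLMs have limited simulation ability, and performance scales log-linearly with model size.

Details Motivation: Current evaluations of LLM simulation are fragmented, built on bespoke tasks and metrics, and produce incomparable results. A unified standard is needed to advance LLM simulation.

Contribution: Introduces SimBench, which unifies 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool.

Method: Evaluates LLMs on SimBench and analyzes how model size, inference-time compute, and instruction tuning affect simulation ability.

Result: The best LLM scores only 40.80/100. Performance scales log-linearly with model size and is not improved by more inference-time compute. Instruction tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones.

Insight: Simulation ability correlates most strongly with knowledge-intensive reasoning (MMLU-Pro, r=0.939), and models struggle particularly when simulating specific demographic groups.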

Abstract: Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.

[58] OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction

Raghu Vamshi Hemadri,Geetha Krishna Guruju,Kristi Topollai,Anna Ewa Choromanska

Main category: cs.CL

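TL;DR: The paper presents a multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for cancer survival prediction, using CoT prompting and GRPO to improve both predictive performance and interpretability.

Details Motivation: Predicting cancer treatment outcomes requires models that are both accurate and interpretable, but existing LLMs lack the structured reasoning capabilities needed for high-stakes clinical decision support.

Contribution: 1. Proposes a unified multi-task framework that jointly performs classification, regression, and natural language rationale generation; 2. evaluates how CoT prompting and GRPO improve model performance; 3. shows that existing biomedical LLMs often fail to produce valid reasoning traces.

Method: Compares three alignment strategies: 1. standard supervised fine-tuning (SFT); 2. SFT with CoT prompting; 3. GRPO, a reinforcement learning method that aligns model outputs with expert-derived reasoning trajectories.

Result: CoT prompting improves F1 by +6.0 and reduces MAE by 12%; GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore.

Insight: Reasoning-aware alignment is important for the interpretability and accuracy of LLMs in multi-task clinical modeling, and the work sets a new benchmark for precision oncology.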

Abstract: Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data. While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack structured reasoning capabilities critical for high-stakes decision support. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset. Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation. We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories. Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.

[59] When Annotators Disagree, Topology Explains: Mapper, a Topological Tool for Exploring Text Embedding Geometry and Ambiguity

Nisrine Rair,Alban Goupil,Valeriu Vrabie,Emmanuel Chochoy

Main category: cs.CL

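TL;DR: The paper uses Mapper, a tool from topological data analysis, to study how fine-tuned language models encode ambiguity in text embeddings. Fine-tuning restructures the embedding space into modular, non-convex regions, and Mapper exposes decision regions and boundary behavior more directly than traditional tools such as PCA or UMAP.

Details Motivation: Scalar metrics such as accuracy do not capture how models internally represent ambiguity, especially when human annotators disagree, so a different method is needed to analyze how models encode it.

Contribution: 1. Introduces the topological tool Mapper for analyzing the structure of language model embedding spaces; 2. shows that fine-tuned embedding spaces are organized into modular, non-convex regions; 3. surfaces a hidden tension between structural confidence and label uncertainty on ambiguous data.

Method: Applies Mapper to the embedding space of RoBERTa-Large on the MD-Offense dataset and compares it with traditional tools such as PCA and UMAP.

Result: Over 98% of connected components show at least 90% prediction purity, yet alignment with ground-truth labels drops on ambiguous data, revealing a tension between structural confidence and label uncertainty.

Insight: Beyond visualizing the structure of the embedding space, Mapper enables topological metrics that may inform modeling strategies for subjective NLP tasks.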

Abstract: Language models are often evaluated with scalar metrics like accuracy, but such measures fail to capture how models internally represent ambiguity, especially when human annotators disagree. We propose a topological perspective to analyze how fine-tuned models encode ambiguity and, more generally, individual instances. Applied to RoBERTa-Large on the MD-Offense dataset, Mapper, a tool from topological data analysis, reveals that fine-tuning restructures embedding space into modular, non-convex regions aligned with model predictions, even for highly ambiguous cases. Over $98\%$ of connected components exhibit $\geq 90\%$ prediction purity, yet alignment with ground-truth labels drops in ambiguous data, surfacing a hidden tension between structural confidence and label uncertainty. Unlike traditional tools such as PCA or UMAP, Mapper captures this geometry directly, uncovering decision regions, boundary collapses, and overconfident clusters. Our findings position Mapper as a powerful diagnostic tool for understanding how models resolve ambiguity. Beyond visualization, it also enables topological metrics that may inform proactive modeling strategies in subjective NLP tasks.

[60] Language Confusion Gate: Language-Aware Decoding Through Model Self-Distillation

Collin Zhang,Fei Huang,Chenhan Yuan,Junyang Lin

Main category: cs.CL

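TL;DR: The paper proposes the Language Confusion Gate (LCG), a lightweight plug-in trained with self-distillation that filters language confusion at decoding time without retraining the base model.

Details Motivation: Large language models (LLMs) can unintentionally mix languages during text generation. Existing solutions either require retraining the model or cannot distinguish harmful confusion from acceptable code-switching.

Contribution: Proposes LCG, which predicts appropriate language families and masks tokens selectively, substantially reducing language confusion without hurting task performance.

Method: LCG is trained with norm-adjusted self-distillation to predict appropriate language families, and it applies masking only when needed.

Result: Across several models, including Qwen3, GPT-OSS, Gemma3, and Llama3.1, LCG reduces language confusion significantly, often by an order of magnitude.

Insight: Language confusion is infrequent, correct-language tokens are usually among the top predictions, and output token embedding norms are larger for high-resource languages, which biases sampling.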

Abstract: Large language models (LLMs) often experience language confusion, which is the unintended mixing of languages during text generation. Current solutions to this problem either necessitate model retraining or cannot differentiate between harmful confusion and acceptable code-switching. This paper introduces the Language Confusion Gate (LCG), a lightweight, plug-in solution that filters tokens during decoding without altering the base LLM. The LCG is trained using norm-adjusted self-distillation to predict appropriate language families and apply masking only when needed. Our method is based on the findings that language confusion is infrequent, correct-language tokens are usually among the top predictions, and output token embedding norms are larger for high-resource languages, which biases sampling. When evaluated across various models, including Qwen3, GPT-OSS, Gemma3, Llama3.1, LCG decreases language confusion significantly, often by an order of magnitude, without negatively impacting task performance. Code is available at https://github.com/collinzrj/language_confusion_gate.
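
The decoding-time filtering described above can be pictured as a mask over next-token logits. The sketch below is a simplified, rule-based stand-in and not the released implementation: in the paper the gate is a trained predictor of language families, whereas here the per-token family labels, the allowed set, and the "mask only if the top token is disallowed" rule are all assumptions made for illustration.

```python
import math

def gate_logits(logits, token_family, allowed_families):
    """Mask tokens from disallowed language families, but only when needed.

    logits[i] is the score of vocabulary entry i and token_family[i] is its
    (assumed known) language family. If the top-scoring token is already in
    an allowed family, the logits are returned unchanged.
    """
    top = max(range(len(logits)), key=logits.__getitem__)
    if token_family[top] in allowed_families:
        return list(logits)  # no confusion detected, so leave decoding alone
    return [
        score if token_family[i] in allowed_families else -math.inf
        for i, score in enumerate(logits)
    ]

# Hypothetical three-token vocabulary where a Chinese token outranks English ones.
masked = gate_logits([2.0, 1.5, 0.1], ["zh", "en", "en"], {"en"})
untouched = gate_logits([2.0, 1.5, 0.1], ["zh", "en", "en"], {"zh", "en"})
```

Leaving the logits alone when the top token is already acceptable is one way to avoid suppressing legitimate code-switching, and it relies on the abstract's observation that correct-language tokens are usually among the top predictions.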

Huiyuan Xie,Chenyang Li,Huining Zhu,Chubin Zhang,Yuxiao Ye,Zhenghao Liu,Zhiyuan Liu

Main category: cs.CL

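TL;DR: The paper presents LawChain, a framework for modeling legal reasoning chains in Chinese tort-related civil cases, and uses the LawChain$_{eval}$ benchmark to show that current language models fall short in tort legal reasoning. It also introduces several baselines that incorporate LawChain-style reasoning and improve reasoning performance.

Details Motivation: Existing research on legal reasoning focuses largely on criminal cases, and generic reasoning frameworks do not capture the nuanced processes of legal reasoning. This work addresses the lack of reasoning models for civil tort cases.

Contribution: 1. Proposes the LawChain framework, which decomposes tort legal reasoning into three modules with multiple finer-grained sub-steps; 2. constructs the LawChain$_{eval}$ evaluation benchmark; 3. shows that current language models fall short in tort reasoning and proposes baselines that improve on them.

Method: LawChain divides tort legal reasoning into three modules, each with several sub-steps. Language models are evaluated on the benchmark, and baselines are built that incorporate LawChain-style reasoning through prompting or post-training.

Result: Current language models handle tort legal reasoning poorly, while the proposed baselines improve reasoning significantly and generalize to other legal analysis tasks.

Insight: Explicitly modeling legal reasoning chains improves the reasoning ability of language models, with particular promise for civil cases.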

Abstract: Legal reasoning is a fundamental component of legal analysis and decision-making. Existing computational approaches to legal reasoning predominantly rely on generic reasoning frameworks such as syllogism and IRAC, which do not comprehensively examine the nuanced processes that underpin legal reasoning. Moreover, current research has largely focused on criminal cases, with insufficient modeling for civil cases. In this work, we present a novel framework for explicitly modeling legal reasoning in the analysis of Chinese tort-related civil cases. We first operationalize the legal reasoning processes used in tort analysis into the LawChain framework. LawChain is a three-module reasoning framework, with each module consisting of multiple finer-grained sub-steps. Informed by the LawChain framework, we introduce the task of tort legal reasoning and construct an evaluation benchmark, LawChain$_{eval}$, to systematically assess the critical steps within analytical reasoning chains for tort analysis. Leveraging this benchmark, we evaluate state-of-the-art large language models for their legal reasoning ability in civil tort contexts. Our results indicate that current models still fall short in accurately handling crucial elements of tort legal reasoning. Furthermore, we introduce several baseline approaches that explicitly incorporate LawChain-style reasoning through prompting or post-training. We conduct further experiments on additional legal analysis tasks, such as Legal Named-Entity Recognition and Criminal Damages Calculation, to verify the generalizability of these baselines. The proposed baseline approaches achieve significant improvements in tort-related legal reasoning and generalize well to related legal analysis tasks, thus demonstrating the value of explicitly modeling legal reasoning chains to enhance the reasoning capabilities of language models.

[62] QueST: Incentivizing LLMs to Generate Difficult Problems

Hanxu Hu,Xingxing Zhang,Jannis Vamvas,Rico Sennrich,Furu Wei

Main category: cs.CL

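TL;DR: QueST is a framework that combines difficulty-aware graph sampling with difficulty-aware rejection fine-tuning to directly optimize generators for producing challenging coding problems, which improves downstream performance.

Details Motivation: LLMs perform strongly on reasoning tasks, but their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding problems for training. Existing methods struggle to generate high-quality hard problems efficiently.

Contribution: Proposes QueST, which combines difficulty-aware graph sampling and rejection fine-tuning to train generators of challenging coding problems, and supplies data for downstream distillation and reinforcement learning.

Method: 1. Optimize the generator with difficulty-aware graph sampling and rejection fine-tuning; 2. generate large-scale synthetic coding problems; 3. use them for distillation from strong teacher models or for reinforcement learning.

Result: After fine-tuning Qwen3-8B-base on 100K QueST-generated problems, the model surpasses the original Qwen3-8B on LiveCodeBench. With an additional 112K examples, the 8B model matches the much larger DeepSeek-R1-671B.

Insight: Synthesizing difficult problems with QueST offers an effective and scalable way to advance competitive coding and reasoning in LLMs.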

Abstract: Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding problem training data. Existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning that directly optimizes specialized generators to create challenging coding problems. Our trained generators demonstrate superior capability compared to even GPT-4o at creating challenging problems that benefit downstream performance. We leverage QueST to generate large-scale synthetic coding problems, which we then use to distill from strong teacher models with long chain-of-thought or to conduct reinforcement learning for smaller models, proving effective in both scenarios. Our distillation experiments demonstrate significant performance gains. Specifically, after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, we surpass the performance of the original Qwen3-8B on LiveCodeBench. With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B. These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.

[63] Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations

Tong Chen,Akari Asai,Luke Zettlemoyer,Hannaneh Hajishirzi,Faeze Brahman

Main category: cs.CL

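TL;DR: The paper proposes an online reinforcement learning method built on a binary retrieval-augmented reward (binary RAR) that reduces extrinsic hallucination without hurting open-ended generation or downstream task performance.

Details Motivation: Language models often generate factually incorrect information unsupported by their training data (extrinsic hallucination). Existing mitigation methods tend to degrade open-ended generation and downstream performance, which limits their practical value.

Contribution: 1. Proposes a binary retrieval-augmented reward (binary RAR) that rewards an output only when it is entirely factually correct. 2. Shows across several tasks that the method substantially reduces hallucination (for example, a 39.3% reduction in hallucination rate on open-ended generation) while preserving performance on other tasks.

Method: An online reinforcement learning framework with a binary reward: one when the output is entirely factually correct, zero otherwise. Under this training the model learns to answer “I don’t know” when its knowledge is insufficient.

Result: 1. Hallucination rate on open-ended generation drops by 39.3%. 2. On short-form question answering, the model learns calibrated abstention, producing 44.4% (PopQA) and 21.7% (GPQA) fewer incorrect answers. 3. Performance on other tasks (instruction following, math, code) is unaffected.

Insight: A binary reward balances factuality and task performance better than a continuous reward, and the model learns sensibly conservative behavior under uncertainty.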

Abstract: Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model’s output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting “I don’t know” when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.
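
The contrast between the binary reward and a continuous one can be made concrete with a small sketch. Everything here beyond the one-or-zero rule stated in the abstract is an assumption: the claim-level decomposition, the `is_supported` verifier callback (standing in for a retrieval-based check), and the convention that an answer with no factual claims earns a reward of one are all hypothetical.

```python
def binary_rar(claims, is_supported):
    """Reward 1.0 only if every claim passes the verifier, else 0.0.

    `is_supported` is a hypothetical callback standing in for a
    retrieval-backed factuality check. An empty claim list (for example,
    an abstention) is treated as fully correct in this sketch.
    """
    return 1.0 if all(is_supported(claim) for claim in claims) else 0.0

def continuous_reward(claims, is_supported):
    """Baseline for comparison: the fraction of claims that are supported."""
    if not claims:
        return 1.0
    return sum(1 for claim in claims if is_supported(claim)) / len(claims)

# Hypothetical verifier backed by a tiny set of "retrieved" facts.
evidence = {"water boils at 100 C at sea level"}
claims = ["water boils at 100 C at sea level", "the moon is made of basalt only"]
strict = binary_rar(claims, evidence.__contains__)
partial = continuous_reward(claims, evidence.__contains__)
```

With one unsupported claim the continuous score still pays out half the reward, while the binary score pays nothing, which removes the incentive to pad an answer with plausible but unverified statements.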

[64] Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

Austin Xu,Xuan-Phi Nguyen,Yilun Zhou,Chien-Sheng Wu,Caiming Xiong,Shafiq Joty

Main category: cs.CL

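TL;DR: The paper presents FARE, a family of automatic evaluators trained on large-scale data with a simple supervised fine-tuning approach. FARE performs strongly across multiple evaluation tasks and surpasses larger specialized evaluators.

Details Motivation: Demand for scalable evaluation at training and test time is growing, yet recent work has focused on new methodology such as reinforcement learning (RL) and has neglected large-scale, data-driven development. This paper focuses on data scaling to fill that gap.

Contribution: 1. Curates a dataset of 2.5M samples spanning five evaluation tasks and multiple reasoning-focused domains. 2. Introduces the FARE evaluator family (8B and 20B parameters), which reaches high performance with simple supervised fine-tuning.

Method: Trains the FARE evaluators with iterative rejection-sampling supervised fine-tuning (SFT) on a dataset covering multiple evaluation tasks and reasoning domains.

Result: FARE-8B rivals larger specialized RL-trained evaluators, and FARE-20B sets a new standard for open-source evaluators, surpassing specialized 70B+ evaluators. In real-world tasks, FARE-20B achieves near-oracle performance on MATH as an inference-time reranker, and as a verifier in RL training FARE improves downstream model performance by up to 14.1% over string-matching verifiers.

Insight: Data-driven development can substantially improve evaluators, and simple supervised fine-tuning can match or exceed more complex RL methods.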

Abstract: Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.

[65] Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics

Akshara Prabhakar,Roshan Ram,Zixiang Chen,Silvio Savarese,Frank Wang,Caiming Xiong,Huan Wang,Weiran Yao

Main category: cs.CL

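TL;DR: The paper presents Enterprise Deep Research (EDR), a multi-agent system for enterprise analytics. A Master Planning Agent works with several specialized search agents, supported by an extensible tool ecosystem and a visualization module, to enable automated report generation and seamless enterprise deployment. EDR outperforms existing methods on open-ended benchmarks.

Details Motivation: Enterprises face pressure to turn large volumes of unstructured data into actionable insights, while existing agentic systems struggle with domain specificity, intent alignment, and enterprise integration.

Contribution: 1. Proposes the EDR multi-agent system with a Master Planning Agent and multiple specialized search agents; 2. Supports an extensible tool ecosystem and data visualization; 3. Introduces a reflection mechanism that refines the research direction; 4. Outperforms existing methods on open-ended benchmarks.

Method: 1. A Master Planning Agent decomposes queries; 2. Four specialized search agents (General, Academic, GitHub, LinkedIn) carry out the searches; 3. An extensible MCP-based tool ecosystem supports NL2SQL and file analysis; 4. A Visualization Agent provides data-driven insights; 5. A reflection mechanism detects knowledge gaps and adjusts the research direction.

Result: EDR outperforms existing agentic systems on open-ended benchmarks including DeepResearch Bench and DeepConsult.

Insight: Multi-agent collaboration combined with specialized tools and a reflection mechanism can improve the efficiency and accuracy of enterprise data analysis.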

Abstract: As information grows exponentially, enterprises face increasing pressure to transform unstructured data into coherent, actionable insights. While autonomous agents show promise, they often struggle with domain-specific nuances, intent alignment, and enterprise integration. We present Enterprise Deep Research (EDR), a multi-agent system that integrates (1) a Master Planning Agent for adaptive query decomposition, (2) four specialized search agents (General, Academic, GitHub, LinkedIn), (3) an extensible MCP-based tool ecosystem supporting NL2SQL, file analysis, and enterprise workflows, (4) a Visualization Agent for data-driven insights, and (5) a reflection mechanism that detects knowledge gaps and updates research direction with optional human-in-the-loop steering guidance. These components enable automated report generation, real-time streaming, and seamless enterprise deployment, as validated on internal datasets. On open-ended benchmarks including DeepResearch Bench and DeepConsult, EDR outperforms state-of-the-art agentic systems without any human steering. We release the EDR framework and benchmark trajectories to advance research on multi-agent reasoning applications. Code at https://github.com/SalesforceAIResearch/enterprise-deep-research and Dataset at https://huggingface.co/datasets/Salesforce/EDR-200

cs.CV [Back]

[66] ESCA: Contextualizing Embodied Agents via Scene-Graph Generation

Jiani Huang,Amish Sethi,Matthew Kuo,Mayank Keoliya,Neelay Velingker,JungHo Jung,Ser-Nam Lim,Ziyang Li,Mayur Naik

Main category: cs.CV

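TL;DR: The paper proposes ESCA, a framework that contextualizes embodied agents through structured spatial-temporal understanding. At its core is SGClip, a CLIP-based, open-domain, promptable scene graph generation model trained through neurosymbolic learning without human annotations.

Details Motivation: Current multi-modal large language model (MLLM) training relies mainly on high-level vision-sound-text pairs and lacks fine-grained, structured alignment. ESCA aims to close this gap and improve agents' spatial-temporal understanding.

Contribution: 1. Proposes SGClip, a scene graph generation model that supports both prompt-based inference and task-specific fine-tuning; 2. The ESCA framework significantly reduces perception errors and improves the performance of open-source MLLMs.

Method: 1. SGClip is trained with self-supervision on 87K+ open-domain videos through a neurosymbolic learning pipeline; 2. Training combines video-caption pairs with structured reasoning and requires no human-labeled scene graphs; 3. The model supports scene graph generation and action localization tasks.

Result: SGClip excels on scene graph generation and action localization benchmarks. ESCA consistently improves both open-source and commercial MLLMs and achieves state-of-the-art performance across two embodied environments.

Insight: Structured visual-semantic alignment, such as scene graphs, can improve the perception of embodied agents; combining self-supervision with neurosymbolic learning avoids costly human annotation.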

Abstract: Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, current training pipelines primarily rely on high-level vision-sound-text pairs and lack fine-grained, structured alignment between pixel-level visual content and textual semantics. To overcome this challenge, we propose ESCA, a new framework for contextualizing embodied agents through structured spatial-temporal understanding. At its core is SGClip, a novel CLIP-based, open-domain, and promptable model for generating scene graphs. SGClip is trained on 87K+ open-domain videos via a neurosymbolic learning pipeline, which harnesses model-driven self-supervision from video-caption pairs and structured reasoning, thereby eliminating the need for human-labeled scene graph annotations. We demonstrate that SGClip supports both prompt-based inference and task-specific fine-tuning, excelling in scene graph generation and action localization benchmarks. ESCA with SGClip consistently improves both open-source and commercial MLLMs, achieving state-of-the-art performance across two embodied environments. Notably, it significantly reduces agent perception errors and enables open-source models to surpass proprietary baselines.

[67] CrossRay3D: Geometry and Distribution Guidance for Efficient Multimodal 3D Detection

Huiming Yang

Main category: cs.CV

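TL;DR: CrossRay3D improves sparse multimodal 3D detectors by using geometric structure and class distribution as guidance, addressing the poor token representation quality of existing methods and substantially improving performance.

Details Motivation: Sparse multimodal detectors offer advantages in computational cost and adaptability to downstream tasks, but the low quality of their token representations limits performance. The author argues that geometric structure and class distribution are the key points for improvement.

Contribution: 1. Proposes the Sparse Selector (SS) module, comprising Ray-Aware Supervision (RAS) and Class-Balanced Supervision. 2. Designs Ray Positional Encoding (Ray PE) to address the distribution gap between the LiDAR and image modalities. 3. Combines these modules into the CrossRay3D framework for efficient multimodal 3D detection.

Method: 1. RAS preserves geometric information during training; Class-Balanced Supervision adaptively reweights class salience so that tokens for small objects are retained. 2. Ray PE addresses the distribution differences between modalities. 3. The modules are integrated into CrossRay3D, an end-to-end sparse multimodal detector.

Result: On the nuScenes benchmark, CrossRay3D reaches 72.4 mAP and 74.7 NDS, runs 1.84× faster than other leading methods, and remains robust when a modality is partially or entirely missing.

Insight: Geometric structure and class distribution are key to improving sparse multimodal detectors, and aligning the distributions of the modalities also improves model performance.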

Abstract: The sparse cross-modality detector offers more advantages than its counterpart, the Bird’s-Eye-View (BEV) detector, particularly in terms of adaptability for downstream tasks and computational cost savings. However, existing sparse detectors overlook the quality of token representation, leaving it with a sub-optimal foreground quality and limited performance. In this paper, we identify that the preserved geometric structure and the class distribution are key to improving the performance of the sparse detector, and propose a Sparse Selector (SS). The core modules of SS are Ray-Aware Supervision (RAS), which preserves rich geometric information during the training stage, and Class-Balanced Supervision, which adaptively reweights the salience of class semantics, ensuring that tokens associated with small objects are retained during token sampling. As a result, SS outperforms other sparse multi-modal detectors in the representation of tokens. Additionally, we design Ray Positional Encoding (Ray PE) to address the distribution differences between the LiDAR modality and the image. Finally, we integrate the aforementioned modules into an end-to-end sparse multi-modality detector, dubbed CrossRay3D. Experiments show that, on the challenging nuScenes benchmark, CrossRay3D achieves state-of-the-art performance with 72.4 mAP and 74.7 NDS, while running 1.84× faster than other leading methods. Moreover, CrossRay3D demonstrates strong robustness even in scenarios where LiDAR or camera data are partially or entirely missing.

[68] InfraGPT Smart Infrastructure: An End-to-End VLM-Based Framework for Detecting and Managing Urban Defects

Ibrahim Sheikh Mohamed,Abdullah Yahya Abdullah Omaisan

Main category: cs.CV

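TL;DR: The paper proposes InfraGPT, an end-to-end framework based on a vision language model (VLM) for detecting and managing urban defects in smart infrastructure. By combining YOLO object detectors with a VLM, the system performs multi-defect detection and generates structured repair recommendations.

Details Motivation: Defects in urban infrastructure, such as cracks, potholes, and leaks, threaten public safety. Manual inspection is costly and hazardous, and existing automatic systems typically address a single defect type or produce unstructured output that cannot directly guide maintenance work.

Contribution: 1) Proposes an end-to-end pipeline that combines YOLO object detectors with a VLM; 2) Generates a structured repair plan in JSON format, including descriptions, recommended tools, and urgent alerts; 3) Evaluates the system on public datasets and captured CCTV clips.

Method: 1) YOLO-family detectors perform multi-defect detection and segmentation; 2) The detections are passed to a VLM, which produces a scene-aware, structured repair plan; 3) The output is a JSON document containing the details of the plan.

Result: On public datasets and captured CCTV clips, the system accurately identifies diverse defects and produces coherent repair summaries.

Insight: InfraGPT shows the potential of VLMs for urban infrastructure maintenance, but scaling to city-wide deployment still requires addressing challenges such as data scale and real-time operation.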

Abstract: Infrastructure in smart cities is increasingly monitored by networks of closed circuit television (CCTV) cameras. Roads, bridges and tunnels develop cracks, potholes, and fluid leaks that threaten public safety and require timely repair. Manual inspection is costly and hazardous, and existing automatic systems typically address individual defect types or provide unstructured outputs that cannot directly guide maintenance crews. This paper proposes a comprehensive pipeline that leverages street CCTV streams for multi defect detection and segmentation using the YOLO family of object detectors and passes the detections to a vision language model (VLM) for scene aware summarization. The VLM generates a structured action plan in JSON format that includes incident descriptions, recommended tools, dimensions, repair plans, and urgent alerts. We review literature on pothole, crack and leak detection, highlight recent advances in large vision language models such as QwenVL and LLaVA, and describe the design of our early prototype. Experimental evaluation on public datasets and captured CCTV clips demonstrates that the system accurately identifies diverse defects and produces coherent summaries. We conclude by discussing challenges and directions for scaling the system to city wide deployments.
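The abstract lists the kinds of content in the JSON action plan (incident descriptions, recommended tools, dimensions, repair plans, urgent alerts) but not its schema. The sketch below shows one way such a record could be represented and serialized; every field name and value here is a hypothetical illustration, not the authors' format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class ActionPlan:
    # Field names are assumptions; the abstract only names the content types.
    incident_description: str
    recommended_tools: List[str]
    dimensions_m: Dict[str, float]
    repair_plan: List[str]
    urgent_alert: bool = False


# Made-up example values, for illustration only.
plan = ActionPlan(
    incident_description="Pothole detected in the right lane",
    recommended_tools=["cold-mix asphalt", "tamper"],
    dimensions_m={"width": 0.4, "length": 0.6},
    repair_plan=["cordon off lane", "clean cavity", "fill and compact"],
    urgent_alert=True,
)

# Serialize to JSON so downstream maintenance tooling can consume it.
payload = json.dumps(asdict(plan), indent=2)
print(payload)
```

A fixed schema like this makes the VLM output machine-checkable: a response that fails to parse into the record can be rejected or regenerated.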

[69] IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection

Zewen Li,Zitong Yu,Qilang Ye,Weicheng Xie,Wei Zhuo,Linlin Shen

Main category: cs.CV

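TL;DR: IAD-GPT applies multimodal large language models (MLLMs) to industrial anomaly detection (IAD), improving detection and segmentation by generating detailed anomaly prompts and strengthening visual grounding.

Details Motivation: Traditional IAD methods cannot provide multi-turn dialogue or detailed descriptions, and existing methods based on large models have not fully exploited those models' potential for anomaly detection.

Contribution: Proposes the IAD-GPT framework, which combines text semantics with image information and introduces an Abnormal Prompt Generator (APG), a Text-Guided Enhancer, and a Multi-Mask Fusion module.

Method: The APG generates anomaly prompts, the Text-Guided Enhancer dynamically selects feature enhancement pathways, and the Multi-Mask Fusion module improves pixel-level anomaly perception.

Result: Achieves state-of-the-art performance on self-supervised and few-shot anomaly detection and segmentation on the MVTec-AD and VisA datasets.

Insight: Through interaction between text and visual features, MLLMs show stronger visual grounding and semantic understanding in industrial anomaly detection.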

Abstract: The robust causal capability of Multimodal Large Language Models (MLLMs) holds the potential of detecting defective objects in Industrial Anomaly Detection (IAD). However, most traditional IAD methods lack the ability to provide multi-turn human-machine dialogues and detailed descriptions, such as the color of objects, the shape of an anomaly, or specific types of anomalies. At the same time, methods based on large pre-trained models have not fully stimulated the ability of large models in anomaly detection tasks. In this paper, we explore the combination of rich text semantics with both image-level and pixel-level information from images and propose IAD-GPT, a novel paradigm based on MLLMs for IAD. We employ an Abnormal Prompt Generator (APG) to generate detailed anomaly prompts for specific objects. These specific prompts from the large language model (LLM) are used to activate the detection and segmentation functions of the pre-trained visual-language model (i.e., CLIP). To enhance the visual grounding ability of MLLMs, we propose a Text-Guided Enhancer, wherein image features interact with normal and abnormal text prompts to dynamically select enhancement pathways, which enables language models to focus on specific aspects of visual data, enhancing their ability to accurately interpret and respond to anomalies within images. Moreover, we design a Multi-Mask Fusion module to incorporate masks as expert knowledge, which enhances the LLM’s perception of pixel-level anomalies. Extensive experiments on the MVTec-AD and VisA datasets demonstrate our state-of-the-art performance on self-supervised and few-shot anomaly detection and segmentation tasks. The code is available at https://github.com/LiZeWen1225/IAD-GPT.

[70] ObjectTransforms for Uncertainty Quantification and Reduction in Vision-Based Perception for Autonomous Vehicles

Nishad Sahu,Shounak Sural,Aditya Satish Patil,Ragunathan Rajkumar

Main category: cs.CV

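TL;DR: The paper proposes ObjectTransforms, which applies object-specific transformations at training and inference time to quantify and reduce uncertainty in vision-based object detection. It uses color space perturbations and diffusion models to diversify the training data, and it quantifies uncertainty in real time from the variance of detection scores, improving detection accuracy and robustness.

Details Motivation: Vision-based object detection in autonomous driving is subject to uncertainty from issues such as data bias and distributional shift, which can create safety risks. The paper aims to quantify and reduce this uncertainty through object-specific transformations and so make perception more reliable.

Contribution: 1. Proposes ObjectTransforms, which transforms objects at both training and inference time to quantify and reduce uncertainty; 2. Uses color space perturbations and diffusion models to increase data diversity; 3. Shows experimentally that the method improves detection accuracy and uncertainty quantification.

Method: 1. Training: color space perturbations are applied to individual objects to improve robustness to lighting and color variation, and diffusion models generate diverse pedestrian instances. 2. Inference: detected objects are perturbed, the variance of the detection scores quantifies uncertainty in real time, and this signal is used to filter false positives and recover false negatives.

Result: Experiments with YOLOv8 on the NuImages 10K dataset show that ObjectTransforms improves detection accuracy across all object classes and assigns higher uncertainty to false positives than to true positives at inference time.

Insight: ObjectTransforms is a lightweight yet effective mechanism that reduces uncertainty during training and quantifies it during inference, offering a new way to make autonomous driving perception more reliable.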

Abstract: Reliable perception is fundamental for safety-critical decision making in autonomous driving. Yet, vision-based object detector neural networks remain vulnerable to uncertainty arising from issues such as data bias and distributional shifts. In this paper, we introduce ObjectTransforms, a technique for quantifying and reducing uncertainty in vision-based object detection through object-specific transformations at both training and inference times. At training time, ObjectTransforms performs color space perturbations on individual objects, improving robustness to lighting and color variations. ObjectTransforms also uses diffusion models to generate realistic, diverse pedestrian instances. At inference time, object perturbations are applied to detected objects and the variance of detection scores is used to quantify predictive uncertainty in real time. This uncertainty signal is then used to filter out false positives and also recover false negatives, improving the overall precision-recall curve. Experiments with YOLOv8 on the NuImages 10K dataset demonstrate that our method yields notable accuracy improvements and uncertainty reduction across all object classes during training, while predicting desirably higher uncertainty values for false positives as compared to true positives during inference. Our results highlight the potential of ObjectTransforms as a lightweight yet effective mechanism for reducing and quantifying uncertainty in vision-based perception during training and inference respectively.
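The inference-time idea, using the variance of detection scores across perturbed copies of a detected object as an uncertainty signal, can be sketched as follows. This is a minimal illustration under assumptions: the function names and the variance threshold of 0.01 are hypothetical, and the scores are made-up numbers rather than detector output.

```python
from statistics import mean, pvariance
from typing import Sequence, Tuple


def score_uncertainty(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean score, population variance) over perturbed copies of one object."""
    if not scores:
        raise ValueError("need at least one score")
    return mean(scores), pvariance(scores)


def keep_detection(scores: Sequence[float], max_variance: float = 0.01) -> bool:
    """Keep a detection only if its scores are stable under perturbation.

    The threshold is an assumed value for illustration, not one from the paper.
    """
    _, variance = score_uncertainty(scores)
    return variance <= max_variance


# Made-up detector confidences for four perturbed versions of each object.
stable_scores = [0.91, 0.90, 0.92, 0.89]
unstable_scores = [0.85, 0.30, 0.70, 0.15]

print(keep_detection(stable_scores))
print(keep_detection(unstable_scores))
```

A detection whose confidence swings widely under small perturbations is treated as uncertain and filtered; the same signal could also be used to revisit low-score detections whose scores stay consistent.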

[71] Aria Gen 2 Pilot Dataset

Chen Kong,James Fort,Aria Kang,Jonathan Wittmer,Simon Green,Tianwei Shen,Yipu Zhao,Cheng Peng,Gustavo Solaira,Andrew Berkovich,Nikhil Raina,Vijay Baiyya,Evgeniy Oleinik,Eric Huang,Fan Zhang,Julian Straub,Mark Schwesinger,Luis Pesqueira,Xiaqing Pan,Jakob Julian Engel,Carl Ren,Mingfei Yan,Richard Newcombe

Main category: cs.CV

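TL;DR: The Aria Gen 2 Pilot Dataset (A2PD) is an egocentric multimodal dataset captured with Aria Gen 2 glasses. It provides comprehensive sensor data together with the outputs of several machine perception algorithms.

Details Motivation: High-quality egocentric multimodal datasets are scarce, particularly for complex everyday scenarios. A2PD aims to fill this gap and help researchers and developers build stronger perception algorithms.

Contribution: Releases an open multimodal dataset covering five everyday scenarios, with raw sensor data and perception algorithm outputs.

Method: The dataset is captured with Aria Gen 2 glasses and records the daily activities of multiple wearers, such as cleaning, cooking, and outdoor walking. It provides raw sensor data and the outputs of various machine perception algorithms.

Result: The data illustrate the device's ability to perceive the wearer, the environment, and their interactions, with robust performance across users and conditions. The dataset is released incrementally and comes with open-source tools.

Insight: Beyond filling a gap in available data, A2PD supports further community development through its open-source tools.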

Abstract: The Aria Gen 2 Pilot Dataset (A2PD) is an egocentric multimodal open dataset captured using the state-of-the-art Aria Gen 2 glasses. To facilitate timely access, A2PD is released incrementally with ongoing dataset enhancements. The initial release features Dia’ane, our primary subject, who records her daily activities alongside friends, each equipped with Aria Gen 2 glasses. It encompasses five primary scenarios: cleaning, cooking, eating, playing, and outdoor walking. In each of the scenarios, we provide comprehensive raw sensor data and output data from various machine perception algorithms. These data illustrate the device’s ability to perceive the wearer, the surrounding environment, and interactions between the wearer and the environment, while maintaining robust performance across diverse users and conditions. The A2PD is publicly available at projectaria.com, with open-source tools and usage examples provided in Project Aria Tools.

[72] DuetMatch: Harmonizing Semi-Supervised Brain MRI Segmentation via Decoupled Branch Optimization

Thanh-Huy Nguyen,Hoang-Thien Nguyen,Vi Vu,Ba-Thinh Lam,Phat Huynh,Tianyang Wang,Xingjian Li,Ulas Bagci,Min Xu

Main category: cs.CV

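TL;DR: DuetMatch is a dual-branch semi-supervised learning framework that improves the performance and stability of brain MRI segmentation through asynchronous branch optimization and decoupled perturbation.

Details Motivation: Annotated medical imaging data are limited, which makes semi-supervised learning attractive for its ability to learn from imperfect supervision. However, jointly optimizing the entire network can hinder convergence and stability.

Contribution: Proposes the DuetMatch dual-branch framework and introduces decoupled perturbation and Consistency Matching, improving the performance and robustness of semi-supervised segmentation.

Method: The two branches are optimized asynchronously, each updating either the encoder or the decoder while the other part stays frozen. Decoupled Dropout Perturbation and Pair-wise CutMix Cross-Guidance add regularization and diversity, and Consistency Matching reduces pseudo-label noise.

Result: On datasets including ISLES2022 and BraTS, DuetMatch outperforms state-of-the-art methods.

Insight: Asynchronous optimization and decoupled perturbation improve the convergence and stability of semi-supervised learning, especially under noisy conditions.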

Abstract: The limited availability of annotated data in medical imaging makes semi-supervised learning increasingly appealing for its ability to learn from imperfect supervision. Recently, teacher-student frameworks have gained popularity for their training benefits and robust performance. However, jointly optimizing the entire network can hinder convergence and stability, especially in challenging scenarios. To address this for medical image segmentation, we propose DuetMatch, a novel dual-branch semi-supervised framework with asynchronous optimization, where each branch optimizes either the encoder or decoder while keeping the other frozen. To improve consistency under noisy conditions, we introduce Decoupled Dropout Perturbation, enforcing regularization across branches. We also design Pair-wise CutMix Cross-Guidance to enhance model diversity by exchanging pseudo-labels through augmented input pairs. To mitigate confirmation bias from noisy pseudo-labels, we propose Consistency Matching, refining labels using stable predictions from frozen teacher models. Extensive experiments on benchmark brain MRI segmentation datasets, including ISLES2022 and BraTS, show that DuetMatch consistently outperforms state-of-the-art methods, demonstrating its effectiveness and robustness across diverse semi-supervised segmentation scenarios.

[73] Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI

Zheng Huang,Enpei Zhang,Yinghao Cai,Weikang Qiu,Carl Yang,Elynn Chen,Xiang Zhang,Rex Ying,Dawei Zhou,Yujun Yan

Main category: cs.CV

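TL;DR: The paper studies the reconstruction of visual stimuli from fMRI signals and proposes PRISM, a model that projects fMRI signals into a structured text space as an intermediate representation. Combined with an object-centric diffusion module and an attribute-relationship search module, it improves reconstruction quality.

Details Motivation: The goal is to understand how the brain encodes visual information by reconstructing visual stimuli from fMRI signals. It remains unclear which latent space best supports this and how the generative model should be adapted, so a more effective intermediate representation is needed.

Contribution: 1. Finds that fMRI signals are more similar to the text space of a language model than to vision-based or joint text-image spaces; 2. Proposes PRISM, which improves reconstruction quality through a structured text space and object-oriented generation modules.

Method: PRISM has two modules: an object-centric diffusion module and an attribute-relationship search module. The former generates images by composing individual objects to reduce object detection errors; the latter automatically identifies the key attributes and relationships that best align with neural activity.

Result: Experiments on real-world datasets show that PRISM outperforms existing methods, with up to an 8% reduction in perceptual loss.

Insight: A structured text space is an important bridge between fMRI signals and image reconstruction and captures the compositional nature of visual stimuli more effectively.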

Abstract: Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli, essentially images, from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space and then using a pretrained generative model to reconstruct images. The reconstruction quality depends on how similar the latent space is to the structure of neural activity and how well the generative model produces images from that space. Yet, it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively. We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision based space or a joint text image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute relationship search module that automatically identifies key attributes and relationships that best align with the neural activity. Extensive experiments on real world datasets demonstrate that our framework outperforms existing methods, achieving up to an 8% reduction in perceptual loss. These results highlight the importance of using structured text as the intermediate space to bridge fMRI signals and image reconstruction.

[74] StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales

Nyle Siddiqui,Rohit Gupta,Sirnam Swetha,Mubarak Shah

Main category: cs.CV

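TL;DR: The paper proposes StretchySnake, a flexible training method that addresses the spatio-temporal inflexibility of state space models (SSMs) in video understanding and substantially improves action recognition across spatial and temporal resolutions.

Details Motivation: Video models are usually trained at a fixed resolution and clip length, so performance drops on videos with spatial and temporal resolutions unseen during training. The linear complexity and hidden-state recurrence of SSMs offer a way to address this, but existing training methods do not take full advantage of them.

Contribution: Proposes a flexible SSM training method that samples videos at varying spatial and temporal resolutions during training and dynamically interpolates model weights, so the model adapts to videos of different scales.

Method: Five variants of flexible training are designed and compared; each makes the SSM more adaptable by dynamically adjusting the spatio-temporal resolution and the model weights.

Result: On short-action (UCF-101, HMDB-51) and long-action (COIN, Breakfast) benchmarks, StretchySnake improves on baselines by up to 28% and also adapts well to fine-grained actions (SSV2, Diving-48).

Insight: With flexible training, SSMs gain substantial spatio-temporal flexibility and become an efficient, robust solution for video action recognition.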

Abstract: State space models (SSMs) have emerged as a competitive alternative to transformers in various tasks. Their linear complexity and hidden-state recurrence make them particularly attractive for modeling long sequences, whereas attention becomes quadratically expensive. However, current training methods for video understanding are tailored towards transformers and fail to fully leverage the unique attributes of SSMs. For example, video models are often trained at a fixed resolution and video length to balance the quadratic scaling of attention cost against performance. Consequently, these models suffer from degraded performance when evaluated on videos with spatial and temporal resolutions unseen during training; a property we call spatio-temporal inflexibility. In the context of action recognition, this severely limits a model’s ability to retain performance across both short- and long-form videos. Therefore, we propose a flexible training method that leverages and improves the inherent adaptability of SSMs. Our method samples videos at varying temporal and spatial resolutions during training and dynamically interpolates model weights to accommodate any spatio-temporal scale. This instills our SSM, which we call StretchySnake, with spatio-temporal flexibility and enables it to seamlessly handle videos ranging from short, fine-grained clips to long, complex activities. We introduce and compare five different variants of flexible training, and identify the most effective strategy for video SSMs. On short-action (UCF-101, HMDB-51) and long-action (COIN, Breakfast) benchmarks, StretchySnake outperforms transformer and SSM baselines alike by up to 28%, with strong adaptability to fine-grained actions (SSV2, Diving-48). Therefore, our method provides a simple drop-in training recipe that makes video SSMs more robust, resolution-agnostic, and efficient across diverse action recognition scenarios.
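The abstract says model weights are interpolated to fit any spatio-temporal scale but does not specify which weights or how. As a rough illustration of the general idea only, the sketch below linearly resamples a per-position embedding table to a new sequence length; the function name, the choice of linear interpolation, and the assumption that positional embeddings are the interpolated weights are all hypothetical.

```python
from typing import List


def interpolate_embeddings(table: List[List[float]], new_len: int) -> List[List[float]]:
    """Linearly resample a (length x dim) table to new_len rows, endpoints aligned."""
    old_len = len(table)
    if old_len == 0 or new_len <= 0:
        raise ValueError("table must be non-empty and new_len positive")
    if old_len == 1 or new_len == 1:
        # Nothing to interpolate between: repeat the first row.
        return [list(table[0]) for _ in range(new_len)]
    resampled = []
    for i in range(new_len):
        # Map output index i onto the continuous index range of the old table.
        pos = i * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        resampled.append(
            [(1 - frac) * a + frac * b for a, b in zip(table[lo], table[hi])]
        )
    return resampled


# Stretch a 2-position table to 3 positions; the middle row is the average.
print(interpolate_embeddings([[0.0], [2.0]], 3))
```

Resampling in this way lets a table learned at one clip length or patch grid be reused at another without retraining from scratch.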

[75] VM-BeautyNet: A Synergistic Ensemble of Vision Transformer and Mamba for Facial Beauty Prediction

Djamel Eddine Boukhari

Main category: cs.CV

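TL;DR: The paper proposes VM-BeautyNet, a heterogeneous ensemble architecture that combines the strengths of a Vision Transformer and a Mamba-based vision model for facial beauty prediction, achieving state-of-the-art performance.

Details Motivation: Convolutional neural networks (CNNs) struggle to capture global facial features in facial beauty prediction. Vision Transformers (ViT) address this, but their quadratic complexity can be a bottleneck. A solution is needed that models long-range dependencies efficiently while retaining global feature extraction.

Contribution: Proposes VM-BeautyNet, which fuses the complementary strengths of a ViT and a Mamba-based vision model for efficient and comprehensive facial beauty prediction.

Method: VM-BeautyNet is a heterogeneous ensemble. The ViT backbone captures global facial structure and symmetry, while the Mamba backbone models long-range dependencies with linear complexity, focusing on sequential features and textures.

Result: On the SCUT-FBP5500 dataset, VM-BeautyNet achieves a Pearson Correlation (PC) of 0.9212, a Mean Absolute Error (MAE) of 0.2085, and a Root Mean Square Error (RMSE) of 0.2698.

Insight: Grad-CAM visualizations confirm the complementary feature extraction of the two backbones, suggesting a new architectural paradigm for computational aesthetics.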

Abstract: Facial Beauty Prediction (FBP) is a complex and challenging computer vision task, aiming to model the subjective and intricate nature of human aesthetic perception. While deep learning models, particularly Convolutional Neural Networks (CNNs), have made significant strides, they often struggle to capture the global, holistic facial features that are critical to human judgment. Vision Transformers (ViT) address this by effectively modeling long-range spatial relationships, but their quadratic complexity can be a bottleneck. This paper introduces a novel, heterogeneous ensemble architecture, VM-BeautyNet, that synergistically fuses the complementary strengths of a Vision Transformer and a Mamba-based Vision model, a recent advancement in State-Space Models (SSMs). The ViT backbone excels at capturing global facial structure and symmetry, while the Mamba backbone efficiently models long-range dependencies with linear complexity, focusing on sequential features and textures. We evaluate our approach on the benchmark SCUT-FBP5500 dataset. Our proposed VM-BeautyNet achieves state-of-the-art performance, with a Pearson Correlation (PC) of 0.9212, a Mean Absolute Error (MAE) of 0.2085, and a Root Mean Square Error (RMSE) of 0.2698. Furthermore, through Grad-CAM visualizations, we provide interpretability analysis that confirms the complementary feature extraction of the two backbones, offering new insights into the model’s decision-making process and presenting a powerful new architectural paradigm for computational aesthetics.
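For readers unfamiliar with the three reported metrics, the sketch below computes Pearson Correlation, MAE, and RMSE from their standard definitions. The sample values are made up for illustration and are unrelated to SCUT-FBP5500 or to the paper's results.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def mae(x: Sequence[float], y: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def rmse(x: Sequence[float], y: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Made-up ground-truth ratings and predictions.
truth = [1.0, 2.0, 3.0, 4.0]
pred = [2.0, 4.0, 6.0, 8.0]
print(pearson(truth, pred), mae(truth, pred), rmse(truth, pred))
```

The example also shows why the metrics are reported together: the predictions here correlate perfectly with the ground truth, yet their absolute errors are large.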

[76] Designing a Convolutional Neural Network for High-Accuracy Oral Cavity Squamous Cell Carcinoma (OCSCC) Detection

Vishal Manikanden,Aniketh Bandlamudi,Daniel Haehn

Main category: cs.CV

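TL;DR: The paper presents a convolutional neural network (CNN) designed for high-accuracy detection of Oral Cavity Squamous Cell Carcinoma (OCSCC) and analyzes how image resolution affects detection.

Details Motivation: OCSCC is the most common head and neck cancer, but its subtle early stages, hidden sites of development, and slow growth mean it often goes undetected, leading to preventable deaths. The pattern recognition ability of CNNs could enable early detection.

Contribution: 1. Develops a CNN trained to recognize OCSCC; 2. Designs image capture and processing hardware to support detection; 3. Analyzes the effect of image resolution on prediction accuracy.

Method: 1. Trains the CNN on 4293 images of benign and malignant tumors plus negative samples; 2. Evaluates precision, recall, and Mean Average Precision (mAP); 3. Tests how five common image resolutions affect prediction accuracy; 4. Designs hardware and an application that integrate with the CNN.

Result: Prediction accuracy increased with image resolution on a logarithmic scale, showing diminishing returns at higher pixel counts. The enhancement hardware was used to capture detailed images, and its impact was scored.

Insight: 1. CNNs show potential for early OCSCC detection; 2. Image resolution matters for detection, but higher pixel counts bring diminishing returns; 3. Integrating hardware and software makes practical use more feasible.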

Abstract: Oral Cavity Squamous Cell Carcinoma (OCSCC) is the most common type of head and neck cancer. Due to the subtle nature of its early stages, deep and hidden areas of development, and slow growth, OCSCC often goes undetected, leading to preventable deaths. However, properly trained Convolutional Neural Networks (CNNs), with their precise image segmentation techniques and ability to apply kernel matrices to modify the RGB values of images for accurate image pattern recognition, would be an effective means for early detection of OCSCC. Pairing this neural network with image capturing and processing hardware would allow increased efficacy in OCSCC detection. The aim of our project is to develop a Convolutional Neural Network trained to recognize OCSCC, as well as to design a physical hardware system to capture and process detailed images, in order to determine the image quality required for accurate predictions. A CNN was trained on 4293 training images consisting of benign and malignant tumors, as well as negative samples, and was evaluated for its precision, recall, and Mean Average Precision (mAP) in its predictions of OCSCC. A testing dataset of randomly assorted images of cancerous, non-cancerous, and negative images was chosen, and each image was altered to represent 5 common resolutions. This test data set was thoroughly analyzed by the CNN and predictions were scored on the basis of accuracy. The designed enhancement hardware was used to capture detailed images, and its impact was scored. An application was developed to facilitate the testing process and bring open access to the CNN. Images of increasing resolution resulted in higher-accuracy predictions on a logarithmic scale, demonstrating the diminishing returns of higher pixel counts.

[77] Embody 3D: A Large-scale Multimodal Motion and Behavior Dataset

Claire McLean,Makenzie Meendering,Tristan Swartz,Orri Gabbay,Alexandra Olsen,Rachel Jacobs,Nicholas Rosen,Philippe de Bree,Tony Garcia,Gadsden Merrill,Jake Sandakly,Julia Buffalini,Neham Jain,Steven Krenn,Moneish Kumar,Dejan Markovic,Evonne Ng,Fabian Prada,Andrew Saba,Siwei Zhang,Vasu Agrawal,Tim Godisart,Alexander Richard,Michael Zollhoefer

Main category: cs.CV

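TL;DR: Embody 3D is a large-scale multimodal motion and behavior dataset with 500 hours of 3D motion data, covering single-person motion and multi-person behavior in a range of scenarios.

Details Motivation: To provide high-quality multimodal data for studying the diversity of human motion and behavior.

Contribution: Releases the Embody 3D dataset, which includes 3D motion, hand gestures, conversations in different emotional states, and multi-person collaborative scenarios.

Method: 3D motion data from 439 participants are captured in a multi-camera collection stage, with body tracking, hand tracking, text annotations, and a separate audio track for each participant.

Result: The dataset contains over 54 million frames of tracked 3D motion covering a wide range of motions and interactions.

Insight: The dataset offers a diverse multimodal resource for research on human motion and social behavior.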

Abstract: The Codec Avatars Lab at Meta introduces Embody 3D, a multimodal dataset of 500 individual hours of 3D motion data from 439 participants collected in a multi-camera collection stage, amounting to over 54 million frames of tracked 3D motion. The dataset features a wide range of single-person motion data, including prompted motions, hand gestures, and locomotion; as well as multi-person behavioral and conversational data like discussions, conversations in different emotional states, collaborative activities, and co-living scenarios in an apartment-like space. We provide tracked human motion including hand tracking and body shape, text annotations, and a separate audio track for each participant.

[78] Proactive Scene Decomposition and Reconstruction

Baicheng Li,Zike Yan,Dong Wu,Hongbin Zha

Main category: cs.CV

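TL;DR: The paper proposes proactive scene decomposition and reconstruction, which uses human behavior to refine scene modeling dynamically and relies on Gaussian splatting for efficient rendering.

Details Motivation: Conventional static object-level reconstruction struggles with the ambiguities of dynamic scenes, while human behavior carries rich cues about scene dynamics and suggests a new way to resolve them.

Contribution: Proposes a dynamic scene decomposition and reconstruction framework driven by human-object interactions. It integrates camera and object pose estimation, instance decomposition, and online map updating for efficient modeling and rendering.

Method: By observing intentional human-object interactions, the system dynamically refines a Gaussian-splatting-based decomposition and reconstruction process, building the model progressively.

Result: The method is validated in multiple real-world scenarios and shows promising advantages.

Insight: Human behavior is a key cue for dynamic modeling; combined with efficient rendering, it enables more flexible online scene reconstruction.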

Abstract: Human behaviors are the major causes of scene dynamics and inherently contain rich cues regarding the dynamics. This paper formalizes a new task of proactive scene decomposition and reconstruction, an online approach that leverages human-object interactions to iteratively disassemble and reconstruct the environment. By observing these intentional interactions, we can dynamically refine the decomposition and reconstruction process, addressing inherent ambiguities in static object-level reconstruction. The proposed system effectively integrates multiple tasks in dynamic environments such as accurate camera and object pose estimation, instance decomposition, and online map updating, capitalizing on cues from human-object interactions in egocentric live streams for a flexible, progressive alternative to conventional object-level reconstruction methods. Aided by the Gaussian splatting technique, accurate and consistent dynamic scene modeling is achieved with photorealistic and efficient rendering. The efficacy is validated in multiple real-world scenarios with promising advantages.

[79] Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models

Yue Zheng,Xiufang Shi,Jiming Chen,Yuanchao Shu

Main category: cs.CV

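TL;DR: Cerberus is a real-time video anomaly detection system. It uses a two-stage cascade of vision-language models (VLMs) to achieve efficient and accurate detection, addressing the high computational cost and unstable visual grounding of existing methods.

Details Motivation: Existing VLM-based video anomaly detection methods offer zero-shot capability, but their high computational cost and unstable visual grounding hinder real-time deployment.

Contribution: 1. Proposes Cerberus, a two-stage cascaded system that combines lightweight filtering with fine-grained VLM reasoning; 2. Introduces two key innovations: motion mask prompting and rule-based deviation detection.

Method: Normal behavioral rules are learned offline. Online inference combines lightweight filtering with fine-grained VLM reasoning, and motion mask prompting and rule-based deviation detection improve performance.

Result: Across four datasets, Cerberus averages 57.68 fps (a 151.79× speedup) and 97.2% accuracy, comparable to state-of-the-art VLM-based methods.

Insight: With a cascaded design and targeted innovations, Cerberus achieves efficient and accurate real-time video anomaly detection, showing its practicality.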

Abstract: Video anomaly detection (VAD) has advanced rapidly with the recent development of Vision-Language Models (VLMs). While these models offer superior zero-shot detection capabilities, their immense computational cost and unstable visual grounding performance hinder real-time deployment. To overcome these challenges, we introduce Cerberus, a two-stage cascaded system designed for efficient yet accurate real-time VAD. Cerberus learns normal behavioral rules offline, and combines lightweight filtering with fine-grained VLM reasoning during online inference. The performance gains of Cerberus come from two key innovations: motion mask prompting and rule-based deviation detection. The former directs the VLM’s attention to regions relevant to motion, while the latter identifies anomalies as deviations from learned norms rather than enumerating possible anomalies. Extensive evaluations on four datasets show that Cerberus on average achieves 57.68 fps on an NVIDIA L40S GPU, a 151.79× speedup, and 97.2% accuracy, comparable to the state-of-the-art VLM-based VAD methods, establishing it as a practical solution for real-time video analytics.
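The two-stage cascade described above, a cheap filter that decides which frames reach the expensive model, can be sketched generically as follows. This is an illustration under assumptions: the function names and thresholds are hypothetical, frames are reduced to a single made-up motion value, and the second stage is a placeholder rather than a VLM.

```python
from typing import Callable, List, Sequence, Tuple


def cascade(
    frames: Sequence[float],
    cheap_filter: Callable[[float], bool],
    expensive_check: Callable[[float], bool],
) -> Tuple[List[int], int]:
    """Return (indices flagged as anomalous, number of expensive calls made)."""
    flagged: List[int] = []
    expensive_calls = 0
    for idx, frame in enumerate(frames):
        # Stage 1: the lightweight filter discards frames that look normal.
        if not cheap_filter(frame):
            continue
        # Stage 2: only surviving frames pay for the costly check.
        expensive_calls += 1
        if expensive_check(frame):
            flagged.append(idx)
    return flagged, expensive_calls


# Each "frame" is a made-up motion magnitude; thresholds are assumptions.
frames = [0.1, 0.6, 0.9, 0.2, 0.95]
flagged, calls = cascade(
    frames,
    cheap_filter=lambda motion: motion > 0.5,
    expensive_check=lambda motion: motion > 0.8,  # placeholder for VLM reasoning
)
print(flagged, calls)
```

The speedup of such a cascade comes from how many frames the first stage rejects: here the expensive check runs on three of five frames instead of all five.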

[80] OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models

Ryoto Miyamoto,Xin Fan,Fuyuko Kido,Tsuneo Matsumoto,Hayato Yamana

Main category: cs.CV

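TL;DR: OpenLVLM-MIA is a new benchmark that exposes fundamental challenges in evaluating membership inference attacks (MIA) against large vision-language models (LVLMs). By balancing the distributions of member and non-member samples, it shows that existing MIA methods perform close to random chance under unbiased conditions.

Details Motivation: The high success rates that prior MIA studies report on LVLMs may stem from distributional bias introduced during dataset construction rather than from identifying true membership. A transparent, unbiased benchmark is needed to assess how effective MIA methods really are.

Contribution: Proposes the OpenLVLM-MIA benchmark of 6,000 images with balanced member and non-member distributions and ground-truth membership labels across three training stages.

Method: A benchmark dataset with carefully controlled distributions removes the effect of distributional bias on MIA evaluation, and MIA methods are evaluated across the three training stages.

Result: Under unbiased conditions, state-of-the-art MIA methods converge to random chance, revealing their limitations.

Insight: Reported MIA performance depends heavily on whether the data distributions are balanced. Future work on privacy-preserving techniques should build on more transparent and unbiased evaluation.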

Abstract: OpenLVLM-MIA is a new benchmark that highlights fundamental challenges in evaluating membership inference attacks (MIA) against large vision-language models (LVLMs). While prior work has reported high attack success rates, our analysis suggests that these results often arise from detecting distributional bias introduced during dataset construction rather than from identifying true membership status. To address this issue, we introduce a controlled benchmark of 6,000 images where the distributions of member and non-member samples are carefully balanced, and ground-truth membership labels are provided across three distinct training stages. Experiments using OpenLVLM-MIA demonstrated that the performance of state-of-the-art MIA methods converged to random chance under unbiased conditions. By offering a transparent and unbiased benchmark, OpenLVLM-MIA clarifies the current limitations of MIA research on LVLMs and provides a solid foundation for developing stronger privacy-preserving techniques.

[81] Stroke2Sketch: Harnessing Stroke Attributes for Training-Free Sketch Generation

Rui Yang,Huining Li,Yiyi Long,Xiaojun Wu,Shengfeng He

Main category: cs.CV

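TL;DR: Stroke2Sketch is a training-free framework that uses a cross-image stroke attention mechanism to transfer the stroke attributes of a reference style precisely, while preserving semantic structure and content fidelity.

Details Motivation: Conventional methods struggle to transfer the stroke attributes of a reference style (such as line thickness, deformation, and texture sparsity) precisely when generating sketches while also preserving semantic structure and content fidelity.

Contribution: The Stroke2Sketch framework, which introduces cross-image stroke attention for training-free stroke attribute transfer, together with adaptive contrast enhancement and semantic-focused attention to reinforce content preservation and foreground emphasis.

Method: Cross-image stroke attention is embedded within self-attention layers to establish fine-grained semantic correspondences. Adaptive contrast enhancement and semantic-focused attention improve content preservation and foreground emphasis.

Result: The generated sketches outperform existing methods in expressive stroke control and semantic coherence, and closely resemble handcrafted results.

Insight: A training-free framework can combine precise stroke attribute transfer with preservation of semantic structure in style transfer, offering a new direction for sketch generation.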

Abstract: Generating sketches guided by reference styles requires precise transfer of stroke attributes, such as line thickness, deformation, and texture sparsity, while preserving semantic structure and content fidelity. To this end, we propose Stroke2Sketch, a novel training-free framework that introduces cross-image stroke attention, a mechanism embedded within self-attention layers to establish fine-grained semantic correspondences and enable accurate stroke attribute transfer. This allows our method to adaptively integrate reference stroke characteristics into content images while maintaining structural integrity. Additionally, we develop adaptive contrast enhancement and semantic-focused attention to reinforce content preservation and foreground emphasis. Stroke2Sketch effectively synthesizes stylistically faithful sketches that closely resemble handcrafted results, outperforming existing methods in expressive stroke control and semantic coherence. Codes are available at https://github.com/rane7/Stroke2Sketch.

[82] Scaling Laws for Deepfake Detection

Wenhao Wang,Longqi Cai,Taihong Xiao,Yuxiao Wang,Ming-Hsuan Yang

Main category: cs.CV

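TL;DR: This paper systematically studies scaling laws for deepfake detection, analyzing how model performance depends on the number of real image domains, deepfake generation methods, and training images. Using the newly built ScaleDF dataset (over 5.8 million real images and more than 8.8 million fake images), it finds that detection error follows a power-law decay as the number of domains or methods increases.

Details Motivation: Rapid progress in deepfake technology places higher demands on detection, yet the scaling behavior of the detection task has not been studied systematically.

Contribution: 1. ScaleDF, the largest deepfake dataset to date; 2. evidence that detection error follows a power-law scaling with the number of domains and methods; 3. a data-centric strategy for countering evolving deepfake technology.

Method: Using ScaleDF, experiments analyze the relationship between detection error and the number of real domains and deepfake methods, and verify the power-law scaling.

Result: Detection error decays as a power law as the number of domains or methods increases. The paper also examines the role of pre-training and data augmentation under scaling.

Insight: 1. The power-law scaling can be used to forecast the additional resources needed to reach a target performance; 2. a data-centric strategy helps counter evolving technology; 3. scaling itself has limitations.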

Abstract: This paper presents a systematic study of scaling laws for the deepfake detection task. Specifically, we analyze the model performance against the number of real image domains, deepfake generation methods, and training images. Since no existing dataset meets the scale requirements for this research, we construct ScaleDF, the largest dataset to date in this field, which contains over 5.8 million real images from 51 different datasets (domains) and more than 8.8 million fake images generated by 102 deepfake methods. Using ScaleDF, we observe power-law scaling similar to that shown in large language models (LLMs). Specifically, the average detection error follows a predictable power-law decay as either the number of real domains or the number of deepfake methods increases. This key observation not only allows us to forecast the number of additional real domains or deepfake methods required to reach a target performance, but also inspires us to counter the evolving deepfake technology in a data-centric manner. Beyond this, we examine the role of pre-training and data augmentations in deepfake detection under scaling, as well as the limitations of scaling itself.
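
The forecasting use of a power law mentioned in the abstract can be sketched as follows. This is an illustrative example, not the paper's fitting procedure: it assumes a pure power law `error = a * n**(-b)` with no irreducible-error offset, fits it by least squares in log-log space, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def fit_power_law(ns: List[float], errors: List[float]) -> Tuple[float, float]:
    """Fit error = a * n**(-b) by least squares on log-transformed values."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errors]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - slope * mx)
    return a, -slope  # b is the negated log-log slope


def domains_needed(a: float, b: float, target_error: float) -> float:
    """Invert the fitted law: the n at which a * n**(-b) equals target_error."""
    return (a / target_error) ** (1.0 / b)
```

Given measurements at a few values of `n` (for example, the number of real domains), `fit_power_law` returns the coefficients and `domains_needed` extrapolates to a target error. Extrapolation of this kind is only as reliable as the assumption that the same law keeps holding outside the measured range.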

[83] RL makes MLLMs see better than SFT

Junha Song,Sangdoo Yun,Dongyoon Han,Jaegul Choo,Byeongho Heo

Main category: cs.CV

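TL;DR: The paper finds that, compared with supervised finetuning (SFT), reinforcement learning (RL) yields stronger and more precisely localized visual representations in the vision encoder of multimodal language models (MLLMs). It proposes PIVOT, an efficient optimization recipe that substantially improves MLLM performance.

Details Motivation: Current MLLM research overlooks the role of the vision encoder, in particular how the RL training paradigm affects visual representations.

Contribution: 1) Shows the advantage of RL over SFT on vision-related MLLM tasks; 2) proposes PIVOT, an efficient way to optimize the vision encoder; 3) shows that PIVOT can outperform larger models at low computational cost.

Method: Experiments compare the effect of RL and SFT across tasks, analyze how the vision encoder changes, and lead to the PIVOT recipe.

Result: Vision encoders trained with RL are stronger, and PIVOT outperforms conventional approaches at low computational cost.

Insight: The RL training paradigm not only improves MLLM downstream performance but also reshapes visual representations, suggesting a new way to optimize vision encoders.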

Abstract: A dominant assumption in Multimodal Language Model (MLLM) research is that its performance is largely inherited from the LLM backbone, given its immense parameter scale and remarkable capabilities. This has created a void in the understanding of the vision encoder, which determines how MLLMs perceive images. The recent shift in MLLM training paradigms, from Supervised Finetuning (SFT) to Reinforcement Learning (RL), magnifies this oversight, namely the significant lack of analysis on how such training reshapes the vision encoder as well as the MLLM. To address this, we first investigate the impact of training strategies on MLLMs, where RL shows a clear advantage over SFT in strongly vision-related VQA benchmarks. Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization. Our results demonstrate that MLLM’s post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes MLLM’s underlying visual representations. Specifically, the key finding of our study is that RL produces stronger and more precisely localized visual representations compared to SFT, boosting the ability of the vision encoder for MLLM. We then reframe our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT). When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily-trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs. Project page available at https://june-page.github.io/pivot/

[84] On the Provable Importance of Gradients for Language-Assisted Image Clustering

Bo Peng,Jie Lu,Guangquan Zhang,Zhen Fang

Main category: cs.CV

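TL;DR: This paper proposes GradNorm, a framework for Language-assisted Image Clustering (LaIC) that addresses how to filter positive nouns, those semantically close to the images, from unlabeled corpus data. The method measures the positiveness of each noun through back-propagated gradients, comes with theoretical guarantees, and achieves the best performance in experiments.

Details Motivation: The core challenge of LaIC is filtering positive nouns that are semantically close to the images from unlabeled corpus data. Existing methods rely mainly on the feature space learned by CLIP and lack theoretical support.

Contribution: The GradNorm framework, which measures noun positiveness through back-propagated gradients, together with a theoretical error bound supporting it.

Method: Noun positiveness is measured by the magnitude of the gradients back-propagated from a cross-entropy loss. The paper proves that existing filtering strategies are special cases of this method.

Result: State-of-the-art clustering performance on multiple benchmarks.

Insight: Gradient information has theoretically grounded importance in LaIC, and GradNorm provides a better way to filter positive nouns.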

Abstract: This paper investigates the recently emerged problem of Language-assisted Image Clustering (LaIC), where textual semantics are leveraged to improve the discriminability of visual representations to facilitate image clustering. Due to the unavailability of true class names, one of the core challenges of LaIC lies in how to filter positive nouns, i.e., those semantically close to the images of interest, from unlabeled wild corpus data. Existing filtering strategies are predominantly based on the off-the-shelf feature space learned by CLIP; however, despite being intuitive, these strategies lack a rigorous theoretical foundation. To fill this gap, we propose a novel gradient-based framework, termed GradNorm, which is theoretically guaranteed and shows strong empirical performance. In particular, we measure the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. Theoretically, we provide a rigorous error bound to quantify the separability of positive nouns by GradNorm and prove that GradNorm naturally subsumes existing filtering strategies as extremely special cases of itself. Empirically, extensive experiments show that GradNorm achieves state-of-the-art clustering performance on various benchmarks.
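
To make "magnitude of gradients back-propagated from the cross-entropy" concrete, the sketch below uses the standard identity that the gradient of the cross-entropy between a target distribution and a softmax output, taken with respect to the logits, equals `softmax(z) - target`. The abstract does not say which parameters the gradient is taken with respect to, how the target distribution is built, or which norm is used, so the choice of logits, the L1 norm, and the function names here are assumptions for illustration only.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_grad_norm(logits: List[float], target: List[float]) -> float:
    """L1 norm of the gradient of CE(target, softmax(z)) with respect to z.

    For a target that sums to 1, that gradient is softmax(z) - target.
    """
    probs = softmax(logits)
    return sum(abs(p - t) for p, t in zip(probs, target))
```

The norm is zero when the softmax output already matches the target and grows as the two distributions diverge, which is what makes a gradient magnitude usable as a score.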

[85] MIRAD - A comprehensive real-world robust anomaly detection dataset for Mass Individualization

Pulin Li,Guocheng Wu,Li Yin,Yuxin Zheng,Wei Zhang,Yanjie Zhou

Main category: cs.CV

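TL;DR: The paper introduces the MIRAD dataset to address the difficulty of anomaly detection in social manufacturing, which arises from heavy customization, small-batch production, and distributed environments. The dataset covers diverse products, data collected at multiple sites, and imaging heterogeneity. Evaluating existing methods on it reveals a significant performance drop, providing a research foundation for Industry 5.0.

Details Motivation: The social manufacturing paradigm creates quality-control challenges, especially for defect detection under heavy customization, small-batch production, and distributed environments. Existing datasets and methods do not adequately address these problems, so a new dataset is needed to support research.

Contribution: 1. MIRAD, the first dataset for anomaly detection in social manufacturing; 2. coverage of product diversity, multi-site data, and imaging heterogeneity; 3. an evaluation of multiple SOTA methods that exposes their limitations in real-world settings.

Method: 1. Build a dataset covering diverse products, multi-site data, and imaging heterogeneity; 2. evaluate a range of anomaly detection methods (one-class, multi-class, and zero-shot) experimentally.

Result: Existing methods perform significantly worse on MIRAD, highlighting the unresolved complexity of anomaly detection in real-world settings.

Insight: Anomaly detection in real industrial settings requires greater robustness and adaptability, especially for customized, small-batch production, where the limitations of existing methods are evident.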

Abstract: Social manufacturing leverages community collaboration and scattered resources to realize mass individualization in modern industry. However, this paradigm shift also introduces substantial challenges in quality control, particularly in defect detection. The main difficulties stem from three aspects. First, products often have highly customized configurations. Second, production typically involves fragmented, small-batch orders. Third, imaging environments vary considerably across distributed sites. To overcome the scarcity of real-world datasets and tailored algorithms, we introduce the Mass Individualization Robust Anomaly Detection (MIRAD) dataset. As the first benchmark explicitly designed for anomaly detection in social manufacturing, MIRAD captures three critical dimensions of this domain: (1) diverse individualized products with large intra-class variation, (2) data collected from six geographically dispersed manufacturing nodes, and (3) substantial imaging heterogeneity, including variations in lighting, background, and motion conditions. We then conduct extensive evaluations of state-of-the-art (SOTA) anomaly detection methods on MIRAD, covering one-class, multi-class, and zero-shot approaches. Results show a significant performance drop across all models compared with conventional benchmarks, highlighting the unresolved complexities of defect detection in real-world individualized production. By bridging industrial requirements and academic research, MIRAD provides a realistic foundation for developing robust quality control solutions essential for Industry 5.0. The dataset is publicly available at https://github.com/wu33learn/MIRAD.

[86] Cataract-LMM: Large-Scale, Multi-Source, Multi-Task Benchmark for Deep Learning in Surgical Video Analysis

Mohammad Javad Ahmadi,Iman Gandomi,Parisa Abdi,Seyed-Farzad Mohammadi,Amirhossein Taslimi,Mehdi Khodaparast,Hassan Hashemi,Mahdi Tavakoli,Hamid D. Taghirad

Main category: cs.CV

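TL;DR: The paper presents Cataract-LMM, a large-scale, multi-source, multi-task dataset of cataract surgery videos that addresses the limited diversity and annotation depth of existing datasets and supports benchmarking of surgical AI tasks.

Details Motivation: Current surgical video datasets lack diversity and annotation depth, which limits the generalization of deep-learning models. To address this, the authors built a large-scale, multi-source cataract surgery video dataset.

Contribution: 1. A dataset of 3,000 cataract surgery videos performed by surgeons with a range of experience levels; 2. four rich annotation layers; 3. benchmarks for surgical AI tasks.

Method: 1. Collect video data from two surgical centers; 2. add four annotation layers (surgical phases, instance segmentation, instrument-tissue interaction, and skill scores); 3. validate data quality through benchmarking experiments.

Result: The dataset supports training and evaluation for surgical AI tasks such as workflow recognition, scene segmentation, and skill assessment, and provides a domain adaptation baseline.

Insight: Large-scale, multi-source datasets with multi-task annotations are essential for improving the generalization and practical usefulness of surgical AI models.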

Abstract: The development of computer-assisted surgery systems depends on large-scale, annotated datasets. Current resources for cataract surgery often lack the diversity and annotation depth needed to train generalizable deep-learning models. To address this gap, we present a dataset of 3,000 phacoemulsification cataract surgery videos from two surgical centers, performed by surgeons with a range of experience levels. This resource is enriched with four annotation layers: temporal surgical phases, instance segmentation of instruments and anatomical structures, instrument-tissue interaction tracking, and quantitative skill scores based on established competency rubrics such as the ICO-OSCAR. The technical quality of the dataset is supported by a series of benchmarking experiments for key surgical AI tasks, including workflow recognition, scene segmentation, and automated skill assessment. Furthermore, we establish a domain adaptation baseline for the phase recognition task by training a model on a subset of surgical centers and evaluating its performance on a held-out center. The dataset and annotations are available via a Google Form (https://docs.google.com/forms/d/e/1FAIpQLSfmyMAPSTGrIy2sTnz0-TMw08ZagTimRulbAQcWdaPwDy187A/viewform?usp=dialog).

[87] iWatchRoadv2: Pothole Detection, Geospatial Mapping, and Intelligent Road Governance

Rishi Raj Sahoo,Surbhi Saswati Mohanty,Subhankar Mishra

Main category: cs.CV

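TL;DR: iWatchRoadv2 is an end-to-end automated platform that detects road potholes in real time with a YOLO model, geotags them precisely by combining GPS and OCR, and provides intelligent governance features that support road maintenance and transparent governance.

Details Motivation: Road potholes pose major challenges for traffic safety and maintenance, especially in regions with under-maintained roads such as India. The lack of automated, transparent solutions motivated the development of this platform.

Contribution: 1) A collected and annotated dataset of over 7,000 frames covering diverse Indian road conditions; 2) a real-time pothole detection system based on a YOLO model; 3) precise geotagging that combines GPS and OCR; 4) intelligent governance features that support automated accountability and warranty management.

Method: 1) Fine-tune the Ultralytics YOLO model on the self-annotated dataset; 2) synchronize OCR-extracted timestamps with external GPS logs; 3) build an optimized backend database to manage metadata; 4) design an intuitive web interface for the public and officials.

Result: The platform automates the full pothole lifecycle from detection to repair verification, improving road maintenance efficiency and governance transparency.

Insight: Integrating computer vision, geographic information systems, and intelligent governance demonstrates the considerable potential of automation for managing urban infrastructure.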

Abstract: Road potholes pose significant safety hazards and maintenance challenges, particularly on India’s diverse and under-maintained road networks. This paper presents iWatchRoadv2, a fully automated end-to-end platform for real-time pothole detection, GPS-based geotagging, and dynamic road health visualization using OpenStreetMap (OSM). We curated a self-annotated dataset of over 7,000 dashcam frames capturing diverse Indian road conditions, weather patterns, and lighting scenarios, which we used to fine-tune the Ultralytics YOLO model for accurate pothole detection. The system synchronizes OCR-extracted video timestamps with external GPS logs to precisely geolocate each detected pothole, enriching detections with comprehensive metadata, including road segment attribution and contractor information managed through an optimized backend database. iWatchRoadv2 introduces intelligent governance features that enable authorities to link road segments with contract metadata through a secure login interface. The system automatically sends alerts to contractors and officials when road health deteriorates, supporting automated accountability and warranty enforcement. The intuitive web interface delivers actionable analytics to stakeholders and the public, facilitating evidence-driven repair planning, budget allocation, and quality assessment. Our cost-effective and scalable solution streamlines frame processing and storage while supporting seamless public engagement for urban and rural deployments. By automating the complete pothole monitoring lifecycle, from detection to repair verification, iWatchRoadv2 enables data-driven smart city management, transparent governance, and sustainable improvements in road infrastructure maintenance. The platform and live demonstration are accessible at https://smlab.niser.ac.in/project/iwatchroad.
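
Synchronizing a frame timestamp with an external GPS log can be sketched as a lookup with interpolation. The abstract does not describe how the platform performs this step, so the linear interpolation, the `GpsFix` tuple layout, and the `locate` function below are assumptions for illustration rather than the system's actual logic.

```python
from bisect import bisect_left
from typing import List, Tuple

GpsFix = Tuple[float, float, float]  # (unix_time, latitude, longitude), hypothetical layout


def locate(timestamp: float, log: List[GpsFix]) -> Tuple[float, float]:
    """Linearly interpolate latitude and longitude at `timestamp`.

    `log` must be sorted by time. Timestamps outside the log are clamped
    to the first or last fix.
    """
    times = [fix[0] for fix in log]
    i = bisect_left(times, timestamp)
    if i == 0:
        return log[0][1], log[0][2]
    if i == len(log):
        return log[-1][1], log[-1][2]
    # bisect_left guarantees times[i - 1] < timestamp <= times[i]
    t0, lat0, lon0 = log[i - 1]
    t1, lat1, lon1 = log[i]
    w = (timestamp - t0) / (t1 - t0)
    return lat0 + w * (lat1 - lat0), lon0 + w * (lon1 - lon0)
```

In this sketch the timestamp would come from the OCR step applied to the dashcam overlay, and each detected pothole would be tagged with the returned coordinates. Linear interpolation in latitude and longitude is a reasonable approximation only over the short intervals between consecutive fixes.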

[88] Demeter: A Parametric Model of Crop Plant Morphology from the Real World

Tianhang Cheng,Albert J. Zhai,Evan Z. Chen,Rui Zhou,Yawen Deng,Zitong Li,Kejie Zhao,Janice Shiu,Qianyu Zhao,Yide Xu,Xinlei Wang,Yuan Shen,Sheng Wang,Lisa Ainsworth,Kaiyu Guan,Shenlong Wang

Main category: cs.CV

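TL;DR: Demeter is a data-driven parametric model of crop plant morphology. It handles varying shape topology across species and models three sources of shape variation: articulation, subcomponent shape variation, and non-rigid deformation.

Details Motivation: General parametric models of plant morphology are lacking, particularly for crop plants. Parametric models for humans and animals are relatively mature, but comparable models for plants are not.

Contribution: Demeter is presented as the first parametric model for crop plants, handling shape topology changes across species while modeling three sources of shape variation. The team also releases a large-scale real-world crop dataset.

Method: Demeter takes a data-driven approach, learning the key factors of plant morphology (topology, shape, articulation, and deformation) from real-world data and encoding them in a compact representation.

Result: Experiments show that Demeter effectively synthesizes plant shapes, reconstructs structures, and simulates biophysical processes, confirming its usefulness for crop plant modeling.

Insight: The work fills a gap in parametric plant modeling, especially for crop plants, and provides a new tool for 3D reconstruction and simulation in agriculture.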

Abstract: Learning 3D parametric shape models of objects has gained popularity in vision and graphics and has shown broad utility in 3D reconstruction, generation, understanding, and simulation. While powerful models exist for humans and animals, equally expressive approaches for modeling plants are lacking. In this work, we present Demeter, a data-driven parametric model that encodes key factors of plant morphology, including topology, shape, articulation, and deformation, into a compact learned representation. Unlike previous parametric models, Demeter handles varying shape topology across various species and models three sources of shape variation: articulation, subcomponent shape variation, and non-rigid deformation. To advance crop plant modeling, we collected a large-scale, ground-truthed dataset from a soybean farm as a testbed. Experiments show that Demeter effectively synthesizes shapes, reconstructs structures, and simulates biophysical processes. Code and data are available at https://tianhang-cheng.github.io/Demeter/.

[89] REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting

Changyue Shi,Minghao Chen,Yiping Mao,Chuxiao Yang,Xinyuan Hu,Jiajun Ding,Zhou Yu

Main category: cs.CV

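TL;DR: REALM is a framework based on a multimodal large language model (MLLM) that performs open-world reasoning segmentation and editing on 3D Gaussian Splatting representations without extensive 3D-specific post-training.

Details Motivation: Existing 3D segmentation methods struggle with ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack 3D spatial understanding. REALM aims to fill this gap.

Contribution: The REALM framework, which combines 3D Gaussian Splatting representations with the reasoning ability of MLLMs for open-world 3D segmentation and editing, and a Global-to-Local Spatial Grounding strategy that improves robustness.

Method: Multi-view renderings generated from the 3D Gaussian Splatting representation are fed to the MLLM for coarse localization. Close-up views are then synthesized for fine-grained segmentation, producing accurate 3D masks.

Result: Strong performance on the LERF, 3D-OVS, and REALM3D benchmarks, with support for 3D interaction tasks such as object removal, replacement, and style transfer.

Insight: Combining MLLM reasoning with the rendering strengths of 3D Gaussian Splatting enables open-world 3D understanding and interaction without extensive post-training.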

Abstract: Bridging the gap between complex human instructions and precise 3D object grounding remains a significant challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation without requiring extensive 3D-specific post-training. We perform segmentation directly on 3D Gaussian Splatting representations, capitalizing on their ability to render photorealistic novel views that are highly suitable for MLLM comprehension. As directly feeding one or more rendered views to the MLLM can lead to high sensitivity to viewpoint selection, we propose a novel Global-to-Local Spatial Grounding strategy. Specifically, multiple global views are first fed into the MLLM agent in parallel for coarse-level localization, aggregating responses to robustly identify the target object. Then, several close-up novel views of the object are synthesized to perform fine-grained local segmentation, yielding accurate and consistent 3D masks. Extensive experiments show that REALM achieves remarkable performance in interpreting both explicit and implicit instructions across LERF, 3D-OVS, and our newly introduced REALM3D benchmarks. Furthermore, our agent framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer, demonstrating its practical utility and versatility. Project page: https://ChangyueShi.github.io/REALM.

[90] SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

Xiaojun Guo,Runyu Zhou,Yifei Wang,Qi Zhang,Chenheng Zhang,Stefanie Jegelka,Xiaohan Wang,Jiajun Chai,Guojun Yin,Wei Lin,Yisen Wang

Main category: cs.CV

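TL;DR: SSL4RL is a novel framework that uses self-supervised learning tasks as verifiable reward signals for RL-based fine-tuning of vision-language models, substantially improving performance on vision-centric and vision-language reasoning tasks.

Details Motivation: Vision-language models (VLMs) integrate visual inputs with language models effectively, but they often fail to use visual evidence adequately, relying instead on linguistic priors or textual shortcuts. Reinforcement learning (RL) can align model behavior, but its application to VLMs has been held back by the lack of scalable and reliable reward mechanisms.

Contribution: 1. The SSL4RL framework, which generates dense, automatic reward signals from self-supervised learning tasks; 2. a systematic study of the key factors that influence the effectiveness of SSL4RL tasks; 3. a demonstration of the framework's generality on graph learning.

Method: 1. Reformulate self-supervised objectives (such as predicting image rotation or reconstructing masked patches) as dense reward signals; 2. require no human preference data or unreliable AI evaluators; 3. apply to both vision-centric and vision-language reasoning tasks.

Result: SSL4RL substantially improves performance on vision-centric and vision-language reasoning tasks and yields significant gains on graph learning.

Insight: Task difficulty, model scale, and semantic alignment with the target domain are the key influencing factors, offering new design principles for future work.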

Abstract: Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives, such as predicting image rotation or reconstructing masked patches, into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors, such as task difficulty, model scale, and semantic alignment with the target domain, that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework’s generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.
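
The idea of turning a self-supervised task into a verifiable reward can be illustrated with rotation prediction, one of the objectives named in the abstract. The sketch below is a toy under stated assumptions: images are small 2D grids, `predict` stands in for the model being fine-tuned, and the reward is a simple binary match. None of these names or choices come from the paper.

```python
import random
from typing import Callable, List

Image = List[List[int]]  # toy 2D grid standing in for an image


def rotate90(img: Image, k: int) -> Image:
    """Rotate a 2D grid clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        img = [list(row) for row in zip(*img[::-1])]
    return img


def rotation_reward(img: Image, predict: Callable[[Image], int],
                    rng: random.Random) -> float:
    """Return 1.0 if `predict` recovers the hidden rotation index, else 0.0.

    The label is created by the transformation itself, so the reward can be
    checked automatically without human annotation.
    """
    k = rng.randrange(4)
    return 1.0 if predict(rotate90(img, k)) == k else 0.0
```

Because the correct answer is generated together with the input, every sample yields a checkable reward, which is what removes the need for preference data or a learned judge.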

[91] LightGlueStick: a Fast and Robust Glue for Joint Point-Line Matching

Aidyn Ubingazhibov,Rémi Pautrat,Iago Suárez,Shaohui Liu,Marc Pollefeys,Viktor Larsson

Main category: cs.CV

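TL;DR: LightGlueStick is a lightweight joint point-line matcher. Its novel Attentional Line Message Passing (ALMP) explicitly exploits line connectivity, improving matching efficiency and achieving state-of-the-art performance on multiple benchmarks.

Details Motivation: Point and line matching have traditionally been treated as independent tasks. GlueStick introduced joint matching, which reduced overall computational complexity, but its heavy architecture prevents real-time use. LightGlueStick aims to provide a lightweight joint matcher.

Contribution: LightGlueStick, a lightweight joint point-line matcher, and Attentional Line Message Passing (ALMP), which explicitly exploits line connectivity to improve matching efficiency.

Method: The ALMP module explicitly exposes the connectivity of line segments to the network, enabling efficient communication between nodes and a lightweight, efficient matcher.

Result: State-of-the-art performance across different benchmarks, confirming the method's effectiveness and efficiency.

Insight: Processing point and line features jointly while explicitly exploiting their connectivity can substantially improve matching performance while keeping the model lightweight and real-time capable.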

Abstract: Lines and points are complementary local features, whose combination has proven effective for applications such as SLAM and Structure-from-Motion. The backbone of these pipelines is the local feature matcher, which establishes correspondences across images. Traditionally, point and line matching have been treated as independent tasks. Recently, GlueStick proposed a GNN-based network that simultaneously operates on points and lines to establish matches. While running a single joint matching reduced the overall computational complexity, the heavy architecture prevented real-time applications or deployment to edge devices. Inspired by recent progress in point matching, we propose LightGlueStick, a lightweight matcher for points and line segments. The key novel component in our architecture is the Attentional Line Message Passing (ALMP), which explicitly exposes the connectivity of the lines to the network, allowing for efficient communication between nodes. In thorough experiments we show that LightGlueStick establishes a new state-of-the-art across different benchmarks. The code is available at https://github.com/aubingazhib/LightGlueStick.

[92] EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning

Haoran Sun,Chen Cai,Huiping Zhuang,Kong Aik Lee,Lap-Pui Chau,Yi Wang

Main category: cs.CV

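TL;DR: The paper proposes the explainable deepfake video detection (EDVD) task and designs EDVD-LLaMA, a multimodal large language model reasoning framework that not only detects forged videos accurately but also provides trustworthy explanations and a traceable reasoning process.

Details Motivation: The rapid development of deepfake video technology has facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection methods lack transparency and generalize poorly to emerging forgery techniques, so detectors that can identify forged content and provide verifiable explanations are urgently needed.

Contribution: 1. The explainable deepfake video detection (EDVD) task; 2. the EDVD-LLaMA multimodal reasoning framework; 3. the Spatio-Temporal Subtle Information Tokenization (ST-SIT) and Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanisms; 4. the Explainable Reasoning FF++ benchmark dataset (ER-FF++set).

Method: 1. Use ST-SIT to extract and fuse global and local cross-frame deepfake features; 2. introduce the Fg-MCoT mechanism, which uses facial feature data as hard constraints to achieve pixel-level spatio-temporal video localization; 3. build the ER-FF++set dataset to support dual supervision for reasoning and detection.

Result: EDVD-LLaMA shows outstanding performance and robustness in detection accuracy, explainability, and cross-forgery-method and cross-dataset scenarios, and is both better and more transparent than traditional methods.

Insight: Introducing explainable reasoning through a multimodal large language model addresses the black-box nature of traditional deepfake detection while also improving detection capability.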

Abstract: The rapid development of deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection (DVD) methods face issues such as a lack of transparency in their principles and insufficient generalization capabilities to cope with evolving forgery techniques. This highlights an urgent need for detectors that can identify forged content and provide verifiable reasoning explanations. This paper proposes the explainable deepfake video detection (EDVD) task and designs EDVD-LLaMA, a multimodal large language model (MLLM) reasoning framework that provides traceable reasoning processes alongside accurate detection results and trustworthy explanations. Our approach first incorporates a Spatio-Temporal Subtle Information Tokenization (ST-SIT) to extract and fuse global and local cross-frame deepfake features, providing rich spatio-temporal semantic information input for MLLM reasoning. Second, we construct a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during the reasoning process to achieve pixel-level spatio-temporal video localization, suppress hallucinated outputs, and enhance the reliability of the chain of thought. In addition, we build an Explainable Reasoning FF++ benchmark dataset (ER-FF++set), leveraging structured data to annotate videos and ensure quality control, thereby supporting dual supervision for reasoning and detection. Extensive experiments demonstrate that EDVD-LLaMA achieves outstanding performance and robustness in terms of detection accuracy, explainability, and its ability to handle cross-forgery methods and cross-dataset scenarios. Compared to previous DVD methods, it provides a more explainable and superior solution. The source code and dataset will be publicly available.

[93] RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba

Kunyu Peng,Di Wen,Jia Fu,Jiamin Wu,Kailun Yang,Junwei Zheng,Ruiping Liu,Yufan Chen,Yuqian Fu,Danda Pani Paudel,Luc Van Gool,Rainer Stiefelhagen

Main category: cs.CV

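TL;DR: RefAtomNet++ is a new framework that addresses cross-modal alignment in Referring Atomic Video Action Recognition (RAVAR). It substantially improves performance through a multi-hierarchical semantic-aligned cross-attention mechanism and multi-trajectory Mamba modeling.

Details Motivation: RAVAR requires precise language-guided action understanding, which is especially important in complex multi-person scenes. Existing methods such as RefAtomNet fall short in aligning cross-modal information and localizing the target person, which hurts fine-grained action prediction.

Contribution: 1. Extends the RefAVA dataset to RefAVA++, with more than 2.9 million frames and more than 75.1k annotated persons; 2. proposes RefAtomNet++, which combines multi-hierarchical semantic-aligned cross-attention with multi-trajectory Mamba modeling to improve cross-modal alignment.

Method: RefAtomNet++ uses a multi-hierarchical semantic-aligned cross-attention mechanism and multi-trajectory Mamba modeling. It dynamically selects visual spatial tokens and models scanning trajectories at the partial-keyword, scene-attribute, and holistic-sentence levels.

Result: Experiments show that RefAtomNet++ achieves state-of-the-art performance on RAVAR.

Insight: Multi-hierarchical semantic alignment and dynamic trajectory modeling effectively address cross-modal alignment and improve the accuracy of fine-grained action recognition.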

Abstract: Referring Atomic Video Action Recognition (RAVAR) aims to recognize fine-grained, atomic-level actions of a specific person of interest conditioned on natural language descriptions. Distinct from conventional action recognition and detection tasks, RAVAR emphasizes precise language-guided action understanding, which is particularly critical for interactive human action analysis in complex multi-person scenarios. In this work, we extend our previously introduced RefAVA dataset to RefAVA++, which comprises >2.9 million frames and >75.1k annotated persons in total. We benchmark this dataset using baselines from multiple related domains, including atomic action localization, video question answering, and text-video retrieval, as well as our earlier model, RefAtomNet. Although RefAtomNet surpasses other baselines by incorporating agent attention to highlight salient features, its ability to align and retrieve cross-modal information remains limited, leading to suboptimal performance in localizing the target person and predicting fine-grained actions. To overcome the aforementioned limitations, we introduce RefAtomNet++, a novel framework that advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism combined with multi-trajectory Mamba modeling at the partial-keyword, scene-attribute, and holistic-sentence levels. In particular, scanning trajectories are constructed by dynamically selecting the nearest visual spatial tokens at each timestep for both partial-keyword and scene-attribute levels. Moreover, we design a multi-hierarchical semantic-aligned cross-attention strategy, enabling more effective aggregation of spatial and temporal tokens across different semantic hierarchies. Experiments show that RefAtomNet++ establishes new state-of-the-art results. The dataset and code are released at https://github.com/KPeng9510/refAVA2.

[94] Enhancing Rotated Object Detection via Anisotropic Gaussian Bounding Box and Bhattacharyya Distance

Chien Thai,Mai Xuan Trang,Huong Ninh,Hoang Hiep Ly,Anh Son Le

Main category: cs.CV

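TL;DR: The paper proposes an improved loss function that uses a Gaussian bounding box representation and the Bhattacharyya distance to improve the accuracy and robustness of rotated object detection.

Details Motivation: Traditional object detection methods perform poorly on rotated objects, mainly because they cannot capture orientation variations effectively, so a more accurate and robust approach is needed.

Contribution: 1. A loss function based on an anisotropic Gaussian bounding box and the Bhattacharyya distance; 2. a remedy for the problems caused by isotropic variance in square-like objects.

Method: The method uses an anisotropic Gaussian representation and an improved rotation-invariant loss function, integrated into existing deep learning-based rotated object detectors.

Result: Experiments show significant improvements in mean Average Precision (mAP) over existing methods, demonstrating the method's effectiveness.

Insight: Combining an anisotropic Gaussian representation with the Bhattacharyya distance offers a new approach to rotated object detection.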

Abstract: Detecting rotated objects accurately and efficiently is a significant challenge in computer vision, particularly in applications such as aerial imagery, remote sensing, and autonomous driving. Although traditional object detection frameworks are effective for axis-aligned objects, they often underperform in scenarios involving rotated objects due to their limitations in capturing orientation variations. This paper introduces an improved loss function aimed at enhancing detection accuracy and robustness by leveraging the Gaussian bounding box representation and Bhattacharyya distance. In addition, we advocate for the use of an anisotropic Gaussian representation to address the issues associated with isotropic variance in square-like objects. Our proposed method addresses these challenges by incorporating a rotation-invariant loss function that effectively captures the geometric properties of rotated objects. We integrate this proposed loss function into state-of-the-art deep learning-based rotated object detectors, and extensive experiments demonstrate significant improvements in mean Average Precision metrics compared to existing methods. The results highlight the potential of our approach to establish a new benchmark in rotated object detection, with implications for a wide range of applications requiring precise and reliable object localization irrespective of orientation.
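
The two ingredients named in the abstract can be sketched in a few lines. The Bhattacharyya distance between two Gaussians is a standard closed form; the box-to-Gaussian conversion below (mean at the box center, covariance with variances (w/2)^2 and (h/2)^2 along the box axes) is a commonly used convention and is an assumption here, not necessarily the paper's parameterization. This is also not the paper's loss function, only the underlying distance.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float, float]  # (cx, cy, w, h, angle in radians)
Cov = Tuple[float, float, float]                # symmetric 2x2 matrix: (sxx, sxy, syy)


def box_to_gaussian(box: Box) -> Tuple[Tuple[float, float], Cov]:
    """Map a rotated box to a 2D Gaussian (assumed convention, see lead-in)."""
    cx, cy, w, h, theta = box
    a, b = (w / 2) ** 2, (h / 2) ** 2  # variances along the box axes
    c, s = math.cos(theta), math.sin(theta)
    # Covariance R * diag(a, b) * R^T for the rotation matrix R(theta)
    return (cx, cy), (a * c * c + b * s * s, (a - b) * c * s, a * s * s + b * c * c)


def det(cov: Cov) -> float:
    sxx, sxy, syy = cov
    return sxx * syy - sxy * sxy


def bhattacharyya(box1: Box, box2: Box) -> float:
    """Bhattacharyya distance between the Gaussians of two rotated boxes."""
    (x1, y1), c1 = box_to_gaussian(box1)
    (x2, y2), c2 = box_to_gaussian(box2)
    sxx, sxy, syy = ((p + q) / 2 for p, q in zip(c1, c2))
    avg_det = sxx * syy - sxy * sxy
    dx, dy = x1 - x2, y1 - y2
    # Mahalanobis term using the inverse of the averaged covariance
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / avg_det
    return maha / 8 + 0.5 * math.log(avg_det / math.sqrt(det(c1) * det(c2)))
```

Under this convention a square box (w equal to h) maps to an isotropic covariance, so the distance between a square and a rotated copy of itself is zero. That appears to be the kind of square-like ambiguity the abstract refers to, although the abstract does not spell out how the anisotropic representation resolves it.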

[95] NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation

Peiran Xu,Xicheng Gong,Yadong MU

Main category: cs.CV

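TL;DR: The paper proposes NavQ, a Q-learning-based method for Vision-and-Language Navigation (VLN). It learns general knowledge about indoor scene layout and object relations and generates Q-features that guide foresighted navigation decisions.

Details Motivation: Existing VLN methods mostly make decisions from historical information and overlook the future implications and long-term outcomes of actions.

Contribution: 1) A Q-model trained on large-scale unlabeled trajectory data to learn general scene knowledge; 2) a cross-modal future encoder that combines the task-agnostic Q-feature with navigation instructions to produce action scores; 3) an A*-style search strategy that improves the navigation path.

Method: 1) Train a Q-model with Q-learning to generate Q-features; 2) integrate the Q-features with instructions through a cross-modal encoder; 3) combine history-based and future-based scores in an A*-style search.

Result: The method's effectiveness is validated on widely used goal-oriented VLN datasets.

Insight: Learning general scene knowledge and making foresighted decisions can substantially improve navigation performance.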

Abstract: In this work we concentrate on the task of goal-oriented Vision-and-Language Navigation (VLN). Existing methods often make decisions based on historical information, overlooking the future implications and long-term outcomes of the actions. In contrast, we aim to develop a foresighted agent. Specifically, we draw upon Q-learning to train a Q-model using large-scale unlabeled trajectory data, in order to learn the general knowledge regarding the layout and object relations within indoor scenes. This model can generate a Q-feature, analogous to the Q-value in traditional Q-network, for each candidate action, which describes the potential future information that may be observed after taking the specific action. Subsequently, a cross-modal future encoder integrates the task-agnostic Q-feature with navigation instructions to produce a set of action scores reflecting future prospects. These scores, when combined with the original scores based on history, facilitate an A*-style searching strategy to effectively explore the regions that are more likely to lead to the destination. Extensive experiments conducted on widely used goal-oriented VLN datasets validate the effectiveness of the proposed method.
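
The idea of combining history-based and future-based scores in an A*-style search can be sketched as a best-first search. This is an illustrative toy, not the NavQ implementation: the graph, the two score tables, the additive combination, and the `weight` parameter are all assumptions.

```python
import heapq

def best_first_navigate(graph, start, goal, history_score, future_score,
                        weight=1.0, max_steps=50):
    """Expand the frontier node with the highest combined score first.

    Combined score (assumed form): history_score + weight * future_score.
    Returns (path, expanded_nodes), or (None, expanded_nodes) on failure.
    """
    def combined(node):
        return history_score[node] + weight * future_score[node]

    # heapq is a min-heap, so scores are negated to pop the best node first.
    frontier = [(-combined(start), start, [start])]
    visited, expanded = set(), []
    while frontier and len(expanded) < max_steps:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, expanded
        if node in visited:
            continue
        visited.add(node)
        expanded.append(node)
        for nxt in graph[node]:
            if nxt not in visited:
                heapq.heappush(frontier, (-combined(nxt), nxt, path + [nxt]))
    return None, expanded

# Toy scene: history favors B, but the future score favors C, which leads to G.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["G"], "D": [], "G": []}
history = {"A": 0.0, "B": 0.6, "C": 0.4, "D": 0.2, "G": 0.5}
future = {"A": 0.0, "B": 0.1, "C": 0.9, "D": 0.0, "G": 1.0}

path_fs, expanded_fs = best_first_navigate(graph, "A", "G", history, future, weight=1.0)
path_h, expanded_h = best_first_navigate(graph, "A", "G", history, future, weight=0.0)
```

In this toy, the foresighted variant reaches the goal after expanding fewer nodes than the history-only variant (`weight=0.0`), which mirrors the motivation in the abstract.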

[96] HGC-Avatar: Hierarchical Gaussian Compression for Streamable Dynamic 3D Avatars

Haocheng Tang,Ruoke Yan,Xinhui Yin,Qi Zhang,Xinfeng Zhang,Siwei Ma,Wen Gao,Chuanmin Jia

Main category: cs.CV

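TL;DR: HGC-Avatar proposes a hierarchical Gaussian compression framework for efficient transmission and high-quality rendering of dynamic 3D avatars, addressing the limitations of existing 3DGS methods in streaming and facial realism.

Details Motivation: Existing compression methods based on general 3D Gaussian Splatting (3DGS) representations lack human priors when encoding and transmitting digital humans, which leads to suboptimal bitrate efficiency and reconstruction quality and limits their use in streamable 3D avatar systems.

Contribution: Proposes the HGC-Avatar framework, which uses a hierarchical design and human-model priors to achieve efficient compression and high-quality rendering, with support for layer-wise compression, progressive decoding, and controllable rendering.

Method: Disentangles the Gaussian representation into a structural layer (a StyleUNet-based generator that maps poses to Gaussians) and a motion layer (the SMPL-X model, which compactly represents temporal pose variations), and adds a facial attention mechanism during StyleUNet training.

Result: Experiments show that HGC-Avatar significantly outperforms prior methods in visual quality and compression efficiency and supports rapid rendering of streamable 3D avatars.

Insight: Combining human-model priors with a hierarchical design can markedly improve the transmission of dynamic 3D content; preserving facial realism at low bitrates is key.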

Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled fast, photorealistic rendering of dynamic 3D scenes, showing strong potential in immersive communication. However, in digital human encoding and transmission, the compression methods based on general 3DGS representations are limited by the lack of human priors, resulting in suboptimal bitrate efficiency and reconstruction quality at the decoder side, which hinders their application in streamable 3D avatar systems. We propose HGC-Avatar, a novel Hierarchical Gaussian Compression framework designed for efficient transmission and high-quality rendering of dynamic avatars. Our method disentangles the Gaussian representation into a structural layer, which maps poses to Gaussians via a StyleUNet-based generator, and a motion layer, which leverages the SMPL-X model to represent temporal pose variations compactly and semantically. This hierarchical design supports layer-wise compression, progressive decoding, and controllable rendering from diverse pose inputs such as video sequences or text. Since people are most concerned with facial realism, we incorporate a facial attention mechanism during StyleUNet training to preserve identity and expression details under low-bitrate constraints. Experimental results demonstrate that HGC-Avatar provides a streamable solution for rapid 3D avatar rendering, while significantly outperforming prior methods in both visual quality and compression efficiency.

[97] PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

Lukas Selch,Yufang Hou,M. Jehanzeb Mirza,Sivan Doveh,James Glass,Rogerio Feris,Wei Lin

Main category: cs.CV

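TL;DR: PRISMM-Bench is the first benchmark of multimodal inconsistencies in scientific papers that is grounded in real reviewer-flagged issues. It contains 262 inconsistencies and evaluates large multimodal models on inconsistency identification, remedy, and pair matching.

Details Motivation: Existing benchmarks overlook the real-world complexity of multimodal inconsistencies; PRISMM-Bench fills this gap to support progress towards trustworthy scientific assistants.

Contribution: 1. The first multimodal inconsistency benchmark grounded in real peer-review data; 2. Structured JSON-based answer representations that reduce linguistic bias; 3. An evaluation of 21 leading large multimodal models that reveals weaknesses in their scientific reasoning.

Method: A pipeline of review mining, LLM-assisted filtering, and human verification yields 262 inconsistencies from 242 papers, on which three tasks are built: inconsistency identification, remedy, and pair matching.

Result: Leading models perform poorly on PRISMM-Bench (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning.

Insight: Detecting multimodal inconsistencies is central to the trustworthiness of scientific assistants, and current models still fall short; structured answer representations effectively reduce bias in evaluation.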

Abstract: Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering, and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy, and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. To address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we further introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.

[98] OOS-DSD: Improving Out-of-stock Detection in Retail Images using Auxiliary Tasks

Franko Šikić,Sven Lončarić

Main category: cs.CV

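TL;DR: OOS-DSD is a novel deep learning method that improves out-of-stock detection in retail images through auxiliary tasks, combining product segmentation and scene depth estimation, and surpasses the best existing methods by 1.8% mAP.

Details Motivation: Out-of-stock (OOS) detection is a key task in retail verification. Existing methods have limited performance, and auxiliary learning can improve detection.

Contribution: 1) Proposes OOS-DSD, which extends the YOLOv8 architecture with product segmentation and depth estimation; 2) proposes a depth normalization procedure that stabilizes training; 3) outperforms SOTA methods in experiments.

Method: Extends YOLOv8 with additional convolutional branches for OOS detection, product segmentation, and depth estimation. The depth branch is trained on pseudo-labeled data, and a depth normalization procedure is proposed.

Result: mAP improves by 1.8% over SOTA; in ablations, auxiliary learning and depth normalization raise mAP by 3.7% and 4.2%, respectively.

Insight: Auxiliary tasks such as depth estimation can significantly improve the main task, and handling pseudo-labeled data properly is key.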

Abstract: Out-of-stock (OOS) detection is a very important retail verification process that aims to infer the unavailability of products in their designated areas on the shelf. In this paper, we introduce OOS-DSD, a novel deep learning-based method that advances OOS detection through auxiliary learning. In particular, we extend a well-established YOLOv8 object detection architecture with additional convolutional branches to simultaneously detect OOS, segment products, and estimate scene depth. While OOS detection and product segmentation branches are trained using ground truth data, the depth estimation branch is trained using pseudo-labeled annotations produced by the state-of-the-art (SOTA) depth estimation model Depth Anything V2. Furthermore, since the aforementioned pseudo-labeled depth estimates display relative depth, we propose an appropriate depth normalization procedure that stabilizes the training process. The experimental results show that the proposed method surpassed the performance of the SOTA OOS detection methods by 1.8% of the mean average precision (mAP). In addition, ablation studies confirm the effectiveness of auxiliary learning and the proposed depth normalization procedure, with the former increasing mAP by 3.7% and the latter by 4.2%.

[99] Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions

Jihoon Kwon,Kyle Min,Jy-yong Sohn

Main category: cs.CV

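TL;DR: READ enhances CLIP's compositional reasoning by adding reconstruction and alignment objectives, yielding significant performance gains.

Details Motivation: Models such as CLIP perform poorly on compositional reasoning because the text encoder tends to focus on individual words rather than the relations between them. READ aims to address this.

Contribution: Proposes READ, which enhances CLIP's compositional reasoning through reconstruction and alignment objectives and achieves SOTA on multiple benchmarks.

Method: READ adds two auxiliary objectives: 1) token-level reconstruction, in which a frozen pre-trained decoder reconstructs alternative captions from the embedding of the original caption; 2) sentence-level alignment, which keeps the embeddings of paraphrased sentences consistent.

Result: READ-CLIP performs best on five compositional reasoning benchmarks, outperforming the strongest conventional fine-tuning baseline by up to 4.1%.

Insight: The reconstruction objective helps capture relations between words, while the alignment objective ensures semantic consistency; the two are complementary.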

Abstract: Despite recent advances, vision-language models trained with standard contrastive objectives still struggle with compositional reasoning, the ability to understand structured relationships between visual and linguistic elements. This shortcoming is largely due to the tendency of the text encoder to focus on individual words rather than their relations, a limitation reinforced by contrastive training that primarily aligns words with visual objects. In this paper, we introduce REconstruction and Alignment of text Descriptions (READ), a fine-tuning method designed to enhance compositional reasoning by adding two auxiliary objectives to the contrastive learning: (1) a token-level reconstruction objective, where a frozen pre-trained decoder reconstructs alternative captions based on the embedding of the original caption; and (2) a sentence-level alignment objective, which explicitly aligns paraphrased sentences in the embedding space. We show that READ-CLIP, a model derived by applying the READ method to the pre-trained CLIP model, achieves state-of-the-art performance across five major compositional reasoning benchmarks, outperforming the strongest conventional fine-tuning baseline by up to 4.1%. Furthermore, applying READ to existing CLIP variants (including NegCLIP and FSC-CLIP) also improves performance on these benchmarks. Quantitative and qualitative analyses reveal that our proposed objectives, reconstruction and alignment, offer complementary benefits: the former encourages the encoder to capture relationships between words within a caption, while the latter ensures consistent representations for paraphrases expressed with different wording.
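
As a rough illustration of what a sentence-level alignment objective can look like, the sketch below penalizes the cosine distance between a caption embedding and its paraphrase embedding. The abstract does not state the exact loss READ uses, so the cosine-distance form and the function names here are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(pairs):
    """Mean cosine distance over (caption_embedding, paraphrase_embedding) pairs.

    Assumed loss form: 1 - cos(u, v); zero when the embeddings point the same way.
    """
    return sum(1.0 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings: one aligned pair and one orthogonal pair.
aligned = ([1.0, 0.0], [2.0, 0.0])
orthogonal = ([1.0, 0.0], [0.0, 3.0])
print(alignment_loss([aligned]), alignment_loss([aligned, orthogonal]))
```

Minimizing such a term pulls paraphrases together in embedding space regardless of wording, which is the behavior the abstract describes for the alignment objective.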

[100] Fit for Purpose? Deepfake Detection in the Real World

Guangyu Lin,Li Lin,Christina P. Walker,Daniel S. Schiff,Shu Hu

Main category: cs.CV

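TL;DR: The paper presents the first systematic benchmark based on real-world political deepfake incidents (the Political Deepfakes Incident Database) and evaluates state-of-the-art deepfake detectors from academia, government, and industry. Existing detectors generalize poorly to real political deepfakes, and the authors call for politically contextualized detection frameworks.

Details Motivation: AI-generated content such as deepfakes is spreading rapidly, especially as political misinformation, yet most existing detectors are trained on synthetic, laboratory-controlled data and struggle with real-world political deepfakes.

Contribution: 1. The first systematic benchmark built on a database of real-world political deepfake incidents; 2. A systematic evaluation of state-of-the-art deepfake detectors from academia, government, and industry.

Method: 1. Use the Political Deepfakes Incident Database as a benchmark of real data; 2. Evaluate multiple deepfake detectors from academia, government, and industry across different domains.

Result: Existing detectors perform poorly on real political deepfakes and are vulnerable to simple manipulations, especially in the video domain; paid tools outperform free ones but still fail to generalize effectively.

Insight: Politically contextualized deepfake detection frameworks are needed to meet real-world challenges.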

Abstract: The rapid proliferation of AI-generated content, driven by advances in generative adversarial networks, diffusion models, and multimodal large language models, has made the creation and dissemination of synthetic media effortless, heightening the risks of misinformation, particularly political deepfakes that distort truth and undermine trust in political institutions. In turn, governments, research institutions, and industry have strongly promoted deepfake detection initiatives as solutions. Yet, most existing models are trained and validated on synthetic, laboratory-controlled datasets, limiting their generalizability to the kinds of real-world political deepfakes circulating on social platforms that affect the public. In this work, we introduce the first systematic benchmark based on the Political Deepfakes Incident Database, a curated collection of real-world political deepfakes shared on social media since 2018. Our study includes a systematic evaluation of state-of-the-art deepfake detectors across academia, government, and industry. We find that the detectors from academia and government perform relatively poorly. While paid detection tools achieve relatively higher performance than free-access models, all evaluated detectors struggle to generalize effectively to authentic political deepfakes, and are vulnerable to simple manipulations, especially in the video domain. The results underscore the need for politically contextualized deepfake detection frameworks to better safeguard the public in real-world settings.

[101] SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense

Yiyang Huang,Liang Shi,Yitian Zhang,Yi Xu,Yun Fu

Main category: cs.CV

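TL;DR: SHIELD is a training-free framework that tackles object hallucination in LVLMs with three strategies: (1) re-weighting visual tokens to reduce statistical bias, (2) introducing noise-derived tokens to counter inherent bias, and (3) applying adversarial attacks with contrastive decoding to address vulnerability. Experiments show that it is effective and broadly applicable.

Details Motivation: Large Vision-Language Models (LVLMs) excel at cross-modal tasks, but object hallucination (producing plausible but inaccurate object descriptions) remains a significant problem. This paper is the first to trace LVLM hallucinations to the visual encoder, identifying three core issues: statistical bias, inherent bias, and vulnerability.

Contribution: 1. First to attribute object hallucination in LVLMs to the visual encoder; 2. Proposes SHIELD, a training-free framework with three strategies targeting statistical bias, inherent bias, and vulnerability; 3. Validates its effectiveness across diverse benchmarks and LVLM families.

Method: SHIELD uses three strategies: (1) re-weighting visual tokens to reduce statistical bias; (2) introducing noise-derived tokens to suppress inherent bias; (3) applying adversarial attacks with contrastive decoding to improve robustness.

Result: Experiments show that SHIELD substantially reduces object hallucination in LVLMs and performs strongly on a general LVLM benchmark, demonstrating broad applicability.

Insight: The visual encoder is a major source of object hallucination in LVLMs; statistical bias, inherent bias, and vulnerability are the key underlying issues; a training-free framework can mitigate hallucination without hurting general model performance.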

Abstract: Large Vision-Language Models (LVLMs) excel in diverse cross-modal tasks. However, object hallucination, where models produce plausible but inaccurate object descriptions, remains a significant challenge. In contrast to previous work focusing on LLM components, this paper is the first to trace LVLM hallucinations to visual encoders and identifies three key issues: statistical bias, inherent bias, and vulnerability. To address these challenges, we propose SHIELD, a training-free framework that mitigates hallucinations through three strategies: re-weighting visual tokens to reduce statistical bias, introducing noise-derived tokens to counter inherent bias, and applying adversarial attacks with contrastive decoding to address vulnerability. Experiments demonstrate that SHIELD effectively mitigates object hallucinations across diverse benchmarks and LVLM families. Moreover, SHIELD achieves strong performance on the general LVLM benchmark, highlighting its broad applicability. Code will be released.

[102] VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs

Jiaying Zhu,Yurui Zhu,Xin Lu,Wenrui Yan,Dong Li,Kunlin Liu,Xueyang Fu,Zheng-Jun Zha

Main category: cs.CV

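TL;DR: VisionSelector is an end-to-end learnable visual token compression framework. Using a lightweight scorer module and a Top-K mechanism, it addresses the excess of visual tokens in multimodal LLMs and markedly improves performance and efficiency.

Details Motivation: Multimodal large language models (MLLMs) generate huge numbers of visual tokens for high-resolution or multi-image inputs, creating computational and memory bottlenecks. Existing compression methods rely on heuristic rules that may discard critical information or introduce bias.

Contribution: Proposes VisionSelector, an end-to-end learnable token compression framework whose lightweight scorer module and curriculum annealing strategy adaptively select critical tokens at arbitrary compression rates.

Method: Designs a differentiable Top-K mechanism and a curriculum annealing strategy to train a scorer module that adaptively selects tokens, without modifying the MLLM backbone.

Result: Preserves 100% accuracy on MME at a 30% retention budget, outperforms prior methods by 12.14% at a 10% retention budget, and doubles prefill speed.

Insight: Token compression can be achieved with lightweight end-to-end learning, adaptively selecting critical tokens while markedly improving efficiency and performance.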

Abstract: Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks from the massive number of visual tokens generated by high-resolution images or multi-image inputs. Previous token compression techniques are often constrained by heuristic rules that risk discarding critical information. They may also suffer from biases, such as attention sinks, that lead to sharp performance drops under aggressive compression ratios. To address these limitations, we reformulate token compression as an end-to-end learnable decision process within a lightweight plug-and-play framework. Specifically, we propose VisionSelector, a scorer module decoupled from the MLLM backbone that incorporates a differentiable Top-K mechanism and a curriculum annealing strategy to bridge the training-inference gap, enabling efficient and adaptive token selection at arbitrary compression rates. Remarkably lightweight with only 12.85M trainable parameters, VisionSelector generalizes across various compression rates and adaptively identifies critical tokens. This leads to superior performance across all compression budgets, evidenced by preserving 100% accuracy on MME with a 30% retention budget, outperforming prior methods by 12.14% at a 10% retention budget, and doubling prefill speed. Our code is available at https://github.com/JulietChoo/VisionSelector .
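
The inference-time side of score-based token selection can be sketched in a few lines. This is a plain hard Top-K over precomputed scores under a retention budget. It is not the paper's differentiable Top-K or its scorer module, and `select_tokens` is a hypothetical name.

```python
def select_tokens(scores, retention):
    """Return indices of the top-scoring tokens, kept in their original order.

    `retention` is the fraction of tokens to keep (for example 0.3 for 30%).
    At least one token is always kept.
    """
    k = max(1, round(len(scores) * retention))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore positional order so downstream layers see tokens in sequence.
    return sorted(top)

# Ten hypothetical importance scores, one per visual token.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6, 0.4, 0.5]
keep_30 = select_tokens(scores, 0.3)
keep_10 = select_tokens(scores, 0.1)
print(keep_30, keep_10)
```

A hard selection like this has no gradient with respect to the scores, which is presumably why the paper trains with a differentiable relaxation and anneals towards the hard choice used at inference.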

[103] Self-Supervised Learning to Fly using Efficient Semantic Segmentation and Metric Depth Estimation for Low-Cost Autonomous UAVs

Sebastian Mocanu,Emil Slusanschi,Marius Leordeanu

Main category: cs.CV

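TL;DR: The paper presents a vision-only autonomous flight system for small UAVs that combines semantic segmentation with monocular depth estimation and uses an adaptive scale factor algorithm to obtain accurate metric distances. The system uses a knowledge distillation framework with a lightweight U-Net and performs well indoors, with a 100% mission success rate.

Details Motivation: Autonomous flight systems for low-cost UAVs usually depend on GPS or expensive sensors such as LiDAR, which limits their use indoors or in GPS-denied environments. The paper aims for a low-cost, efficient autonomous flight solution based on vision alone.

Contribution: 1. An adaptive scale factor algorithm that converts non-metric monocular depth predictions into accurate metric distances; 2. A combination of semantic segmentation and depth estimation for obstacle avoidance and scene exploration; 3. A lightweight U-Net (1.6M parameters) capable of real-time semantic segmentation.

Method: 1. A knowledge distillation framework in which an SVM teacher generates training data for a lightweight U-Net student network; 2. Navigation that combines semantic segmentation with monocular depth estimation; 3. Further optimization of the flight policy through end-to-end learning.

Result: Tested in a 5x4 meter laboratory environment across 30 real-world flights and 100 digital-twin flights, the system achieves a mean distance error of 14.4 cm and a 100% mission success rate; the end-to-end learned policy reaches 87.5%.

Insight: Vision alone can enable efficient autonomous flight on resource-constrained platforms; combining semantic segmentation with depth estimation is an effective approach to low-cost UAV navigation.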

Abstract: This paper presents a vision-only autonomous flight system for small UAVs operating in controlled indoor environments. The system combines semantic segmentation with monocular depth estimation to enable obstacle avoidance, scene exploration, and autonomous safe landing operations without requiring GPS or expensive sensors such as LiDAR. A key innovation is an adaptive scale factor algorithm that converts non-metric monocular depth predictions into accurate metric distance measurements by leveraging semantic ground plane detection and camera intrinsic parameters, achieving a mean distance error of 14.4 cm. The approach uses a knowledge distillation framework where a color-based Support Vector Machine (SVM) teacher generates training data for a lightweight U-Net student network (1.6M parameters) capable of real-time semantic segmentation. For more complex environments, the SVM teacher can be replaced with a state-of-the-art segmentation model. Testing was conducted in a controlled 5x4 meter laboratory environment with eight cardboard obstacles simulating urban structures. Extensive validation across 30 flight tests in a real-world environment and 100 flight tests in a digital-twin environment demonstrates that the combined segmentation and depth approach increases the distance traveled during surveillance and reduces mission time while maintaining 100% success rates. The system is further optimized through end-to-end learning, where a compact student neural network learns complete flight policies from demonstration data generated by our best-performing method, achieving an 87.5% autonomous mission success rate. This work advances practical vision-based drone navigation in structured environments, demonstrating solutions for metric depth estimation and computational efficiency challenges that enable deployment on resource-constrained platforms.
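
One way to turn relative depth into metric depth from a detected ground plane and camera intrinsics is sketched below. It is a simplified illustration, not the paper's adaptive scale factor algorithm. It assumes a level pinhole camera at a known height over flat ground, and a relative depth map that differs from metric depth by a single scale factor. All names are hypothetical.

```python
from statistics import median

def ground_depth(v, fy, cy, cam_height):
    """Metric depth of a flat-ground pixel at image row v (level pinhole camera)."""
    return fy * cam_height / (v - cy)

def estimate_scale(rel_depth, ground_mask, fy, cy, cam_height):
    """Median ratio of geometric ground depth to predicted relative depth."""
    ratios = []
    for v, (depth_row, mask_row) in enumerate(zip(rel_depth, ground_mask)):
        if v <= cy:
            continue  # rows at or above the horizon cannot show flat ground
        for d, is_ground in zip(depth_row, mask_row):
            if is_ground and d > 0:
                ratios.append(ground_depth(v, fy, cy, cam_height) / d)
    return median(ratios)

# Synthetic 6x2 example: the bottom three rows are ground, true scale is 4.
fy, cy, cam_height = 100.0, 2.0, 1.0
true_scale = 4.0
rel_depth, ground_mask = [], []
for v in range(6):
    if v > cy:
        d = ground_depth(v, fy, cy, cam_height) / true_scale
        rel_depth.append([d, d])
        ground_mask.append([True, True])
    else:
        rel_depth.append([5.0, 5.0])  # arbitrary non-ground values
        ground_mask.append([False, False])

scale = estimate_scale(rel_depth, ground_mask, fy, cy, cam_height)
print(scale)  # multiply any relative depth by this to get meters
```

The median makes the estimate tolerant of a few mislabeled ground pixels. A real system would also need to handle camera pitch and depth models that predict affine-invariant inverse depth.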

[104] MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models

Young-Jun Lee,Byung-Kwan Lee,Jianshu Zhang,Yechan Hwang,Byungsoo Ko,Han-Gyu Kim,Dongyu Yao,Xuankun Rong,Eojin Joo,Seung-Ho Han,Bowon Ko,Ho-Jin Choi

Main category: cs.CV

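TL;DR: The paper introduces MultiVerse, a new multi-turn conversation benchmark for evaluating vision-and-language models (VLMs). It covers a wide range of conversational scenarios and task types and proposes a checklist-based automated evaluation method.

Details Motivation: Existing multi-turn dialogue datasets fail to capture the full complexity of conversations that users have in real scenarios, so a new benchmark is needed to measure VLMs' multi-turn abilities.

Contribution: Introduces MultiVerse, a dataset of 647 dialogues (averaging four turns each) derived from 12 VLM evaluation benchmarks, and proposes an automated evaluation method based on GPT-4o that measures 37 key aspects.

Method: Builds the multi-turn dialogue dataset from 12 existing evaluation benchmarks and proposes a checklist-based evaluation method that uses GPT-4o as the automated evaluator.

Result: Across 18 evaluated VLMs, even the strongest models (e.g., GPT-4o) achieve only a 50% success rate in multi-turn conversations, showing how challenging the dataset is.

Insight: Providing the full dialogue context significantly improves the performance of smaller or weaker models, underscoring the importance of in-context learning.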

Abstract: Vision-and-Language Models (VLMs) have shown impressive capabilities on single-turn benchmarks, yet real-world applications often demand more intricate multi-turn dialogues. Existing multi-turn datasets (e.g., MMDU, ConvBench) only partially capture the breadth and depth of conversational scenarios encountered by users. In this work, we introduce MultiVerse, a novel multi-turn conversation benchmark featuring 647 dialogues, each averaging four turns, derived from a diverse set of 12 popular VLM evaluation benchmarks. With 484 tasks and 484 interaction goals, MultiVerse covers a wide range of topics, from factual knowledge and perception to advanced reasoning tasks such as mathematics and coding. To facilitate robust assessment, we propose a checklist-based evaluation method that leverages GPT-4o as the automated evaluator, measuring performance across 37 key aspects, including perceptual accuracy, linguistic clarity, and factual correctness. We evaluate 18 VLMs on MultiVerse, revealing that even the strongest models (e.g., GPT-4o) achieve only a 50% success rate in complex multi-turn conversations, highlighting the dataset's challenging nature. Notably, we find that providing full dialogue context significantly enhances performance for smaller or weaker models, emphasizing the importance of in-context learning. We believe MultiVerse provides a landscape for evaluating the multi-turn interaction abilities of VLMs.
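
To show how checklist verdicts might roll up into scores, here is a small sketch. It assumes the evaluator's pass/fail verdicts are already available as booleans (no model call is made), and it assumes a dialogue counts as a success only when every checklist item passes. The abstract does not define the success criterion, so both the data layout and that rule are illustrative.

```python
def dialogue_score(verdicts):
    """Fraction of checklist items passed across all turns of one dialogue."""
    items = [ok for turn in verdicts for ok in turn]
    return sum(items) / len(items)

def success_rate(dialogues):
    """Share of dialogues in which every checklist item passed (assumed criterion)."""
    successes = sum(all(ok for turn in d for ok in turn) for d in dialogues)
    return successes / len(dialogues)

# Each dialogue is a list of turns; each turn is a list of checklist verdicts.
dialogue_a = [[True, True], [True], [True, True, True]]
dialogue_b = [[True, False], [True], [True, True]]
print(dialogue_score(dialogue_b), success_rate([dialogue_a, dialogue_b]))
```

Separating the per-item score from the all-or-nothing success rate makes it visible when a model is mostly right but still fails a conversation on a single turn.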

[105] Structured Interfaces for Automated Reasoning with 3D Scene Graphs

Aaron Ray,Jacob Arkin,Harel Biggie,Chuchu Fan,Luca Carlone,Nicholas Roy

Main category: cs.CV

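TL;DR: The paper proposes a retrieval-augmented generation (RAG) approach that uses a graph database query language (Cypher) as the interface between large language models (LLMs) and 3D scene graphs (3DSGs). This addresses the scaling problem of combining the two, significantly improving grounded language task performance while reducing computational cost.

Details Motivation: Existing methods encode the 3D scene graph as text, which does not scale to large or rich scenes; the paper proposes a more efficient interface.

Contribution: Proposes using the Cypher query language as the interface through which an LLM interacts with a 3DSG, significantly improving performance and scalability on language tasks.

Method: Uses retrieval-augmented generation: the 3DSG is stored in a graph database, and the LLM is given a Cypher query tool to retrieve the relevant information.

Result: Experiments show that the method significantly outperforms baselines on instruction following and scene question answering, at lower computational cost.

Insight: A structured query interface such as Cypher handles the interaction between LLMs and complex 3D scene graphs efficiently and suggests a direction for future multimodal interaction.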

Abstract: In order to provide a robot with the ability to understand and react to a user’s natural language inputs, the natural language must be connected to the robot’s underlying representations of the world. Recently, large language models (LLMs) and 3D scene graphs (3DSGs) have become a popular choice for grounding natural language and representing the world. In this work, we address the challenge of using LLMs with 3DSGs to ground natural language. Existing methods encode the scene graph as serialized text within the LLM’s context window, but this encoding does not scale to large or rich 3DSGs. Instead, we propose to use a form of Retrieval Augmented Generation to select a subset of the 3DSG relevant to the task. We encode a 3DSG in a graph database and provide a query language interface (Cypher) as a tool to the LLM with which it can retrieve relevant data for language grounding. We evaluate our approach on instruction following and scene question-answering tasks and compare against baseline context window and code generation methods. Our results show that using Cypher as an interface to 3D scene graphs scales significantly better to large, rich graphs on both local and cloud-based models. This leads to large performance improvements in grounded language tasks while also substantially reducing the token count of the scene graph content. A video supplement is available at https://www.youtube.com/watch?v=zY_YI9giZSA.
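
The contrast between serializing a whole scene graph into the context window and retrieving only the relevant part can be sketched without a database. The schema, the Cypher string, and the toy scene below are hypothetical. The `retrieve` function is a pure-Python stand-in for running such a query, not the authors' tooling.

```python
# Hypothetical schema: (:Object {name, cls})-[:IN]->(:Room {name}).
# A query of this shape is what an LLM tool call might emit.
CYPHER_QUERY = (
    "MATCH (o:Object)-[:IN]->(r:Room {name: $room}) "
    "WHERE o.cls = $cls RETURN o.name"
)

# Toy scene graph flattened to one record per object.
SCENE = [
    {"name": "mug_1", "cls": "mug", "room": "kitchen"},
    {"name": "mug_2", "cls": "mug", "room": "kitchen"},
    {"name": "kettle_1", "cls": "kettle", "room": "kitchen"},
    {"name": "mug_3", "cls": "mug", "room": "office"},
    {"name": "chair_1", "cls": "chair", "room": "office"},
    {"name": "sofa_1", "cls": "sofa", "room": "lounge"},
]

def serialize_all(scene):
    """Baseline: dump every object into the prompt as text."""
    return "; ".join(f"{o['name']} ({o['cls']}) in {o['room']}" for o in scene)

def retrieve(scene, room, cls):
    """Stand-in for executing CYPHER_QUERY with parameters $room and $cls."""
    return [o["name"] for o in scene if o["room"] == room and o["cls"] == cls]

full_context = serialize_all(SCENE)
retrieved = retrieve(SCENE, room="kitchen", cls="mug")
print(len(full_context), len(", ".join(retrieved)))
```

The serialized baseline grows with the size of the scene, while the retrieved answer grows only with the number of matching objects. That is the scaling argument the abstract makes for a query interface.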

[106] Universal and Transferable Attacks on Pathology Foundation Models

Yuntian Wang,Xilin Yang,Che-Yung Shen,Nir Pillar,Aydogan Ozcan

Main category: cs.CV

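TL;DR: The paper introduces Universal and Transferable Adversarial Perturbations (UTAP) for pathology foundation models, revealing critical vulnerabilities in these models. With a fixed, weak noise pattern, UTAP degrades model performance across multiple tasks and data distributions.

Details Motivation: Pathology foundation models are widely used in medicine, but their robustness has not been thoroughly evaluated. This study aims to expose their potential security risks and provide an evaluation benchmark.

Contribution: 1. Proposes UTAP and demonstrates its universality and transferability; 2. Reveals the vulnerability of pathology foundation models; 3. Provides a high-standard benchmark for evaluating model robustness.

Method: Uses deep learning to optimize a fixed noise pattern, UTAP, that is added to pathology images to disrupt the models' feature representations. Experiments cover multiple pathology foundation models and datasets.

Result: UTAP significantly degrades model performance across tasks and on unseen data, with visually imperceptible noise.

Insight: UTAP is not limited to a specific model or dataset; it poses a broad threat to emerging pathology foundation models and their applications, which underscores the urgency of research on defense mechanisms.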

Abstract: We introduce Universal and Transferable Adversarial Perturbations (UTAP) for pathology foundation models that reveal critical vulnerabilities in their capabilities. Optimized using deep learning, UTAP comprises a fixed and weak noise pattern that, when added to a pathology image, systematically disrupts the feature representation capabilities of multiple pathology foundation models. Therefore, UTAP induces performance drops in downstream tasks that utilize foundation models, including misclassification across a wide range of unseen data distributions. In addition to compromising the model performance, we demonstrate two key features of UTAP: (1) universality: its perturbation can be applied across diverse field-of-views independent of the dataset that UTAP was developed on, and (2) transferability: its perturbation can successfully degrade the performance of various external, black-box pathology foundation models - never seen before. These two features indicate that UTAP is not a dedicated attack associated with a specific foundation model or image dataset, but rather constitutes a broad threat to various emerging pathology foundation models and their applications. We systematically evaluated UTAP across various state-of-the-art pathology foundation models on multiple datasets, causing a significant drop in their performance with visually imperceptible modifications to the input images using a fixed noise pattern. The development of these potent attacks establishes a critical, high-standard benchmark for model robustness evaluation, highlighting a need for advancing defense mechanisms and potentially providing the necessary assets for adversarial training to ensure the safe and reliable deployment of AI in pathology.
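
What makes a perturbation "universal" is that one fixed pattern is added to every input. A minimal sketch of that application step follows, assuming pixel values in [0, 1] and an L-infinity budget `epsilon`. The optimization that produces the pattern is not shown, and the names and budget are illustrative rather than taken from the paper.

```python
def apply_universal_perturbation(image, delta, epsilon):
    """Add one fixed noise pattern to an image.

    The pattern is clipped to [-epsilon, epsilon] and the result to [0, 1].
    `image` and `delta` are same-sized 2D lists of floats.
    """
    perturbed = []
    for img_row, delta_row in zip(image, delta):
        perturbed.append([
            min(1.0, max(0.0, px + max(-epsilon, min(epsilon, d))))
            for px, d in zip(img_row, delta_row)
        ])
    return perturbed

# The same delta is reused for every image; only the image changes.
delta = [[-0.1, 0.1], [0.02, -0.005]]
image = [[0.0, 0.5], [0.99, 1.0]]
adversarial = apply_universal_perturbation(image, delta, epsilon=0.01)
print(adversarial)
```

Because the pattern does not depend on the input, an attacker needs no access to the target image at attack time, which is part of why the abstract treats such perturbations as a broad threat.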

[107] HYDRA: HYbrid knowledge Distillation and spectral Reconstruction Algorithm for high channel hyperspectral camera applications

Christopher Thirgood,Oscar Mendez,Erin Ling,Jon Storey,Simon Hadfield

Main category: cs.CV

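TL;DR: The paper proposes HYDRA, a hybrid knowledge distillation and spectral reconstruction method. A Teacher model encapsulates latent hyperspectral data and a Student model learns a mapping from natural images to the Teacher's encoded domain, yielding high-quality spectral reconstruction that outperforms existing methods.

Details Motivation: Previous Multi-Scale Attention (MSA) methods have only shown generalizable results for very sparse spectra, while modern hyperspectral sensors have hundreds of channels. A generalizable spectral reconstruction method is therefore needed for high-channel-count hyperspectral images (HSI).

Contribution: Proposes the HYDRA framework, which combines knowledge distillation with spectral reconstruction through a Teacher-Student architecture and a novel training method, markedly improving reconstruction quality and efficiency for high-channel HSI.

Method: A Teacher model encapsulates latent HSI data, a Student model learns mappings from natural images to the Teacher's encoded domain, and training uses a hybrid knowledge distillation method.

Result: The method achieves SOTA performance on all metrics, including an 18% boost in accuracy, and faster inference at various channel depths.

Insight: Knowledge distillation can effectively transfer the Teacher's high-level representations to the Student, improving the generalization and efficiency of spectral reconstruction.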

Abstract: Hyperspectral images (HSI) promise to support a range of new applications in computer vision. Recent research has explored the feasibility of generalizable Spectral Reconstruction (SR), the problem of recovering an HSI from a natural three-channel color image in unseen scenarios. However, previous Multi-Scale Attention (MSA) works have only demonstrated sufficiently generalizable results for very sparse spectra, while modern HSI sensors contain hundreds of channels. This paper introduces a novel approach to spectral reconstruction via our HYbrid knowledge Distillation and spectral Reconstruction Architecture (HYDRA). Using a Teacher model that encapsulates latent hyperspectral image data and a Student model that learns mappings from natural images to the Teacher's encoded domain, alongside a novel training method, we achieve high-quality spectral reconstruction. This addresses key limitations of prior SR models, providing SOTA performance across all metrics, including an 18% boost in accuracy, and faster inference times than current SOTA models at various channel depths.

[108] Pursuing Minimal Sufficiency in Spatial Reasoning

Yejie Guo,Yunzhong Hou,Wufei Ma,Meng Tang,Ming-Hsuan Yang

Main category: cs.CV

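TL;DR: The paper proposes MSSR, a dual-agent framework that addresses both insufficient and redundant information in spatial reasoning by constructing a minimal sufficient set of information, significantly improving performance.

Details Motivation: Spatial reasoning is a core challenge for Vision-Language Models (VLMs), owing to the limitations of 2D-centric pre-training and to reasoning failures caused by redundant 3D information.

Contribution: Proposes the MSSR framework, consisting of a Perception Agent and a Reasoning Agent, which focuses on extracting a Minimal Sufficient Set (MSS) of information and achieves strong spatial reasoning through closed-loop iterative refinement.

Method: 1. The Perception Agent uses a perception toolbox to extract 3D information, including a newly designed SOG (Situated Orientation Grounding) module; 2. The Reasoning Agent iteratively refines the information, pruning redundant details and requesting missing ones.

Result: Achieves state-of-the-art performance on two benchmarks while producing interpretable reasoning paths.

Insight: Pursuing a minimal sufficient set of information significantly improves accuracy, and the resulting reasoning paths are a promising source of high-quality training data.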

Abstract: Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a compact selection of 3D perception results from expert models. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A Perception Agent programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel SOG (Situated Orientation Grounding) module that robustly extracts language-grounded directions. A Reasoning Agent then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated. Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code is available at https://github.com/gyj155/mssr.

[109] SDPA++: A General Framework for Self-Supervised Denoising with Patch Aggregation

Huy Minh Nhat Nguyen,Triet Hoang Minh Dao,Chau Vinh Hoang Truong,Cuong Tuan Nguyen

Main category: cs.CV

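TL;DR: SDPA++ is a general framework for self-supervised denoising that uses pseudo-ground-truth generation and a patch aggregation strategy to improve denoising using only noisy OCT images.

Details Motivation: The intrinsic speckle noise in OCT images and the difficulty of acquiring paired data in clinical settings motivate a self-supervised approach.

Contribution: Proposes the SDPA++ framework, which generates pseudo-ground-truth images through self-fusion and self-supervised denoising and trains an ensemble of models with a patch aggregation strategy.

Method: First generates pseudo-ground-truth images, then trains denoising models with a patch-based strategy to enhance image clarity.

Result: Improvements in CNR, MSR, TP, and EP are validated on the VIP Cup dataset, showing potential for clinical use.

Insight: Self-supervised methods that need no clean reference images offer a new approach to OCT denoising and suit clinical practice.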

Abstract: Optical Coherence Tomography (OCT) is a widely used non-invasive imaging technique that provides detailed three-dimensional views of the retina, which are essential for the early and accurate diagnosis of ocular diseases. Consequently, OCT image analysis and processing have emerged as key research areas in biomedical imaging. However, acquiring paired datasets of clean and real-world noisy OCT images for supervised denoising models remains a formidable challenge due to intrinsic speckle noise and practical constraints in clinical imaging environments. To address these issues, we propose SDPA++: A General Framework for Self-Supervised Denoising with Patch Aggregation. Our novel approach leverages only noisy OCT images by first generating pseudo-ground-truth images through self-fusion and self-supervised denoising. These refined images then serve as targets to train an ensemble of denoising models using a patch-based strategy that effectively enhances image clarity. Performance improvements are validated via metrics such as Contrast-to-Noise Ratio (CNR), Mean Square Ratio (MSR), Texture Preservation (TP), and Edge Preservation (EP) on the real-world dataset from the IEEE SPS Video and Image Processing Cup. Notably, the VIP Cup dataset contains only real-world noisy OCT images without clean references, highlighting our method’s potential for improving image quality and diagnostic outcomes in clinical practice.
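
A generic form of patch aggregation is sketched below: split the image into overlapping patches, process each patch, and average the outputs where patches overlap. The abstract does not detail SDPA++'s patch-based strategy, so this is an assumption about the general technique rather than the paper's procedure. The `denoise` callable is a placeholder for any per-patch model.

```python
def aggregate_patches(image, patch, stride, denoise):
    """Denoise overlapping square patches and average the overlaps.

    Assumes (height - patch) and (width - patch) are multiples of `stride`,
    so every pixel is covered by at least one patch.
    """
    h, w = len(image), len(image[0])
    acc = [[0.0] * w for _ in range(h)]
    cnt = [[0] * w for _ in range(h)]
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            block = [row[left:left + patch] for row in image[top:top + patch]]
            out = denoise(block)
            for i in range(patch):
                for j in range(patch):
                    acc[top + i][left + j] += out[i][j]
                    cnt[top + i][left + j] += 1
    return [[acc[i][j] / cnt[i][j] for j in range(w)] for i in range(h)]

def patch_mean(block):
    """Toy stand-in for a denoiser: replace each pixel with the patch mean."""
    flat = [v for row in block for v in row]
    mean = sum(flat) / len(flat)
    return [[mean] * len(block[0]) for _ in block]

image = [[float(4 * i + j) for j in range(4)] for i in range(4)]
identity_out = aggregate_patches(image, patch=2, stride=1, denoise=lambda b: b)
smoothed = aggregate_patches(image, patch=2, stride=1, denoise=patch_mean)
```

Averaging overlapping outputs suppresses seams at patch borders. With an identity "denoiser" the aggregation reproduces the input exactly, which is a convenient sanity check.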

[110] Connecting Domains and Contrasting Samples: A Ladder for Domain Generalization

Tianxin Wei,Yifan Chen,Xinrui He,Wenxuan Bao,Jingrui He

Main category: cs.CV

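TL;DR: This paper proposes a new domain generalization (DG) method, domain-connecting contrastive learning (DCCL), which significantly improves generalization to unseen target domains by strengthening intra-class connectivity across domains and exploiting pre-trained representations.

Details Motivation: Because of distribution shifts between training and test samples, directly applying contrastive learning (CL) hurts DG performance, so a new method that strengthens cross-domain intra-class connectivity is needed.

Contribution: 1. Proposes DCCL, which improves intra-class connectivity through data augmentation and cross-domain positive samples; 2. Introduces model anchoring and a generative transformation loss to exploit the intra-class connectivity in pre-trained representations.

Method: DCCL combines data-side and model-side improvements: on the data side, more aggressive data augmentation and cross-domain positive samples; on the model side, model anchoring and a generative transformation loss to better embed unseen domains.

Result: On five standard DG benchmarks, DCCL outperforms state-of-the-art baselines, even without domain supervision.

Insight: Strengthening intra-class connectivity is key to improving DG performance; contrastive learning works best when combined with domain characteristics and pre-trained representations.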

Abstract: Distribution shifts between training and testing samples frequently occur in practice and impede model generalization performance. This crucial challenge motivates studies on domain generalization (DG), which aim to predict the label on unseen target domain data by solely using data from source domains. Intuitively, the class-separated representations learned in contrastive learning (CL) should improve DG, yet the reality is quite the opposite: directly applying CL is observed to deteriorate performance. We analyze this phenomenon with insights from CL theory and find that a lack of intra-class connectivity in the DG setting causes the deficiency. We thus propose a new paradigm, domain-connecting contrastive learning (DCCL), to enhance the conceptual connectivity across domains and obtain generalizable representations for DG. On the data side, more aggressive data augmentation and cross-domain positive samples are introduced to improve intra-class connectivity. On the model side, to better embed the unseen test domains, we propose model anchoring to exploit the intra-class connectivity in pre-trained representations and complement the anchoring with a generative transformation loss. Extensive experiments on five standard DG benchmarks are performed. The results verify that DCCL outperforms state-of-the-art baselines even without domain supervision. The detailed model implementation and the code are provided through https://github.com/weitianxin/DCCL

[111] Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

Xiongkun Linghu,Jiangyong Huang,Ziyu Zhu,Baoxiong Jia,Siyuan Huang

Main category: cs.CV

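TL;DR: This paper proposes SCENECOT, a framework that the authors describe as the first to apply Chain-of-Thought reasoning to 3D scene understanding; by decoupling complex tasks and building visual clues, it improves the grounded question-answering ability of 3D large language models.

Details Motivation: Existing 3D large language models perform poorly on grounded question answering, mainly because human-like scene-object grounded reasoning has been under-explored. This paper aims to fill that gap.

Contribution: 1. Proposes the SCENECOT framework for grounded Chain-of-Thought reasoning in 3D scenes; 2. Releases SCENECOT-185K, the first large-scale grounded CoT reasoning dataset for 3D scenes.

Method: 1. Decouple a complex task into simpler sub-problems; 2. Build visual clues with multimodal expert modules; 3. Train the model on the SCENECOT-185K dataset.

Result: Experiments show strong performance on several 3D scene reasoning benchmarks, with high grounding-QA coherence.

Insight: Chain-of-Thought reasoning shows promise for 3D scene understanding and may extend to a broader range of 3D scenarios.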

Abstract: Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mechanism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of-Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.

[112] Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models

Jianbiao Mei,Yu Yang,Xuemeng Yang,Licheng Wen,Jiajun Lv,Botian Shi,Yong Liu

Main category: cs.CV

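TL;DR: The paper proposes IR-WM, an Implicit Residual World Model that models the current state of the world and its evolution; predicting only the "residual" (the part that changes) improves efficiency, and the model performs strongly on the nuScenes benchmark.

Details Motivation: End-to-end autonomous driving systems rely on vision-centric world models to understand and predict the environment, but existing methods reconstruct full future scenes inefficiently, spending too much capacity on static backgrounds.

Contribution: Proposes IR-WM, which avoids redundant modeling through an implicit residual representation, introduces an alignment module to reduce error accumulation over time, and explores ways of coupling forecasting with planning.

Method: Represent the current state in bird's-eye view (BEV), use the previous timestep's BEV features as a temporal prior, and predict only the changes; calibrate misalignments with an alignment module; study forecasting-planning coupling schemes.

Result: On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.

Insight: The implicit future state generated by the world model substantially improves planning accuracy, and modeling only the changes is more efficient than reconstructing the full scene.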

Abstract: End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common inefficiency in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR-WM first establishes a robust bird’s-eye-view representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the “residual”, i.e., the changes conditioned on the ego-vehicle’s actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting-planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.
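The residual idea can be sketched in a few lines. This is a toy illustration only: it assumes the state is a small 2D grid of numbers and that residuals are simply added to the previous state, whereas the paper works with learned BEV features. The function names `apply_residual` and `rollout` are hypothetical.

```python
def apply_residual(prev_state, residual):
    """Next state = previous state + predicted change, cell by cell."""
    return [[p + r for p, r in zip(prev_row, res_row)]
            for prev_row, res_row in zip(prev_state, residual)]

def rollout(initial_state, residuals):
    """Apply a sequence of residuals, returning the state after each step."""
    states = []
    current = initial_state
    for residual in residuals:
        current = apply_residual(current, residual)
        states.append(current)
    return states

# One occupied cell moves from the top-right to the bottom-right corner.
# Static cells carry a zero residual, so nothing is re-predicted for them.
initial = [[0, 1],
           [0, 0]]
residuals = [
    [[0, -1], [0, 1]],  # step 1: the object moves down
    [[0, 0], [0, 0]],   # step 2: nothing changes
]
states = rollout(initial, residuals)
```

Because each step builds on the previous one, errors in a residual would propagate through the rollout, which is the accumulation problem the alignment module in the abstract is meant to reduce.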

[113] UKANFormer: Noise-Robust Semantic Segmentation for Coral Reef Mapping via a Kolmogorov-Arnold Network-Transformer Hybrid

Tianyang Dou,Ming Li,Jiangying Qin,Xuan Liao,Jiageng Zhong,Armin Gruen,Mengyi Deng

Main category: cs.CV

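TL;DR: The paper proposes UKANFormer, a semantic segmentation model for high-precision coral reef mapping under noisy supervision; it combines the UKAN architecture with a Global-Local Transformer block to improve robustness and segmentation accuracy.

Details Motivation: Coral reefs are vital yet fragile ecosystems. Existing global coral reef maps such as the Allen Coral Atlas fall short in spatial precision and semantic consistency, especially along fine boundaries, so a noise-robust model is needed.

Contribution: Proposes UKANFormer, which combines UKAN with a Transformer to achieve strong semantic segmentation under noisy labels, and shows that model design can compensate for limited label quality.

Method: Builds on the UKAN architecture and adds a Global-Local Transformer (GL-Trans) block in the decoder to extract both global semantic structure and local boundary detail.

Result: UKANFormer reaches a coral-class IoU of 67.00% and a pixel accuracy of 83.98%, outperforming conventional baselines, and its predictions are more accurate than the noisy labels used for training.

Insight: The study challenges the view that data quality directly limits model performance, showing that architectural design can mitigate label noise, which opens a path for ecological monitoring where reliable labels are scarce.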

Abstract: Coral reefs are vital yet fragile ecosystems that require accurate large-scale mapping for effective conservation. Although global products such as the Allen Coral Atlas provide unprecedented coverage of global coral reef distribution, their predictions are frequently limited in spatial precision and semantic consistency, especially in regions requiring fine-grained boundary delineation. To address these challenges, we propose UKANFormer, a novel semantic segmentation model designed to achieve high-precision mapping under noisy supervision derived from Allen Coral Atlas. Building upon the UKAN architecture, UKANFormer incorporates a Global-Local Transformer (GL-Trans) block in the decoder, enabling the extraction of both global semantic structures and local boundary details. In experiments, UKANFormer achieved a coral-class IoU of 67.00% and pixel accuracy of 83.98%, outperforming conventional baselines under the same noisy labels setting. Remarkably, the model produces predictions that are visually and structurally more accurate than the noisy labels used for training. These results challenge the notion that data quality directly limits model performance, showing that architectural design can mitigate label noise and support scalable mapping under imperfect supervision. UKANFormer provides a foundation for ecological monitoring where reliable labels are scarce.
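For readers unfamiliar with the two reported metrics, the following sketch shows how a single-class IoU and pixel accuracy are conventionally computed from a predicted mask and a reference mask. The function name and the toy masks are illustrative assumptions; the paper's exact evaluation protocol is not given in the abstract.

```python
def class_iou_and_pixel_accuracy(pred, target, positive_class=1):
    """Return (IoU of positive_class, overall pixel accuracy) for two 2D label grids."""
    tp = fp = fn = correct = total = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += 1
            if p == t:
                correct += 1
            if p == positive_class and t == positive_class:
                tp += 1
            elif p == positive_class:
                fp += 1
            elif t == positive_class:
                fn += 1
    union = tp + fp + fn
    iou = tp / union if union else float("nan")
    return iou, correct / total

# 1 marks the class of interest (e.g. coral), 0 marks everything else.
pred = [[1, 1],
        [0, 0]]
target = [[1, 0],
          [0, 0]]
iou, accuracy = class_iou_and_pixel_accuracy(pred, target)
```

Here one of the two predicted positive pixels is correct, giving an IoU of 0.5, while three of four pixels match overall, giving a pixel accuracy of 0.75. The gap between the two numbers shows why both are usually reported.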

[114] A Comprehensive Survey on World Models for Embodied AI

Xinqing Li,Xin He,Le Zhang,Yun Liu

Main category: cs.CV

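TL;DR: This survey presents a unified framework that categorizes world models for embodied AI by functionality, temporal modeling, and spatial representation, and summarizes current data resources, evaluation metrics, and open challenges.

Details Motivation: Embodied AI requires agents that can perceive, act, and anticipate how their actions affect future world states. World models act as internal simulators that capture environment dynamics and support perception, prediction, and decision making.

Contribution: Proposes a three-axis taxonomy covering functionality, temporal modeling, and spatial representation; systematizes data resources and evaluation metrics; provides a quantitative comparison and distills open challenges.

Method: Organizes world models along the three axes of functionality, temporal modeling, and spatial representation, and systematically catalogs the related data resources and evaluation metrics.

Result: Summarizes the strengths and weaknesses of current world models and points out the scarcity of unified datasets and the inadequacy of existing evaluation standards.

Insight: Highlights the core problems for future research: unifying datasets, improving evaluation metrics, trading off performance against computational efficiency, and modeling long-horizon temporal consistency.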

Abstract: Embodied AI requires agents that perceive, act, and anticipate how actions reshape future world states. World models serve as internal simulators that capture environment dynamics, enabling forward and counterfactual rollouts to support perception, prediction, and decision making. This survey presents a unified framework for world models in embodied AI. Specifically, we formalize the problem setting and learning objectives, and propose a three-axis taxonomy encompassing: (1) Functionality, Decision-Coupled vs. General-Purpose; (2) Temporal Modeling, Sequential Simulation and Inference vs. Global Difference Prediction; (3) Spatial Representation, Global Latent Vector, Token Feature Sequence, Spatial Latent Grid, and Decomposed Rendering Representation. We systematize data resources and metrics across robotics, autonomous driving, and general video settings, covering pixel prediction quality, state-level understanding, and task performance. Furthermore, we offer a quantitative comparison of state-of-the-art models and distill key open challenges, including the scarcity of unified datasets and the need for evaluation metrics that assess physical consistency over pixel fidelity, the trade-off between model performance and the computational efficiency required for real-time control, and the core modeling difficulty of achieving long-horizon temporal consistency while mitigating error accumulation. Finally, we maintain a curated bibliography at https://github.com/Li-Zn-H/AwesomeWorldModels.

[115] Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling

Erik Riise,Mehmet Onurcan Kaya,Dim P. Papadopoulos

Main category: cs.CV

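TL;DR: This paper shows that visual autoregressive models beat diffusion models at inference-time scaling: search techniques such as beam search substantially improve text-to-image generation.

Details Motivation: Large language models have made major gains through inference-time search, but comparable techniques have been hard to transfer to image generation. Search strategies bring limited benefit for diffusion models, so the paper studies the potential of autoregressive models.

Contribution: Finds that discrete, sequential visual autoregressive models can improve image generation through beam search, outperforming even a larger diffusion model, and analyzes how the discrete token space benefits computational efficiency and reasoning capability.

Method: Applies beam search and related techniques to the inference process of visual autoregressive models, using early pruning and computational reuse in the discrete token space to improve efficiency.

Result: Experiments show that a 2B-parameter visual autoregressive model outperforms a 12B-parameter diffusion model on text-to-image generation.

Insight: Model architecture (for example, a discrete token space), not just model scale, is critical for inference-time optimization, and the verifier analysis reveals trade-offs between speed and reasoning capability.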

Abstract: While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best. We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. Systematic ablations show that this advantage comes from the discrete token space, which allows early pruning and computational reuse, and our verifier analysis highlights trade-offs between speed and reasoning capability. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.
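Why a discrete, sequential token space lends itself to search can be seen in a minimal beam search sketch. It assumes a hypothetical `score_fn` that plays the role of a verifier scoring partial token sequences, and it expands every token at every step, which a real system would replace with proposals from the autoregressive model. None of the names below come from the paper.

```python
def beam_search(score_fn, vocab_size, length, beam_width):
    """Keep the beam_width best partial sequences at each step; return the best (tokens, score)."""
    beams = [((), 0.0)]
    for _ in range(length):
        candidates = []
        for tokens, _ in beams:
            for token in range(vocab_size):
                extended = tokens + (token,)
                candidates.append((extended, score_fn(extended)))
        # Early pruning: low-scoring prefixes are dropped and never extended again.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy verifier: prefers sequences close to a hidden target sequence.
TARGET = (2, 0, 1)

def toy_verifier(tokens):
    return -sum(abs(tok - TARGET[i]) for i, tok in enumerate(tokens))

best_tokens, best_score = beam_search(toy_verifier, vocab_size=3, length=3, beam_width=2)
```

Pruning works here because a partial sequence is a well-defined object that can be scored and discarded before the rest is generated. In a continuous diffusion process there is no equivalent prefix to discard, which is consistent with the abstract's observation that search helps less there.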

[116] WaMaIR: Image Restoration via Multiscale Wavelet Convolutions and Mamba-based Channel Modeling with Texture Enhancement

Shengyu Zhu,Fan,Fuxuan Zhang

Main category: cs.CV

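TL;DR: WaMaIR is an image restoration framework that combines multiscale wavelet convolutions with Mamba-based channel modeling and achieves better restoration by enhancing texture detail.

Details Motivation: Existing CNN methods struggle to restore fine texture detail because of their limited receptive field and lack of channel feature modeling. WaMaIR aims to enlarge the receptive field and strengthen texture reconstruction.

Contribution: 1. Proposes Global Multiscale Wavelet Transform Convolutions (GMWTConvs) to expand the receptive field; 2. Designs a Mamba-Based Channel-Aware Module (MCAM) to model long-range dependencies; 3. Introduces a Multiscale Texture Enhancement Loss (MTELoss) to optimize texture detail.

Method: 1. Extract multiscale features with GMWTConvs; 2. Capture long-range dependencies across channels with MCAM; 3. Use MTELoss to guide the model toward preserving texture structure.

Result: Experiments show that WaMaIR outperforms existing methods, combining efficient computation with better restoration quality.

Insight: Combining wavelet transforms with Mamba effectively expands the receptive field and improves texture modeling, and a multiscale loss is important for restoration quality.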

Abstract: Image restoration is a fundamental and challenging task in computer vision, where CNN-based frameworks demonstrate significant computational efficiency. However, previous CNN-based methods often face challenges in adequately restoring fine texture details, which are limited by the small receptive field of CNN structures and the lack of channel feature modeling. In this paper, we propose WaMaIR, a novel framework that has a large receptive field for image perception and improves the reconstruction of texture details in restored images. Specifically, we introduce the Global Multiscale Wavelet Transform Convolutions (GMWTConvs) for expanding the receptive field to extract image features, preserving and enriching texture features in model inputs. Meanwhile, we propose the Mamba-Based Channel-Aware Module (MCAM), explicitly designed to capture long-range dependencies within feature channels, which enhances the model's sensitivity to color, edges, and texture information. Additionally, we propose Multiscale Texture Enhancement Loss (MTELoss) for image restoration to guide the model in preserving detailed texture structures effectively. Extensive experiments confirm that WaMaIR outperforms state-of-the-art methods, achieving better image restoration and efficient computational performance of the model.
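How a wavelet transform enlarges the effective receptive field can be shown with a one-level 1D Haar decomposition. This is a sketch under assumptions: the abstract does not state which wavelet family GMWTConvs use, so Haar (in its averaging form) is chosen here only because it is the simplest, and the function names are hypothetical.

```python
def haar_decompose(signal):
    """Split an even-length signal into a half-length average band and a detail band."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_reconstruct(approx, detail):
    """Invert haar_decompose exactly."""
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([a + d, a - d])
    return signal

signal = [4, 6, 10, 12]
approx, detail = haar_decompose(signal)
restored = haar_reconstruct(approx, detail)
```

Each coefficient in `approx` summarizes two input samples, so a small convolution kernel applied to that band covers twice as much of the original signal, and repeating the decomposition doubles the coverage again. The `detail` band keeps the high-frequency content, and the transform is exactly invertible, so no texture information is discarded.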

[117] Region in Context: Text-condition Image editing with Human-like semantic reasoning

Thuy Phuong Vu,Dinh-Cuong Hoang,Minhhuy Le,Phan Xuan Tan

Main category: cs.CV

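TL;DR: The paper proposes "Region in Context", a text-conditioned image editing framework that uses multilevel semantic alignment to make each edit aware of the global image context, much as a person reasons about an edit, producing more coherent and instruction-aligned results.

Details Motivation: Existing text-conditioned image editing methods usually treat image regions in isolation and ignore the overall visual and semantic composition, which leads to inconsistent or unnatural edits.

Contribution: Proposes a dual-level guidance mechanism that combines full-image context with detailed region-level descriptions to align regional and global semantics precisely.

Method: A large vision-language model generates a scene-level description, and editing is guided at both the region level and the global level.

Result: Experiments show that the method produces more coherent and instruction-aligned edits.

Insight: Combining global context with region-level semantics improves the quality and consistency of text-conditioned image editing.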

Abstract: Recent research has made significant progress in localizing and editing image regions based on text. However, most approaches treat these regions in isolation, relying solely on local cues without accounting for how each part contributes to the overall visual and semantic composition. This often results in inconsistent edits, unnatural transitions, or loss of coherence across the image. In this work, we propose Region in Context, a novel framework for text-conditioned image editing that performs multilevel semantic alignment between vision and language, inspired by the human ability to reason about edits in relation to the whole scene. Our method encourages each region to understand its role within the global image context, enabling precise and harmonized changes. At its core, the framework introduces a dual-level guidance mechanism: regions are represented with full-image context and aligned with detailed region-level descriptions, while the entire image is simultaneously matched to a comprehensive scene-level description generated by a large vision-language model. These descriptions serve as explicit verbal references of the intended content, guiding both local modifications and global structure. Experiments show that it produces more coherent and instruction-aligned results. Code is available at: https://github.com/thuyvuphuong/Region-in-Context.git

[118] EMRRG: Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation

Mingzheng Zhang,Jinfeng Gao,Dan Xu,Jiangrui Yu,Yuhan Qiao,Lan Chen,Jin Tang,Xiao Wang

Main category: cs.CV

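TL;DR: EMRRG is an efficient fine-tuning approach for X-ray report generation built on Mamba networks; it improves performance using Partial LoRA and an SSM-based vision backbone.

Details Motivation: Existing medical report generation models rely mainly on large language models (LLMs), while pre-trained vision foundation models and efficient fine-tuning techniques are under-studied, and the potential of non-Transformer architectures such as Mamba in the medical domain has not been fully explored.

Contribution: 1) Proposes EMRRG, an efficient fine-tuning framework that combines Mamba networks with Partial LoRA; 2) Explores SSM-based vision backbones for medical report generation; 3) Validates the approach on several benchmark datasets.

Method: 1) Split the X-ray image into patches and tokenize them; 2) Extract features with an SSM-based vision backbone; 3) Fine-tune with Partial LoRA; 4) Generate the report with an LLM that has a hybrid decoder.

Result: Strong results on three benchmark datasets support the effectiveness of the approach.

Insight: 1) Non-Transformer architectures such as Mamba show potential for medical imaging tasks; 2) Partial LoRA is an efficient fine-tuning strategy.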

Abstract: X-ray image-based medical report generation (MRG) is a pivotal area in artificial intelligence that can significantly reduce diagnostic burdens for clinicians and patient wait times. Existing MRG models predominantly rely on Large Language Models (LLMs) to improve report generation, with limited exploration of pre-trained vision foundation models or advanced fine-tuning techniques. Mainstream frameworks either avoid fine-tuning or utilize simplistic methods like LoRA, often neglecting the potential of enhancing cross-attention mechanisms. Additionally, while Transformer-based models dominate vision-language tasks, non-Transformer architectures, such as the Mamba network, remain underexplored for medical report generation, presenting a promising avenue for future research. In this paper, we propose EMRRG, a novel X-ray report generation framework that fine-tunes pre-trained Mamba networks using parameter-efficient methods. Specifically, X-ray images are divided into patches, tokenized, and processed by an SSM-based vision backbone for feature extraction, with Partial LoRA yielding optimal performance. An LLM with a hybrid decoder generates the medical report, enabling end-to-end training and achieving strong results on benchmark datasets. Extensive experiments on three widely used benchmark datasets fully validated the effectiveness of our proposed strategies for the X-ray MRG. The source code of this paper will be released on https://github.com/Event-AHU/Medical_Image_Analysis.

[119] GS2POSE: Marry Gaussian Splatting to 6D Object Pose Estimation

Junbo Li,Weimin Yuan,Yinuo Wang,Yue Zeng,Shihao Shu,Cai Meng,Xiangzhi Bai

Main category: cs.CV

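TL;DR: GS2POSE is a 6D object pose estimation method that combines Gaussian Splatting (3DGS) with a pose regression algorithm; it targets textureless objects and varying illumination, and outperforms previous models on several datasets.

Details Motivation: Existing 6D pose estimation methods based on 2D-3D feature correspondences struggle with textureless objects and changes in illumination, so a more robust method is needed.

Contribution: Proposes GS2POSE, which combines Gaussian Splatting (3DGS) with ideas from Bundle Adjustment, improving pose accuracy through pose-differentiable rendering and updates to color parameters.

Method: Extends 3DGS with Lie algebra to build a pose-differentiable rendering pipeline that optimizes the pose iteratively, and updates the color parameters of the 3DGS model to adapt to illumination changes.

Result: Accuracy improves by 1.4%, 2.8%, and 2.5% on the T-LESS, LineMod-Occlusion, and LineMod datasets, respectively.

Insight: By combining 3DGS with pose optimization, GS2POSE shows potential for more robust pose estimation in complex scenes.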

Abstract: Accurate 6D pose estimation of 3D objects is a fundamental task in computer vision, and current research typically predicts the 6D pose by establishing correspondences between 2D image features and 3D model features. However, these methods often face difficulties with textureless objects and varying illumination conditions. To overcome these limitations, we propose GS2POSE, a novel approach for 6D object pose estimation. GS2POSE formulates a pose regression algorithm inspired by the principles of Bundle Adjustment (BA). By leveraging Lie algebra, we extend the capabilities of 3DGS to develop a pose-differentiable rendering pipeline, which iteratively optimizes the pose by comparing the input image to the rendered image. Additionally, GS2POSE updates color parameters within the 3DGS model, enhancing its adaptability to changes in illumination. Compared to previous models, GS2POSE demonstrates accuracy improvements of 1.4%, 2.8% and 2.5% on the T-LESS, LineMod-Occlusion and LineMod datasets, respectively.
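The reason Lie algebra is useful for pose optimization is that a rotation can be parameterized by an unconstrained 3-vector and mapped to a valid rotation matrix, so gradient steps never leave the space of rotations. The sketch below implements the standard exponential map from so(3) to SO(3) using Rodrigues' formula. It covers rotation only and is not taken from the paper, whose full pose parameterization is not described in the abstract.

```python
import math

def so3_exp(omega):
    """Map an axis-angle vector (wx, wy, wz) to a 3x3 rotation matrix (Rodrigues' formula)."""
    wx, wy, wz = omega
    theta = math.sqrt(wx * wx + wy * wy + wz * wz)
    # Skew-symmetric matrix of omega.
    K = [[0.0, -wz, wy],
         [wz, 0.0, -wx],
         [-wy, wx, 0.0]]
    if theta < 1e-8:
        a, b = 1.0, 0.5  # limits of sin(t)/t and (1 - cos(t))/t^2 as t -> 0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    K2 = [[sum(K[i][k] * K[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j] for j in range(3)]
            for i in range(3)]

# A quarter turn about the z-axis maps the x-axis onto the y-axis.
R = so3_exp((0.0, 0.0, math.pi / 2))
```

An iterative optimizer in this style would update `omega` (and a translation vector) from the image-versus-render error, then re-render with the resulting pose.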

[120] Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features

Shihao Ji,Zihui Song

Main category: cs.CV

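TL;DR: The paper proposes a training-free video understanding framework that combines the semantic priors of a pre-trained vision-language model (VLM) with classic machine learning algorithms, performing zero-shot video analysis through self-supervised spatio-temporal clustering.

Details Motivation: Existing video understanding models depend on large annotated datasets and task-specific training, which is costly and scales poorly. The paper aims to use the zero-shot reasoning ability of pre-trained VLMs to avoid training and apply them directly to video analysis.

Contribution: 1. Proposes a training-free video understanding framework; 2. Reframes video understanding as a self-supervised spatio-temporal clustering problem; 3. Combines VLM semantic features with classic machine learning (KTS and density-based clustering) to produce structured video summaries.

Method: 1. Extract semantic features from the video with the frozen visual encoder of a pre-trained VLM; 2. Segment the feature stream with the Kernel Temporal Segmentation (KTS) algorithm; 3. Identify semantic scenes through density-based clustering; 4. Select keyframes and generate text descriptions.

Result: The framework performs zero-shot structural analysis of video and automatically generates multimodal summaries without any training data or fine-tuning.

Insight: Combining the semantic priors of pre-trained models with classic machine learning algorithms offers an efficient and interpretable route to training-free video understanding.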

Abstract: The remarkable zero-shot reasoning capabilities of large-scale Visual Language Models (VLMs) on static images have yet to be fully translated to the video domain. Conventional video understanding models often rely on extensive, task-specific training on annotated datasets, a process that is both costly and limited in scalability. This paper introduces a novel, training-free framework for video understanding that circumvents end-to-end training by synergistically combining the rich semantic priors of pre-trained VLMs with classic machine learning algorithms for pattern discovery. Our core idea is to reframe video understanding as a self-supervised spatio-temporal clustering problem within a high-dimensional semantic feature space. The proposed pipeline first transforms a video stream into a semantic feature trajectory using the frozen visual encoder of a pre-trained VLM. Subsequently, we employ Kernel Temporal Segmentation (KTS), a robust machine learning technique, to partition the continuous feature stream into discrete, semantically coherent event segments. These segments are then subjected to unsupervised density-based clustering to identify recurring macroscopic scenes and themes throughout the video. By selecting representative keyframes from each discovered cluster and leveraging the VLM’s generative capabilities for textual description, our framework automatically produces a structured, multi-modal summary of the video content. This approach provides an effective, interpretable, and model-agnostic pathway for zero-shot, automated structural analysis of video content.
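The segment-then-select part of the pipeline can be sketched as follows. Note the substitution: instead of Kernel Temporal Segmentation, this toy version cuts the feature stream wherever the cosine similarity between consecutive frames drops below a threshold, and it omits the density-based clustering step. The threshold, function names, and 2D "features" are all illustrative assumptions; real features would come from a frozen VLM encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def segment_by_similarity(features, threshold=0.5):
    """Simplified stand-in for KTS: return (start, end) pairs with end exclusive."""
    boundaries = [0]
    for t in range(1, len(features)):
        if cosine(features[t - 1], features[t]) < threshold:
            boundaries.append(t)
    boundaries.append(len(features))
    return list(zip(boundaries[:-1], boundaries[1:]))

def keyframe(features, start, end):
    """Index of the frame whose feature is closest to the segment mean."""
    dim = len(features[0])
    mean = [sum(features[t][d] for t in range(start, end)) / (end - start)
            for d in range(dim)]
    return max(range(start, end), key=lambda t: cosine(features[t], mean))

# Five frames: three from one scene, two from a visually different one.
features = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2], [0.0, 1.0], [0.1, 1.0]]
segments = segment_by_similarity(features)
first_keyframe = keyframe(features, *segments[0])
```

In the framework described above, each selected keyframe would then be passed to the VLM for a textual description, yielding one entry of the structured summary per discovered scene.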

[121] Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs

Jiazhen Liu,Long Chen

Main category: cs.CV

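TL;DR: The paper proposes LENS, which attaches a lightweight trainable head that analyzes the attention maps of a frozen multimodal large language model to provide pixel-level segmentation, without changing the model's parameters and while preserving its generalization ability.

Details Motivation: Multimodal large language models (MLLMs) need to integrate diverse visual capabilities in a unified framework, but conventional segmentation approaches fine-tune the model and damage its generalization. LENS aims for a plug-and-play segmentation solution that requires no fine-tuning.

Contribution: Proposes LENS, a plug-and-play segmentation solution based on keypoints extracted from attention maps; it preserves the MLLM's generalization ability and performs on par with or better than fine-tuning-based methods.

Method: LENS trains a lightweight head that extracts keypoints from the attention maps of a frozen MLLM and converts them into features compatible with a mask decoder, enabling pixel-level segmentation.

Result: Experiments show that LENS matches or exceeds the segmentation performance of fine-tuning-based methods while fully preserving the MLLM's original capabilities.

Insight: By keeping the base model frozen and moving the segmentation capability into an attachable head, LENS offers an efficient way to extend multimodal models without harming generalization.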

Abstract: Integrating diverse visual capabilities into a unified model is a significant trend in Multimodal Large Language Models (MLLMs). Among these, the inclusion of segmentation poses a distinct set of challenges. To equip MLLMs with pixel-level segmentation abilities, prevailing methods require finetuning the model to produce specific outputs compatible with a mask decoder. This process typically alters the model’s output space and compromises its intrinsic generalization, which undermines the goal of building a unified model. We introduce LENS (Leveraging kEypoiNts for MLLMs’ Segmentation), a novel plug-and-play solution. LENS attaches a lightweight, trainable head to a completely frozen MLLM. By refining the spatial cues embedded in attention maps, LENS extracts keypoints and describes them into point-wise features directly compatible with the mask decoder. Extensive experiments validate our approach: LENS achieves segmentation performance competitive with or superior to that of retraining-based methods. Crucially, it does so while fully preserving the MLLM’s generalization capabilities, which are significantly degraded by finetuning approaches. As such, the attachable design of LENS establishes an efficient and powerful paradigm for extending MLLMs, paving the way for truly multi-talented, unified models.

[122] Unsupervised Monocular Road Segmentation for Autonomous Driving via Scene Geometry

Sara Hatami Rostami,Behrooz Nasihatkon

Main category: cs.CV

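TL;DR: This paper proposes a fully unsupervised method for binary road segmentation that uses scene geometry and temporal consistency to separate road from non-road regions, removing the dependence on costly labeled data.

Details Motivation: Conventional road segmentation methods rely on large amounts of manually labeled data, which is expensive and hard to scale. The paper addresses this with an unsupervised method that uses scene geometry and temporal cues for efficient, scalable road segmentation.

Contribution: Proposes an unsupervised method that combines geometric priors with temporal consistency to generate road segmentation labels, refined through mutual information maximization, achieving high accuracy and stability.

Method: The method has two stages. First, geometric priors produce weak labels: pixels above the horizon are non-road, and a predefined quadrilateral in front of the vehicle is road. Then the labels are refined through temporal consistency, using feature point tracking and mutual information maximization.

Result: On the Cityscapes dataset, the model reaches an IoU of 0.82, showing high accuracy with a simple design.

Insight: Combining geometric constraints with temporal consistency enables efficient and scalable unsupervised road segmentation, and suggests a direction for autonomous driving methods that depend less on labeled data.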

Abstract: This paper presents a fully unsupervised approach for binary road segmentation (road vs. non-road), eliminating the reliance on costly manually labeled datasets. The method leverages scene geometry and temporal cues to distinguish road from non-road regions. Weak labels are first generated from geometric priors, marking pixels above the horizon as non-road and a predefined quadrilateral in front of the vehicle as road. In a refinement stage, temporal consistency is enforced by tracking local feature points across frames and penalizing inconsistent label assignments using mutual information maximization. This enhances both precision and temporal stability. On the Cityscapes dataset, the model achieves an Intersection-over-Union (IoU) of 0.82, demonstrating high accuracy with a simple design. These findings demonstrate the potential of combining geometric constraints and temporal consistency for scalable unsupervised road segmentation in autonomous driving.
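The weak-label stage can be sketched directly from its description: pixels above a horizon row are labeled non-road, pixels inside a quadrilateral in front of the vehicle are labeled road, and everything else is left unlabeled. The label codes (0, 1, -1), the function names, and the toy image size are assumptions for illustration; the paper's actual horizon estimate and quadrilateral are not given in the abstract.

```python
def inside_convex_quad(x, y, quad):
    """True if (x, y) lies inside or on the convex quadrilateral given as four vertices in order."""
    crosses = []
    for i in range(4):
        x1, y1 = quad[i]
        x2, y2 = quad[(i + 1) % 4]
        crosses.append((x2 - x1) * (y - y1) - (y2 - y1) * (x - x1))
    return all(c >= 0 for c in crosses) or all(c <= 0 for c in crosses)

def weak_labels(height, width, horizon_row, quad):
    """Label grid: 0 = non-road (above horizon), 1 = road (inside quad), -1 = unknown."""
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            if y < horizon_row:
                row.append(0)
            elif inside_convex_quad(x, y, quad):
                row.append(1)
            else:
                row.append(-1)
        labels.append(row)
    return labels

# 4x4 image, horizon at row 2, trapezoid widening toward the bottom of the image.
quad = [(1, 2), (2, 2), (3, 3), (0, 3)]
labels = weak_labels(height=4, width=4, horizon_row=2, quad=quad)
```

The unknown pixels are where the refinement stage does its work: temporal consistency between tracked feature points decides which side of the boundary they belong to.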

[123] Personalized Image Filter: Mastering Your Photographic Style

Chengxuan Zhu,Shuchen Weng,Jiacong Fang,Peixuan Zhang,Si Li,Chao Xu,Boxin Shi

Main category: cs.CV

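TL;DR: The paper proposes a Personalized Image Filter (PIF) built on a pre-trained text-to-image diffusion model and the textual inversion technique; it learns and transfers diverse photographic styles, addressing the weaknesses of earlier methods in concept learning and content preservation.

Details Motivation: Photographic style is central to the appeal of renowned photographers' work, but earlier methods either fail to learn meaningful photographic concepts from reference images or cannot preserve the content of the content image. PIF aims to solve both problems.

Contribution: 1. Proposes the PIF framework, which combines a diffusion model with textual inversion; 2. Learns the average appearance of photographic concepts and how to adjust them according to text prompts; 3. Performs strongly at extracting and transferring a variety of styles.

Method: Starting from a pre-trained text-to-image diffusion model, PIF uses textual inversion to optimize the prompts for photographic concepts, which enables style extraction and transfer.

Result: PIF shows outstanding performance in extracting and transferring many kinds of photographic style, surpassing earlier methods.

Insight: Combining the generative prior of a diffusion model with textual inversion gives a new approach to learning and transferring personalized styles.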

Abstract: Photographic style, as a composition of certain photographic concepts, is the charm behind renowned photographers. However, learning and transferring photographic style requires a profound understanding of how the photo is edited from the unknown original appearance. Previous works either fail to learn meaningful photographic concepts from reference images, or cannot preserve the content of the content image. To tackle these issues, we propose a Personalized Image Filter (PIF). Based on a pretrained text-to-image diffusion model, the generative prior enables PIF to learn the average appearance of photographic concepts, as well as how to adjust them according to text prompts. PIF then learns the photographic style of reference images with the textual inversion technique, by optimizing the prompts for the photographic concepts. PIF shows outstanding performance in extracting and transferring various kinds of photographic style. Project page: https://pif.pages.dev/

[124] An RGB-D Image Dataset for Lychee Detection and Maturity Classification for Robotic Harvesting

Zhenpeng Zhang,Yi Wang,Shanglei Chai,Yingying Liu,Zekai Xie,Wenhao Huang,Pengyu Li,Zipei Luo,Dajiang Lu,Yibin Tian

Main category: cs.CV

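TL;DR: The paper builds a high-quality, open-source color-and-depth (RGB-D) lychee dataset to support the development of vision-based lychee harvesting robots, covering multiple varieties, weather conditions, and ripeness stages.

Details Motivation: High-quality, open-source, consistently annotated lychee datasets are lacking, which holds back vision-based lychee harvesting robots. The authors build a comprehensive dataset to address this.

Contribution: Provides a lychee dataset of 11,414 images covering multiple lychee varieties, weather conditions, and ripeness stages, together with detailed statistical analysis and model-based validation.

Method: Starting from 878 raw RGB images, the authors produce 8,780 augmented RGB images and 1,756 depth images, annotated with 9,658 pairs of labels for lychee detection and maturity classification. Three people labeled the data independently, and a fourth reviewer verified the aggregated results to ensure consistency.

Result: The dataset was validated through experiments with three deep learning models. It is publicly available for academic research.

Insight: High-quality, diverse datasets are key to advancing agricultural robotics, especially in complex natural environments.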

Abstract: Lychee is a high-value subtropical fruit. The adoption of vision-based harvesting robots can significantly improve productivity while reducing reliance on labor. High-quality data are essential for developing such harvesting robots. However, there are currently no consistently and comprehensively annotated open-source lychee datasets featuring fruits in natural growing environments. To address this, we constructed a dataset to facilitate lychee detection and maturity classification. Color (RGB) images were acquired under diverse weather conditions, and at different times of the day, across multiple lychee varieties, such as Nuomici, Feizixiao, Heiye, and Huaizhi. The dataset encompasses three different ripeness stages and contains 11,414 images, consisting of 878 raw RGB images, 8,780 augmented RGB images, and 1,756 depth images. The images are annotated with 9,658 pairs of labels for lychee detection and maturity classification. To improve annotation consistency, three individuals independently labeled the data, and their results were then aggregated and verified by a fourth reviewer. Detailed statistical analyses were done to examine the dataset. Finally, we performed experiments using three representative deep learning models to evaluate the dataset. It is publicly available for academic
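The image counts in the abstract can be checked with simple arithmetic. The sketch below uses only the reported numbers; the interpretation that the 9,658 label pairs correspond to one pair per RGB image is an inference from the matching totals, not something the abstract states.

```python
raw_rgb = 878
augmented_rgb = 8_780
depth = 1_756

total_images = raw_rgb + augmented_rgb + depth  # reported total: 11,414
rgb_images = raw_rgb + augmented_rgb            # matches the 9,658 label pairs
augmentation_factor = augmented_rgb // raw_rgb  # augmented images per raw image
depth_per_raw = depth // raw_rgb                # depth images per raw image
```

The totals are internally consistent: the three subsets sum to 11,414, the augmented set is exactly ten times the raw set, and there are exactly two depth images per raw image.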

[125] From Mannequin to Human: A Pose-Aware and Identity-Preserving Video Generation Framework for Lifelike Clothing Display

Xiangyu Mu,Dongliang Zhou,Jie Hou,Haijun Zhang,Weili Guan

Main category: cs.CV

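TL;DR: M2HVideo is a pose-aware, identity-preserving video generation framework that turns mannequin footage into photorealistic human video, addressing the misalignment between head and body motion and identity drift.

Details Motivation: Mannequin-based clothing displays lack realism and expressive detail and cannot fully convey how garments look in online fashion presentation, so a method is needed to convert mannequin videos into realistic, detailed human videos.

Contribution: 1. Proposes the M2HVideo framework, which addresses head-body motion misalignment and identity drift; 2. Designs a dynamic pose-aware head encoder and a distribution-aware adapter; 3. Introduces a mirror loss and a DDIM-based one-step denoising procedure.

Method: 1. A dynamic pose-aware head encoder fuses facial semantics with body pose; 2. A mirror loss is applied in pixel space through DDIM-based one-step denoising; 3. A distribution-aware adapter aligns the statistical distributions of identity and clothing features.

Result: Experiments on several datasets show that M2HVideo achieves superior clothing consistency, identity preservation, and video fidelity.

Insight: Combining pose-aware and identity-preserving techniques with the strengths of diffusion models improves the realism and detail of generated video.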

Abstract: Mannequin-based clothing displays offer a cost-effective alternative to real-model showcases for online fashion presentation, but lack realism and expressive detail. To overcome this limitation, we introduce a new task called mannequin-to-human (M2H) video generation, which aims to synthesize identity-controllable, photorealistic human videos from footage of mannequins. We propose M2HVideo, a pose-aware and identity-preserving video generation framework that addresses two key challenges: the misalignment between head and body motion, and identity drift caused by temporal modeling. In particular, M2HVideo incorporates a dynamic pose-aware head encoder that fuses facial semantics with body pose to produce consistent identity embeddings across frames. To address the loss of fine facial details due to latent space compression, we introduce a mirror loss applied in pixel space through a denoising diffusion implicit model (DDIM)-based one-step denoising. Additionally, we design a distribution-aware adapter that aligns statistical distributions of identity and clothing features to enhance temporal coherence. Extensive experiments on the UBC fashion dataset, our self-constructed ASOS dataset, and the newly collected MannequinVideos dataset captured on-site demonstrate that M2HVideo achieves superior performance in terms of clothing consistency, identity preservation, and video fidelity in comparison to state-of-the-art methods.

[126] ArmFormer: Lightweight Transformer Architecture for Real-Time Multi-Class Weapon Segmentation and Classification

Akhila Kambhatla,Taminul Islam,Khaled R Ahmed

Main category: cs.CV

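TL;DR: ArmFormer is a lightweight Transformer-based semantic segmentation framework that combines CBAM with the MixVisionTransformer architecture for efficient real-time multi-class weapon segmentation and classification, suitable for deployment on edge devices.

Details Motivation: Weapon-related violence is a growing threat, and real-time security applications need systems capable of pixel-level detection. Conventional methods provide only coarse bounding boxes and cannot meet the need for fine-grained segmentation, while existing semantic segmentation models fail to balance accuracy with computational efficiency.

Contribution: Proposes ArmFormer, a lightweight Transformer architecture that combines CBAM with MixVisionTransformer to achieve accurate multi-class weapon segmentation.

Method: Uses a CBAM-enhanced encoder backbone and an attention-integrated hamburger decoder to segment five categories: handgun, rifle, knife, revolver, and human.

Result: Achieves state-of-the-art performance with 80.64% mIoU and 89.13% mFscore, real-time inference at 82.26 FPS, only 3.66M parameters, and 4.886G FLOPs.

Insight: ArmFormer demonstrates the potential of lightweight Transformer models on edge devices, balancing accuracy and computational efficiency for distributed security infrastructure.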

Abstract: The escalating threat of weapon-related violence necessitates automated detection systems capable of pixel-level precision for accurate threat assessment in real-time security applications. Traditional weapon detection approaches rely on object detection frameworks that provide only coarse bounding box localizations, lacking the fine-grained segmentation required for comprehensive threat analysis. Furthermore, existing semantic segmentation models either sacrifice accuracy for computational efficiency or require excessive computational resources incompatible with edge deployment scenarios. This paper presents ArmFormer, a lightweight transformer-based semantic segmentation framework that strategically integrates Convolutional Block Attention Module (CBAM) with MixVisionTransformer architecture to achieve superior accuracy while maintaining computational efficiency suitable for resource-constrained edge devices. Our approach combines CBAM-enhanced encoder backbone with attention-integrated hamburger decoder to enable multi-class weapon segmentation across five categories: handgun, rifle, knife, revolver, and human. Comprehensive experiments demonstrate that ArmFormer achieves state-of-the-art performance with 80.64% mIoU and 89.13% mFscore while maintaining real-time inference at 82.26 FPS. With only 4.886G FLOPs and 3.66M parameters, ArmFormer outperforms heavyweight models requiring up to 48x more computation, establishing it as the optimal solution for deployment on portable security cameras, surveillance drones, and embedded AI accelerators in distributed security infrastructure.
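The efficiency figures can be put in more familiar terms with a short calculation. The sketch uses only the numbers reported in the abstract. It reads "up to 48x more computation" as 48 times ArmFormer's FLOPs, and the 30 FPS video rate used for comparison is an assumption chosen for illustration.

```python
fps = 82.26          # reported inference speed
flops_g = 4.886      # reported cost in GFLOPs

latency_ms = 1000.0 / fps                  # time per frame in milliseconds
heaviest_baseline_flops_g = 48 * flops_g   # assumed reading of "48x more computation"

video_frame_budget_ms = 1000.0 / 30.0      # per-frame budget for an assumed 30 FPS stream
headroom_ms = video_frame_budget_ms - latency_ms
```

At roughly 12 ms per frame, the model would leave about 21 ms of a 30 FPS frame budget for capture, pre-processing, and post-processing, although the abstract does not say on which hardware the 82.26 FPS figure was measured.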

[127] Uncovering Brain-Like Hierarchical Patterns in Vision-Language Models through fMRI-Based Neural Encoding

Yudan Ren,Xinlong Wang,Kexin Wang,Tian Xia,Zihan Ma,Zhaowei Li,Xiangrong Bi,Xiao Li,Xiaowei He

Main category: cs.CV

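TL;DR: The paper proposes a neuron-level analysis framework that uses fMRI brain activity data to study how vision-language models (VLMs) process multimodal information, revealing brain-like hierarchical processing patterns in these models.

Details Motivation: Current understanding of the relationship between artificial neural networks (ANNs) and human brain processing is limited: unimodal ANN studies cannot capture the brain's multimodal processing, while multimodal ANN studies focus on high-level model outputs and neglect the role of individual neurons. This paper aims to fill that gap.

Contribution: 1. Proposes an fMRI-based neuron-level analysis framework for studying multimodal processing in VLMs. 2. Reveals shared representational mechanisms between artificial neurons (ANs) and biological neurons (BNs) across functional networks. 3. Finds similar functional redundancy and polarity patterns in ANs and BNs. 4. Compares how VLMs with different architectures (CLIP and METER) relate to BNs.

Method: Combines fine-grained artificial neuron (AN) analysis with fMRI-based voxel encoding to perform neuron-level analysis of two VLM architectures, CLIP and METER, and study their relationship to biological neurons.

Result: 1. ANs predict BN activity across multiple functional networks, indicating shared representational mechanisms. 2. Both ANs and BNs exhibit functional redundancy. 3. ANs show polarity patterns that parallel those of BNs. 4. The architectures of CLIP and METER drive distinct BNs.

Insight: VLMs show brain-like hierarchical patterns and neuron-level properties in multimodal processing, and architectural design strongly influences how brain-like a model is.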

Abstract: While brain-inspired artificial intelligence (AI) has demonstrated promising results, current understanding of the parallels between artificial neural networks (ANNs) and human brain processing remains limited: (1) unimodal ANN studies fail to capture the brain’s inherent multimodal processing capabilities, and (2) multimodal ANN research primarily focuses on high-level model outputs, neglecting the crucial role of individual neurons. To address these limitations, we propose a novel neuron-level analysis framework that investigates the multimodal information processing mechanisms in vision-language models (VLMs) through the lens of human brain activity. Our approach uniquely combines fine-grained artificial neuron (AN) analysis with fMRI-based voxel encoding to examine two architecturally distinct VLMs: CLIP and METER. Our analysis reveals four key findings: (1) ANs successfully predict the activity of biological neurons (BNs) across multiple functional networks (including language, vision, attention, and default mode), demonstrating shared representational mechanisms; (2) Both ANs and BNs demonstrate functional redundancy through overlapping neural representations, mirroring the brain’s fault-tolerant and collaborative information processing mechanisms; (3) ANs exhibit polarity patterns that parallel the BNs, with oppositely activated BNs showing mirrored activation trends across VLM layers, reflecting the complexity and bidirectional nature of neural information processing; (4) The architectures of CLIP and METER drive distinct BNs: CLIP’s independent branches show modality-specific specialization, whereas METER’s cross-modal design yields unified cross-modal activation, highlighting the architecture’s influence on ANN brain-like properties. These results provide compelling evidence for brain-like hierarchical processing in VLMs at the neuronal level.
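Voxel encoding generally means fitting a regularized linear map from model activations to measured voxel responses and scoring it by the correlation between predicted and actual responses. The sketch below reduces this to the simplest case, one artificial neuron predicting one voxel with ridge regression. The abstract does not specify the regression model or scoring used, so the ridge choice, the function names, and the toy numbers are assumptions.

```python
import math

def fit_ridge_1d(x, y, lam=1.0):
    """Fit y ~ w * x + b with an L2 penalty lam on w; returns (w, b)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    w = sxy / (sxx + lam)
    return w, mean_y - w * mean_x

def pearson(u, v):
    """Pearson correlation, the usual encoding-accuracy score."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Toy data: activations of one artificial neuron and responses of one voxel
# over four stimuli. A real analysis would score on held-out stimuli.
an_activation = [1.0, 2.0, 3.0, 4.0]
voxel_response = [2.0, 4.0, 6.0, 8.0]
w, b = fit_ridge_1d(an_activation, voxel_response, lam=0.0)
predicted = [w * a + b for a in an_activation]
score = pearson(predicted, voxel_response)
```

A high score for a given neuron-voxel pair is the kind of evidence behind the statement that ANs predict BN activity; repeating the fit across layers and brain networks gives the layer-wise and network-wise comparisons described above.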

[128] Class-N-Diff: Classification-Induced Diffusion Model Can Make Fair Skin Cancer Diagnosis

Nusrat Munia,Abdullah Imran

Main category: cs.CV

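TL;DR: The paper proposes Class-N-Diff, a classification-induced diffusion model that integrates a classifier into a diffusion model to generate and classify dermoscopic images simultaneously, improving the realism and diversity of the generated images as well as classification performance.

Details Motivation: Traditional class-conditioned generative models often struggle to generate images that accurately represent specific medical categories, which limits their usefulness in applications such as skin cancer diagnosis.

Contribution: Proposes Class-N-Diff, a classification-induced diffusion model that jointly generates and classifies dermoscopic images, yielding more realistic and diverse synthetic images.

Method: A classifier is integrated within the diffusion model and guides image generation according to class conditions, giving better control over class-conditioned synthesis.

Result: The generated images are more realistic and diverse, and the classifier's performance also improves, supporting its usefulness for downstream diagnostic tasks.

Insight: Integrating a classifier into a diffusion model can improve both generation quality and classification performance, which suggests a useful direction for medical image synthesis and diagnosis.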

Abstract: Generative models, especially Diffusion Models, have demonstrated remarkable capability in generating high-quality synthetic data, including medical images. However, traditional class-conditioned generative models often struggle to generate images that accurately represent specific medical categories, limiting their usefulness for applications such as skin cancer diagnosis. To address this problem, we propose a classification-induced diffusion model, namely, Class-N-Diff, to simultaneously generate and classify dermoscopic images. Our Class-N-Diff model integrates a classifier within a diffusion model to guide image generation based on its class conditions. Thus, the model has better control over class-conditioned image synthesis, resulting in more realistic and diverse images. Additionally, the classifier demonstrates improved performance, highlighting its effectiveness for downstream diagnostic tasks. This unique integration in our Class-N-Diff makes it a robust tool for enhancing the quality and utility of diffusion model-based synthetic dermoscopic image generation. Our code is available at https://github.com/Munia03/Class-N-Diff.

[129] Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

Zongjian Li,Zheyuan Liu,Qihui Zhang,Bin Lin,Shenghai Yuan,Zhiyuan Yan,Yang Ye,Wangbo Yu,Yuwei Niu,Li Yuan

Main category: cs.CV

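TL;DR: The paper introduces Edit-R1, a post-training framework for instruction-based image editing that combines Diffusion Negative-aware Finetuning (DiffusionNFT) with implicit feedback from an MLLM. It targets the overfitting of supervised fine-tuning and the absence of a universal reward model, and UniWorld-V2, trained with this framework, achieves state-of-the-art results.

Details Motivation: Instruction-based image editing models trained solely with supervised fine-tuning tend to overfit to annotated patterns, which limits their ability to generalize beyond the training distribution.

Contribution: Proposes the Edit-R1 framework, which combines DiffusionNFT with MLLM implicit feedback to address limited generalization and the missing reward model, and designs a low-variance group filtering mechanism to reduce MLLM scoring noise.

Method: 1. DiffusionNFT: a likelihood-free policy optimization method that allows higher-order samplers and more efficient training; 2. An MLLM serves as a unified, training-free reward model whose output logits provide fine-grained feedback; 3. A low-variance group filtering mechanism stabilizes optimization.

Result: UniWorld-V2 scores 4.49 on ImgEdit and 7.83 on GEdit-Bench, both state of the art. The framework is model-agnostic and also improves base models such as Qwen-Image-Edit and FLUX-Kontext.

Insight: Combining DiffusionNFT with MLLM feedback improves performance and is model-agnostic, offering a new approach to instruction-driven editing.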

Abstract: Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available at https://github.com/PKU-YuanGroup/UniWorld-V2.
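
The abstract does not spell out how the low-variance group filtering works, so the following is only a minimal sketch of one plausible reading: groups of sampled edits whose MLLM rewards barely differ are skipped, since normalizing such rewards would mostly amplify scoring noise. The function name, the `min_std` threshold, and the data layout are assumptions made for illustration, not the authors' implementation.

```python
import statistics

def filter_low_variance_groups(groups, min_std=0.05):
    """Keep only reward groups whose spread is large enough to be informative.

    groups: dict mapping a prompt id to the list of reward scores of its samples.
    min_std: hypothetical threshold on the population standard deviation.
    """
    kept = {}
    for prompt_id, rewards in groups.items():
        # A group needs at least two samples to have a meaningful spread.
        if len(rewards) >= 2 and statistics.pstdev(rewards) >= min_std:
            kept[prompt_id] = rewards
    return kept

groups = {
    "edit_a": [0.50, 0.51, 0.50],  # near-identical scores: likely noise
    "edit_b": [0.10, 0.90, 0.50],  # clearly separated scores
}
kept = filter_low_variance_groups(groups)
```

In this toy input only `edit_b` survives, because its rewards are spread widely enough to rank the samples.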

[130] Contrail-to-Flight Attribution Using Ground Visible Cameras and Flight Surveillance Data

Ramon Dalmau,Gabriel Jarry,Philippe Very

Main category: cs.CV

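TL;DR: The paper proposes a modular framework that uses ground-based visible cameras and flight surveillance data to attribute contrails to their source flights, supporting the validation and calibration of contrail models.

Details Motivation: Contrails are a major contributor to aviation's non-CO2 climate effects, but validating contrail models requires tracing observed contrails back to the flights that generated them. Satellite imagery has limited spatial and temporal resolution, whereas ground-based cameras capture contrails at high resolution shortly after formation.

Contribution: Proposes a modular framework that matches contrails observed by ground cameras to theoretical contrails derived from flight surveillance and meteorological data, with support for multiple geometric representations and distance metrics.

Method: Built on the GVCCS dataset, the framework accommodates multiple geometric representations and distance metrics, temporal smoothing, and probability-based assignment strategies.

Result: The framework establishes a strong baseline for contrail-to-flight attribution and a flexible basis for future research.

Insight: Ground camera data can compensate for the resolution limits of satellite data, and the modular design makes the method general and extensible.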

Abstract: Aviation’s non-CO2 effects, particularly contrails, are a significant contributor to its climate impact. Persistent contrails can evolve into cirrus-like clouds that trap outgoing infrared radiation, with radiative forcing potentially comparable to or exceeding that of aviation’s CO2 emissions. While physical models simulate contrail formation, evolution and dissipation, validating and calibrating these models requires linking observed contrails to the flights that generated them, a process known as contrail-to-flight attribution. Satellite-based attribution is challenging due to limited spatial and temporal resolution, as contrails often drift and deform before detection. In this paper, we evaluate an alternative approach using ground-based cameras, which capture contrails shortly after formation at high spatial and temporal resolution, when they remain thin, linear, and visually distinct. Leveraging the ground visible camera contrail sequences (GVCCS) dataset, we introduce a modular framework for attributing contrails observed using ground-based cameras to theoretical contrails derived from aircraft surveillance and meteorological data. The framework accommodates multiple geometric representations and distance metrics, incorporates temporal smoothing, and enables flexible probability-based assignment strategies. This work establishes a strong baseline and provides a modular framework for future research in linking contrails to their source flight.
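
To make the idea of probability-based assignment concrete, here is a minimal sketch that converts the distances between one observed contrail and several candidate flights into assignment probabilities using a softmax over negative distances. The softmax form, the `scale` parameter, and the example distances are assumptions for illustration; the abstract does not specify which distance metrics or assignment rules the framework actually uses.

```python
import math

def assignment_probabilities(distances, scale=1.0):
    """Map contrail-to-flight distances to probabilities (smaller distance, higher probability)."""
    if not distances:
        return []
    d_min = min(distances)  # shift by the minimum for numerical stability
    weights = [math.exp(-(d - d_min) / scale) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical distances from one observed contrail to three candidate flights.
distances = [0.8, 2.5, 4.0]
probs = assignment_probabilities(distances, scale=1.0)
best = max(range(len(probs)), key=probs.__getitem__)
```

A larger `scale` flattens the distribution, which may be preferable when the geometric match is uncertain.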

[131] Beyond RGB: Leveraging Vision Transformers for Thermal Weapon Segmentation

Akhila Kambhatla,Ahmed R Khaled

Main category: cs.CV

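TL;DR: The paper studies Vision Transformers for thermal weapon segmentation. Compared with conventional CNNs, ViTs are better at modeling global context and capturing fine structure, and the experiments show that architectures such as SegFormer offer strong accuracy together with flexible accuracy-speed trade-offs.

Details Motivation: Thermal weapon segmentation is important for surveillance under low-light and visually obscured conditions, but CNNs are limited in capturing long-range dependencies and fine structural details. ViTs perform well in RGB segmentation, yet their potential in the thermal domain remains underexplored.

Contribution: Adapts and evaluates four transformer-based architectures (SegFormer, DeepLabV3+, SegNeXt, and Swin Transformer) on thermal weapon segmentation, using a custom, automatically annotated thermal dataset.

Method: The four models are trained and compared within the MMSegmentation framework using standard augmentation strategies. The dataset contains 9,711 thermal images annotated automatically with SAM2.

Result: SegFormer-b5 reaches the highest mIoU (94.15%) and pixel accuracy (97.04%), while SegFormer-b0 is the fastest (98.32 FPS) with 90.84% mIoU. The models generalize robustly to low-light and occluded thermal scenes.

Insight: Vision Transformers are well suited to thermal segmentation and allow a flexible balance between accuracy and speed for diverse real-time security applications.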

Abstract: Thermal weapon segmentation is crucial for surveillance and security applications, enabling robust detection under low-light and visually obscured conditions where RGB-based systems fail. While convolutional neural networks (CNNs) dominate thermal segmentation literature, their ability to capture long-range dependencies and fine structural details is limited. Vision Transformers (ViTs), with their global context modeling capabilities, have achieved state-of-the-art results in RGB segmentation tasks, yet their potential in thermal weapon segmentation remains underexplored. This work adapts and evaluates four transformer-based architectures (SegFormer, DeepLabV3+, SegNeXt, and Swin Transformer) for binary weapon segmentation on a custom thermal dataset comprising 9,711 images collected from real-world surveillance videos and automatically annotated using SAM2. We employ standard augmentation strategies within the MMSegmentation framework to ensure robust model training and fair architectural comparison. Experimental results demonstrate significant improvements in segmentation performance: SegFormer-b5 achieves the highest mIoU (94.15%) and Pixel Accuracy (97.04%), while SegFormer-b0 provides the fastest inference speed (98.32 FPS) with competitive mIoU (90.84%). SegNeXt-mscans offers balanced performance with 85.12 FPS and 92.24% mIoU, and DeepLabV3+ R101-D8 reaches 92.76% mIoU at 29.86 FPS. The transformer architectures demonstrate robust generalization capabilities for weapon detection in low-light and occluded thermal environments, with flexible accuracy-speed trade-offs suitable for diverse real-time security applications.
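
For readers unfamiliar with the two reported metrics, the sketch below computes mean IoU and pixel accuracy for a binary mask using their standard definitions. It is a plain-Python illustration on flattened label lists, not the MMSegmentation evaluation code; the toy masks are invented for the example.

```python
def segmentation_metrics(pred, target, num_classes=2):
    """Return (mean IoU, pixel accuracy) for flat lists of integer class labels."""
    assert len(pred) == len(target) and len(pred) > 0
    pairs = list(zip(pred, target))
    correct = sum(1 for p, t in pairs if p == t)
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in pairs if p == c and t == c)
        union = sum(1 for p, t in pairs if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious), correct / len(pairs)

# Toy flattened masks: 0 = background, 1 = weapon.
pred = [0, 0, 1, 1, 1, 0]
target = [0, 0, 1, 1, 0, 0]
miou, acc = segmentation_metrics(pred, target)
```

Here the background IoU is 3/4 and the weapon IoU is 2/3, so the mean IoU is 17/24, while five of six pixels are labelled correctly.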

[132] Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input

Chenxu Li,Zhicai Wang,Yuan Sheng,Xingyu Zhu,Yanbin Hao,Xiang Wang

Main category: cs.CV

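TL;DR: The paper presents Res-Bench, a benchmark for evaluating how stable the performance of multimodal large language models (MLLMs) remains across input resolutions. It comprises 14,400 samples spanning 12 resolution levels and six core capability dimensions, together with a new evaluation framework and robustness metrics, and is used to analyze model and task robustness, preprocessing strategies, and fine-tuning.

Details Motivation: Current evaluation paradigms focus on semantic performance and overlook resolution robustness, that is, whether performance stays stable as input resolution varies. This gap may limit the use of MLLMs in real dynamic-resolution settings.

Contribution: 1. Introduces the Res-Bench benchmark; 2. Designs a new evaluation framework with robustness metrics (Spearman's correlation and ACE/RCE); 3. Systematically analyzes model-centric and task-centric robustness, preprocessing strategies, and fine-tuning.

Method: 1. Builds a benchmark dataset of 14,400 samples; 2. Introduces Spearman's correlation for resolution-performance trends and Absolute/Relative Continuous Error (ACE/RCE) for performance volatility; 3. Analyzes performance stability across resolutions, preprocessing strategies, and fine-tuning.

Result: A large-scale evaluation quantifies the resolution robustness of leading MLLMs and examines how preprocessing and fine-tuning affect performance stability.

Insight: 1. Resolution robustness is a key challenge for deploying MLLMs in practice; 2. Preprocessing strategies (such as padding and super-resolution) and fine-tuning are candidate routes to better stability; 3. Traditional accuracy metrics alone do not capture robustness.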

Abstract: Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness - whether performance remains stable across varying input resolutions. To address this gap, we introduce Res-Bench, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman’s correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
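
As an illustration of the trend metric, the following sketch computes Spearman's rank correlation between resolution levels and accuracy using the standard definition (Pearson correlation of average ranks). The resolution values and accuracies are made-up numbers, and ACE/RCE are deliberately not implemented because the abstract does not define them.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(vx * vy)

# Hypothetical accuracy of one model at four input resolutions.
resolutions = [224, 336, 448, 672]
accuracy = [0.61, 0.66, 0.70, 0.68]
rho = spearman(resolutions, accuracy)
```

A value near 1 would indicate that accuracy rises monotonically with resolution; a value near 0 means no consistent trend, which by itself says nothing about volatility.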

[133] Foundation Models in Medical Image Analysis: A Systematic Review and Meta-Analysis

Praveenbalaji Rajendran,Mojtaba Safari,Wenfeng He,Mingzhe Hu,Shansong Wang,Jun Zhou,Xiaofeng Yang

Main category: cs.CV

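TL;DR: The paper systematically reviews foundation models (FMs) in medical image analysis and uses a meta-analysis to quantify trends in the field.

Details Motivation: FM research in medical imaging is growing rapidly, but the field remains fragmented and lacks a unified synthesis that maps the evolution of architectures, training paradigms, and clinical applications.

Contribution: Provides a comprehensive categorization and a quantitative meta-analysis of FMs in medical image analysis, characterizes temporal trends in dataset utilization and application domains, and discusses key challenges and future research directions.

Method: Through a systematic review and quantitative meta-analysis, the study divides medical imaging FMs into vision-only and vision-language categories and analyzes their architectural foundations, training strategies, and downstream clinical tasks.

Result: The meta-analysis characterizes temporal trends in dataset utilization and application domains. Persistent challenges include domain adaptation, efficient fine-tuning, computational constraints, and interpretability, with federated learning, knowledge distillation, and advanced prompting discussed as emerging solutions.

Insight: FMs learn generalized representations that adapt to downstream clinical tasks with minimal fine-tuning, but robustness, explainability, and clinical integration need further work before they translate into real-world practice.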

Abstract: Recent advancements in artificial intelligence (AI), particularly foundation models (FMs), have revolutionized medical image analysis, demonstrating strong zero- and few-shot performance across diverse medical imaging tasks, from segmentation to report generation. Unlike traditional task-specific AI models, FMs leverage large corpora of labeled and unlabeled multimodal datasets to learn generalized representations that can be adapted to various downstream clinical applications with minimal fine-tuning. However, despite the rapid proliferation of FM research in medical imaging, the field remains fragmented, lacking a unified synthesis that systematically maps the evolution of architectures, training paradigms, and clinical applications across modalities. To address this gap, this review article provides a comprehensive and structured analysis of FMs in medical image analysis. We systematically categorize studies into vision-only and vision-language FMs based on their architectural foundations, training strategies, and downstream clinical tasks. Additionally, a quantitative meta-analysis of the studies was conducted to characterize temporal trends in dataset utilization and application domains. We also critically discuss persistent challenges, including domain adaptation, efficient fine-tuning, computational constraints, and interpretability along with emerging solutions such as federated learning, knowledge distillation, and advanced prompting. Finally, we identify key future research directions aimed at enhancing the robustness, explainability, and clinical integration of FMs, thereby accelerating their translation into real-world medical practice.

[134] One-step Diffusion Models with Bregman Density Ratio Matching

Yuanzhi Zhu,Eleftherios Tsonis,Lucas Degeorge,Vicky Kalogeiton

Main category: cs.CV

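TL;DR: The paper proposes Di-Bregman, a density-ratio matching framework based on Bregman divergences that accelerates diffusion model sampling by distilling the model into a one-step generator with high-quality outputs.

Details Motivation: Diffusion and flow models are computationally expensive because of multi-step sampling, and most distillation objectives lack a unified theoretical foundation. The paper aims to provide a theoretical framework for efficiently distilling multi-step diffusion models into one-step generators.

Contribution: Proposes the Di-Bregman framework, which connects several existing objectives through Bregman divergence-based density-ratio matching and enables efficient one-step distillation of diffusion models.

Method: A density-ratio matching objective is built from a Bregman divergence and used to distill the multi-step sampling process of a diffusion model into a one-step generator.

Result: On CIFAR-10 and text-to-image generation, Di-Bregman improves one-step FID (Fréchet Inception Distance) over reverse-KL distillation while maintaining high visual fidelity relative to the teacher model.

Insight: Bregman density-ratio matching is a theoretically grounded and practical route to efficient one-step diffusion generation.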

Abstract: Diffusion and flow models achieve high generative quality but remain computationally expensive due to slow multi-step sampling. Distillation methods accelerate them by training fast student generators, yet most existing objectives lack a unified theoretical foundation. In this work, we propose Di-Bregman, a compact framework that formulates diffusion distillation as Bregman divergence-based density-ratio matching. This convex-analytic view connects several existing objectives through a common lens. Experiments on CIFAR-10 and text-to-image generation demonstrate that Di-Bregman achieves improved one-step FID over reverse-KL distillation and maintains high visual fidelity compared to the teacher model. Our results highlight Bregman density-ratio matching as a practical and theoretically-grounded route toward efficient one-step diffusion generation.
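
For reference, the textbook definition of a Bregman divergence and the generic density-ratio matching objective built from it are written out below. This is standard background rather than the paper's exact loss: the symbols $h$, $r^{*}$, $r_\theta$, $p$, and $q$ are notation chosen here, and which distributions play the roles of $p$ and $q$ in Di-Bregman is not stated in the abstract.

```latex
% Bregman divergence generated by a strictly convex, differentiable h
D_h(r \,\|\, s) = h(r) - h(s) - h'(s)\,(r - s)

% Generic density-ratio matching: fit r_\theta to the true ratio r^*(x) = p(x)/q(x)
\min_{\theta}\; \mathbb{E}_{x \sim q}\!\left[ D_h\!\big(r^{*}(x) \,\|\, r_\theta(x)\big) \right]

% Two standard choices of h
h(r) = \tfrac{1}{2}(r - 1)^2 \;\Rightarrow\; D_h(r \,\|\, s) = \tfrac{1}{2}(r - s)^2
h(r) = r\log r - r \;\Rightarrow\; D_h(r \,\|\, s) = r\log\frac{r}{s} - r + s
```

Different choices of $h$ recover different matching losses, which is the sense in which a Bregman view can connect several objectives through a common lens.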

[135] CARE: Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams

Junhao Zhao,Zishuai Liu,Ruili Fang,Jin Lu,Linghan Zhang,Fei Dou

Main category: cs.CV

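TL;DR: CARE is an end-to-end framework for recognizing Activities of Daily Living (ADLs) from event-triggered sensor streams. It jointly optimizes representation learning, via Sequence-Image Contrastive Alignment (SICA), and classification, combining the strengths of sequence encodings and image representations to improve accuracy and robustness.

Details Motivation: Existing methods are limited at the representation level: sequence-based methods are sensitive to noise and lack spatial awareness, while image-based methods lose fine-grained temporal dynamics and distort sensor layouts. Naive fusion fails to exploit their complementary strengths, which motivates an explicit alignment approach.

Contribution: CARE jointly optimizes representation learning through SICA and classification through cross-entropy, combining time-aware sequence encoding with spatially informed, frequency-sensitive image representations to obtain both cross-representation alignment and task-specific discriminability.

Method: CARE integrates time-aware, noise-resilient sequence encoding with spatially informed and frequency-sensitive image representations, and trains end to end with a joint contrastive-classification objective.

Result: CARE achieves state-of-the-art performance on three CASAS datasets (89.8% on Milan, 88.9% on Cairo, and 73.3% on Kyoto7) and is robust to sensor malfunctions and layout variability.

Insight: Contrastive alignment can effectively combine the complementary strengths of sequence and image representations, offering a path to reliable ADL recognition in smart homes.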

Abstract: The recognition of Activities of Daily Living (ADLs) from event-triggered ambient sensors is an essential task in Ambient Assisted Living, yet existing methods remain constrained by representation-level limitations. Sequence-based approaches preserve temporal order of sensor activations but are sensitive to noise and lack spatial awareness, while image-based approaches capture global patterns and implicit spatial correlations but compress fine-grained temporal dynamics and distort sensor layouts. Naive fusion (e.g., feature concatenation) fails to enforce alignment between sequence- and image-based representation views, underutilizing their complementary strengths. We propose Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams (CARE), an end-to-end framework that jointly optimizes representation learning via Sequence-Image Contrastive Alignment (SICA) and classification via cross-entropy, ensuring both cross-representation alignment and task-specific discriminability. CARE integrates (i) time-aware, noise-resilient sequence encoding with (ii) spatially-informed and frequency-sensitive image representations, and employs (iii) a joint contrastive-classification objective for end-to-end learning of aligned and discriminative embeddings. Evaluated on three CASAS datasets, CARE achieves state-of-the-art performance (89.8% on Milan, 88.9% on Cairo, and 73.3% on Kyoto7) and demonstrates robustness to sensor malfunctions and layout variability, highlighting its potential for reliable ADL recognition in smart homes.
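
A joint contrastive-classification objective of the kind described can be sketched as a cross-entropy term plus a symmetric InfoNCE-style alignment term between the two views. The version below is a dependency-free illustration under assumptions: the symmetric InfoNCE form, the `temperature` and `weight` values, and the toy embeddings are all chosen here and may differ from the actual SICA loss.

```python
import math

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch of logit rows and integer labels."""
    return -sum(log_softmax(l)[y] for l, y in zip(logits, labels)) / len(labels)

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def contrastive_alignment(seq_emb, img_emb, temperature=0.1):
    """Symmetric InfoNCE: the i-th sequence view should match the i-th image view."""
    s = [l2_normalize(v) for v in seq_emb]
    g = [l2_normalize(v) for v in img_emb]
    sim = [[sum(a * b for a, b in zip(si, gj)) / temperature for gj in g] for si in s]
    sim_t = [list(col) for col in zip(*sim)]
    targets = list(range(len(s)))
    return 0.5 * (cross_entropy(sim, targets) + cross_entropy(sim_t, targets))

def joint_loss(seq_emb, img_emb, class_logits, labels, weight=1.0):
    """Classification loss plus weighted alignment loss (weight is an assumed hyperparameter)."""
    return cross_entropy(class_logits, labels) + weight * contrastive_alignment(seq_emb, img_emb)

seq = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_alignment(seq, [[1.0, 0.0], [0.0, 1.0]])
misaligned = contrastive_alignment(seq, [[0.0, 1.0], [1.0, 0.0]])
total = joint_loss(seq, seq, [[2.0, 0.0], [0.0, 2.0]], [0, 1])
```

The alignment term is small when paired views agree and large when they are swapped, which is the behaviour that plain feature concatenation does not enforce.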

[136] Training-free Online Video Step Grounding

Luca Zanella,Massimiliano Mancini,Yiming Wang,Alessio Tonioni,Elisa Ricci

Main category: cs.CV

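TL;DR: The paper proposes a training-free approach to Video Step Grounding (VSG) that uses the zero-shot capabilities of Large Multimodal Models (LMMs) to predict steps online, and introduces BaGLM, a Bayesian filtering method that further improves performance.

Details Motivation: Standard VSG methods require labeled training data and process the full video offline, which is costly and unsuitable for scenarios that need online decisions. The authors therefore explore training-free, online processing.

Contribution: 1) A training-free, online approach to VSG; 2) BaGLM, which combines Bayesian filtering with knowledge extracted from large language models; 3) Results on three datasets that surpass training-based offline methods.

Method: LMMs predict the step for a restricted set of frames in a zero-shot manner. BaGLM refines these predictions with Bayesian filtering, modeling step transitions through a dependency matrix extracted with large language models and an estimate of step progress.

Result: Without training and while operating online, BaGLM outperforms state-of-the-art training-based offline methods.

Insight: The zero-shot capabilities of LMMs are strong, and combining them with Bayesian filtering can further improve performance on online tasks.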

Abstract: Given a task and a set of steps composing it, Video Step Grounding (VSG) aims to detect which steps are performed in a video. Standard approaches for this task require a labeled training set (e.g., with step-level annotations or narrations), which may be costly to collect. Moreover, they process the full video offline, limiting their applications for scenarios requiring online decisions. Thus, in this work, we explore how to perform VSG online and without training. We achieve this by exploiting the zero-shot capabilities of recent Large Multimodal Models (LMMs). In particular, we use LMMs to predict the step associated with a restricted set of frames, without access to the whole video. We show that this online strategy without task-specific tuning outperforms offline and training-based models. Motivated by this finding, we develop Bayesian Grounding with Large Multimodal Models (BaGLM), further injecting knowledge of past frames into the LMM-based predictions. BaGLM exploits Bayesian filtering principles, modeling step transitions via (i) a dependency matrix extracted through large language models and (ii) an estimation of step progress. Experiments on three datasets show superior performance of BaGLM over state-of-the-art training-based offline methods.
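
The Bayesian filtering principle mentioned above can be illustrated with a single predict-update step over a discrete set of steps. This is the generic discrete Bayes filter, not BaGLM itself: the transition matrix and the per-step likelihoods are invented numbers standing in for the LLM-derived dependency matrix and the LMM's frame-level scores, and step-progress estimation is omitted.

```python
def bayes_filter_step(prior, transition, likelihood):
    """One predict-update step of a discrete Bayes filter.

    prior[i]: belief that step i was active at the previous time.
    transition[i][j]: probability of moving from step i to step j.
    likelihood[j]: how well the current frames match step j.
    """
    n = len(prior)
    # Predict: propagate the previous belief through the transition model.
    predicted = [sum(prior[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Update: weight by the observation likelihood and renormalize.
    unnorm = [predicted[j] * likelihood[j] for j in range(n)]
    z = sum(unnorm)
    if z == 0:
        return predicted  # observation carries no usable evidence
    return [u / z for u in unnorm]

prior = [1.0, 0.0, 0.0]           # the task starts at step 0
transition = [[0.7, 0.3, 0.0],    # hypothetical ordering: 0 -> 1 -> 2
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]]
likelihood = [0.2, 0.7, 0.1]      # hypothetical frame-level scores
posterior = bayes_filter_step(prior, transition, likelihood)
```

The posterior here is approximately `[0.4, 0.6, 0.0]`: the observation favours step 1, while step 2 stays at zero because it cannot be reached directly from step 0 under the assumed transitions.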

[137] An empirical study of the effect of video encoders on Temporal Video Grounding

Ignacio M. De la Jara,Cristian Rodriguez-Opazo,Edison Marrese-Taylor,Felipe Bravo-Marquez

Main category: cs.CV

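TL;DR: This empirical study examines how the choice of video encoder affects temporal video grounding. It finds significant performance differences across encoders and points to potential feature complementarity.

Details Motivation: Temporal video grounding is a fundamental computer vision task, but existing research concentrates on a small selection of video representations, which may lead to architectural overfitting in the long run. The paper investigates how different video features affect the task.

Contribution: 1. An empirical study of different video encoders; 2. Features extracted for three benchmark datasets, revealing performance differences and the potential for feature complementarity.

Method: Features for Charades-STA, ActivityNet-Captions, and YouCookII are extracted with video encoders based on CNNs, temporal reasoning, and transformers, and their impact is tested on a classical architecture.

Result: Changing only the video encoder produces significant differences in model performance, and the results reveal clear patterns and errors associated with particular features.

Insight: Features from different video encoders may be complementary, so future work could explore feature fusion to improve performance.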

Abstract: Temporal video grounding is a fundamental task in computer vision, aiming to localize a natural language query in a long, untrimmed video. It has a key role in the scientific community, in part due to the large amount of video generated every day. Although we find extensive work in this task, we note that research remains focused on a small selection of video representations, which may lead to architectural overfitting in the long run. To address this issue, we propose an empirical study to investigate the impact of different video features on a classical architecture. We extract features for three well-known benchmarks, Charades-STA, ActivityNet-Captions and YouCookII, using video encoders based on CNNs, temporal reasoning and transformers. Our results show significant differences in the performance of our model by simply changing the video encoder, while also revealing clear patterns and errors derived from the use of certain features, ultimately indicating potential feature complementarity.

[138] Do Satellite Tasks Need Special Pretraining?

Ani Vanyan,Alvard Barseghyan,Hakob Tamazyan,Tigran Galstyan,Vahan Huroyan,Naira Hovakimyan,Hrant Khachatrian

Main category: cs.CV

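TL;DR: The paper asks whether satellite tasks need specially pretrained foundation models and shows experimentally that, at small scale, specialized models bring no consistent gains over general-purpose vision foundation models.

Details Motivation: To test whether satellite tasks require foundation models designed specifically for remote sensing data rather than general-purpose vision foundation models.

Contribution: Designs a simple benchmark that measures how remote sensing models generalize to lower-resolution images on two downstream tasks, and shows that general-purpose baselines remain competitive at the ViT-B scale.

Method: Trains iBOT, a self-supervised vision encoder, on the MillionAID dataset with several modifications specific to remote sensing.

Result: None of the specialized pretrained models brings consistent improvements over general-purpose baselines at the ViT-B scale.

Insight: At small scale, general-purpose foundation models show strong potential, which may reduce the need for specialized models.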

Abstract: Foundation models have advanced machine learning across various modalities, including images. Recently, multiple teams have trained foundation models specialized for remote sensing applications. This line of research is motivated by the distinct characteristics of remote sensing imagery, specific applications and types of robustness useful for satellite image analysis. In this work we systematically challenge the idea that specific foundation models are more useful than general-purpose vision foundation models, at least at small scale. First, we design a simple benchmark that measures generalization of remote sensing models towards images with lower resolution for two downstream tasks. Second, we train iBOT, a self-supervised vision encoder, on MillionAID, an ImageNet-scale satellite imagery dataset, with several modifications specific to remote sensing. We show that none of those pretrained models bring consistent improvements upon general-purpose baselines at the ViT-B scale.

[139] Enrich and Detect: Video Temporal Grounding with Multimodal LLMs

Shraman Pramanick,Effrosyni Mavroudi,Yale Song,Rama Chellappa,Lorenzo Torresani,Triantafyllos Afouras

Main category: cs.CV

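TL;DR: The paper proposes ED-VTG, a method that uses multimodal large language models (LLMs) for fine-grained video temporal grounding, localizing queries through a two-stage enrich-then-detect process.

Details Motivation: Video temporal grounding is often limited by queries that lack detail or contain noise, which calls for a method that enriches the query and then localizes it effectively.

Contribution: Proposes ED-VTG, which enriches queries with a multimodal LLM and localizes them with a lightweight decoder, achieving state-of-the-art performance.

Method: A two-stage process: 1) query enrichment, which adds missing details and cues; 2) a lightweight decoder that predicts accurate boundaries conditioned on contextualized representations of the enriched query. A multiple-instance-learning objective reduces the impact of noise and hallucinations.

Result: ED-VTG achieves state-of-the-art results on several benchmarks, significantly outperforms previous LLM-based approaches, and performs especially well in zero-shot settings.

Insight: Query enrichment can substantially improve grounding, and a lightweight decoder conditioned on contextualized representations is key to accurate temporal localization.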

Abstract: We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models. Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video, in order to effectively localize natural language queries in videos through a two-stage process. Rather than being directly grounded, language queries are initially transformed into enriched sentences that incorporate missing details and cues to aid in grounding. In the second stage, these enriched queries are grounded, using a lightweight decoder, which specializes in predicting accurate boundaries conditioned on contextualized representations of the enriched queries. To mitigate noise and reduce the impact of hallucinations, our model is trained with a multiple-instance-learning objective that dynamically selects the optimal version of the query for each training sample. We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings. Experiments reveal that our method significantly outperforms all previously proposed LLM-based temporal grounding approaches and is either superior or comparable to specialized models, while maintaining a clear advantage against them in zero-shot evaluation scenarios.

[140] Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding

Yutong Zhong

Main category: cs.CV

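TL;DR: The paper proposes W2R2, a training framework that uses disentangled representation learning and targeted shortcut suppression to address the "2D semantic bias" of Vision-Language Models (VLMs) in 3D spatial reasoning, improving 3D localization accuracy.

Details Motivation: Existing multimodal 3D grounding methods rely too heavily on 2D image features and largely disregard 3D geometric inputs, which leads to suboptimal performance. W2R2 addresses this by disentangling semantic ("What") and spatial ("Where") representations.

Contribution: 1. Proposes the W2R2 framework, which improves 3D grounding by disentangling 2D semantic and 3D spatial representations. 2. Designs a dual-objective loss (Alignment Loss and Pseudo-Label Loss) that suppresses 2D-dominant pseudo-outputs.

Method: 1. 2D features serve as semantic beacons for "What" identification and 3D features as spatial anchors for "Where" localization. 2. The Alignment Loss supervises the fused multimodal predictions, and the Pseudo-Label Loss penalizes 2D-dominant pseudo-outputs.

Result: Experiments on ScanRefer and ScanQA show that W2R2 significantly improves localization accuracy and robustness, particularly in cluttered outdoor scenes.

Insight: Disentangling semantic and spatial representations effectively suppresses the 2D bias and improves 3D grounding, and the dual-objective loss design is central to this.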

Abstract: Multimodal 3D grounding has garnered considerable interest in Vision-Language Models (VLMs) \cite{yin2025spatial} for advancing spatial reasoning in complex environments. However, these models suffer from a severe “2D semantic bias” that arises from over-reliance on 2D image features for coarse localization, largely disregarding 3D geometric inputs and resulting in suboptimal fusion performance. In this paper, we propose a novel training framework called What-Where Representation Re-Forming (W2R2) to tackle this issue via disentangled representation learning and targeted shortcut suppression. Our approach fundamentally reshapes the model’s internal space by designating 2D features as semantic beacons for “What” identification and 3D features as spatial anchors for “Where” localization, enabling precise 3D grounding without modifying inference architecture. Key components include a dual-objective loss function with an Alignment Loss that supervises fused predictions using adapted cross-entropy for multimodal synergy, and a Pseudo-Label Loss that penalizes overly effective 2D-dominant pseudo-outputs via a margin-based mechanism. Experiments conducted on ScanRefer and ScanQA demonstrate the effectiveness of W2R2, with significant gains in localization accuracy and robustness, particularly in cluttered outdoor scenes.

[141] Conditional Synthetic Live and Spoof Fingerprint Generation

Syed Konain Abbas,Sandip Purnapatra,M. G. Sarwar Murshed,Conor Miller-Lynch,Lambert Igene,Soumyabrata Dey,Stephanie Schuckers,Faraz Hussain

Main category: cs.CV

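TL;DR: The paper presents an approach based on conditional generative adversarial networks (GANs) for generating high-resolution synthetic fingerprint images, both live and spoof, addressing privacy, cost, and accessibility concerns in biometric data collection.

Details Motivation: Large fingerprint datasets are time-consuming and expensive to collect and require strict privacy measures, so researchers are exploring synthetic fingerprint data as an alternative that also supports the development of robust spoof detection systems.

Contribution: 1) Generates high-quality synthetic fingerprints with conditional StyleGAN2-ADA and StyleGAN3 architectures; 2) Uses CycleGANs to generate spoof fingerprints that simulate a variety of presentation attack materials; 3) Creates two synthetic datasets (DB2 and DB3), each with 1,500 fingerprint images covering all ten fingers with multiple impressions per finger, plus corresponding spoofs in eight material types.

Method: 1) StyleGAN2-ADA and StyleGAN3 generate synthetic live fingerprints conditioned on finger identity; 2) CycleGANs translate the live fingerprints into spoof fingerprints of various materials.

Result: The StyleGAN3 model reaches an FID as low as 5, and its generated fingerprints achieve a TAR of 99.47% at 0.01% FAR; the StyleGAN2-ADA model achieves a TAR of 98.67% at the same FAR. Matching experiments show no significant evidence of identity leakage.

Insight: Synthetic fingerprint data can stand in for real data where privacy and cost are obstacles, while providing high-quality training data for biometric recognition and spoof detection.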

Abstract: Large fingerprint datasets, while important for training and evaluation, are time-consuming and expensive to collect and require strict privacy measures. Researchers are exploring the use of synthetic fingerprint data to address these issues. This paper presents a novel approach for generating synthetic fingerprint images (both spoof and live), addressing concerns related to privacy, cost, and accessibility in biometric data collection. Our approach utilizes conditional StyleGAN2-ADA and StyleGAN3 architectures to produce high-resolution synthetic live fingerprints, conditioned on specific finger identities (thumb through little finger). Additionally, we employ CycleGANs to translate these into realistic spoof fingerprints, simulating a variety of presentation attack materials (e.g., EcoFlex, Play-Doh). These synthetic spoof fingerprints are crucial for developing robust spoof detection systems. Through these generative models, we created two synthetic datasets (DB2 and DB3), each containing 1,500 fingerprint images of all ten fingers with multiple impressions per finger, and including corresponding spoofs in eight material types. The results indicate robust performance: our StyleGAN3 model achieves a Fréchet Inception Distance (FID) as low as 5, and the generated fingerprints achieve a True Accept Rate (TAR) of 99.47% at a 0.01% False Accept Rate (FAR). The StyleGAN2-ADA model achieved a TAR of 98.67% at the same 0.01% FAR. We assess fingerprint quality using standard metrics (NFIQ2, MINDTCT), and notably, matching experiments show no significant evidence of identity leakage, confirming the strong privacy-preserving properties of our synthetic datasets.
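
The TAR-at-FAR figures quoted above follow a standard biometric evaluation recipe: pick the match-score threshold at which impostor comparisons are accepted at the target rate, then measure how many genuine comparisons pass it. The sketch below illustrates that recipe on invented scores; the function name and the strict "score above threshold" acceptance rule are assumptions, and the paper's matcher and protocol are not reproduced.

```python
def tar_at_far(genuine_scores, impostor_scores, target_far):
    """Return (TAR, achieved FAR, threshold) with FAR not exceeding target_far.

    A comparison is accepted when its score is strictly above the threshold.
    """
    imp = sorted(impostor_scores, reverse=True)
    allowed = int(target_far * len(imp))  # impostor accepts we can tolerate
    threshold = imp[allowed] if allowed < len(imp) else float("-inf")
    tar = sum(1 for s in genuine_scores if s > threshold) / len(genuine_scores)
    far = sum(1 for s in impostor_scores if s > threshold) / len(impostor_scores)
    return tar, far, threshold

# Invented similarity scores for illustration only.
genuine = [0.5, 0.6, 0.7, 0.9, 0.3]
impostor = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.8]
tar, far, threshold = tar_at_far(genuine, impostor, target_far=0.1)
```

Measuring a FAR as low as 0.01% requires at least 10,000 impostor comparisons, since with fewer the tolerated number of false accepts rounds down to zero.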

[142] Video Reasoning without Training

Deepak Sridhar,Kartikeya Bhardwaj,Jeya Pradha Jeyaraj,Nuno Vasconcelos,Ankita Nayak,Harris Teague

Main category: cs.CV

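TL;DR: The paper proposes V-Reason, a training-free video reasoning method that uses an entropy signal to tune the model's reasoning behavior at inference time, improving accuracy while substantially reducing computational overhead.

Details Motivation: Video reasoning with LMMs relies on costly reinforcement learning and verbose chain-of-thought, which incurs high computational cost, and the mechanisms for controlling the thinking process are very limited.

Contribution: 1. Shows that high-quality models keep their reasoning grounded through a series of micro-explorations and micro-exploitations; 2. Proposes V-Reason, which uses an entropy signal to tune the model's reasoning behavior directly at inference, without RL or supervised fine-tuning; 3. Demonstrates significant accuracy gains together with large reductions in output length.

Method: Using the entropy of the model's output as a signal, a small trainable controller is optimized for a few steps at inference time with an entropy-based objective, adapting the LMM's value cache to improve micro-exploration and micro-exploitation.

Result: Across several video reasoning datasets, V-Reason narrows the gap with RL-trained models to within 0.6% average accuracy while reducing output tokens by 58.6% compared to the RL model.

Insight: The reasoning of high-quality models proceeds in phases (exploration followed by convergence), and the entropy signal can effectively guide the tuning of reasoning behavior.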

Abstract: Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, using entropy of the model’s output as a signal, we discover that the high-quality models go through a series of micro-explorations and micro-exploitations which keep the reasoning process grounded (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We further observe that once this “thinking” process is over, more accurate models demonstrate a better convergence by reducing the entropy significantly via a final exploitation phase (i.e., a more certain convergence towards a solution trajectory). We then use these novel, theoretically-grounded insights to tune the model’s behavior directly at inference, without using any RL or supervised fine-tuning. Specifically, during inference, our proposed approach called V-Reason (Video-Reason) adapts the value cache of the LMM via a few optimization steps on a small, trainable controller using an entropy-based objective, i.e., no supervision from any dataset or RL is necessary. This tuning improves the model’s micro-exploration and exploitation behavior during inference. Our experiments show that our proposed method achieves significant improvements over the base instruction-tuned models across several video reasoning datasets, narrowing the gap with RL-trained models to within 0.6% average accuracy without any training, while offering massive efficiency benefits: output tokens are reduced by 58.6% compared to the RL model.
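
The signal at the centre of this analysis is the entropy of the model's output distribution at each decoding step. As a minimal sketch, the function below computes the Shannon entropy of a softmax over a vector of logits; the toy logits are invented, and the controller that adapts the value cache is not reproduced here.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over next-token logits."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = token_entropy([0.0, 0.0, 0.0, 0.0])   # maximally uncertain over 4 tokens
peaked = token_entropy([10.0, 0.0, 0.0, 0.0])   # nearly certain of one token
```

High values correspond to exploratory steps and low values to confident, exploitative ones, so tracking this quantity over a generation gives the kind of trace the abstract describes.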

[143] How Universal Are SAM2 Features?

Masoud Khairi Atani,Alon Harell,Hyomin Choi,Runyu Yang,Fabien Racape,Ivan V. Bajic

Main category: cs.CV

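TL;DR: The paper studies the trade-off in feature versatility between a general-purpose vision foundation model (Hiera) and a segmentation-specialized model (SAM2). SAM2 performs well on spatially related tasks such as depth estimation but worse on conceptually distant tasks such as pose estimation and image captioning.

Details Motivation: To understand the trade-off in feature generality between general-purpose and specialized vision models and provide a quantitative basis for efficient feature coding design.

Contribution: 1. Compares the feature adaptability of Hiera and SAM2 across tasks; 2. Shows that SAM2 features are strong on spatially related tasks and weaker on semantic tasks; 3. Introduces a cross-neck analysis that reveals further representational bottlenecks at each level of adaptation.

Method: A lightweight, trainable neck probes the adaptability of frozen features, and the information-theoretic cost of specialization is quantified.

Result: SAM2 performs well on spatial tasks such as depth estimation but underperforms Hiera on pose estimation and image captioning, indicating a measurable loss of broader semantic information.

Insight: Specialized models excel on their target tasks at the expense of general semantic information, a trade-off that should inform feature coding design.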

Abstract: The trade-off between general-purpose foundation vision models and their specialized counterparts is critical for efficient feature coding design and is not yet fully understood. We investigate this trade-off by comparing the feature versatility of the general-purpose Hiera encoder against the segmentation-specialized Segment Anything Model 2 (SAM2). Using a lightweight, trainable neck to probe the adaptability of their frozen features, we quantify the information-theoretic cost of specialization. Our results reveal that while SAM2’s specialization is highly effective for spatially-related tasks like depth estimation, it comes at a cost. The specialized SAM2 encoder underperforms its generalist predecessor, Hiera, on conceptually distant tasks such as pose estimation and image captioning, demonstrating a measurable loss of broader semantic information. A novel cross-neck analysis on SAM2 reveals that each level of adaptation creates a further representational bottleneck. Our analysis illuminates these trade-offs in feature universality, providing a quantitative foundation for designing efficient feature coding and adaptation strategies for diverse downstream applications.

[144] ProDAT: Progressive Density-Aware Tail-Drop for Point Cloud Coding

Zhe Luo,Wenjing Jia,Stuart Perry

Main category: cs.CV

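TL;DR: The paper proposes ProDAT, a progressive density-aware tail-drop mechanism for progressive point cloud coding that improves coding efficiency and supports decoding at multiple bitrates.

Details Motivation: Point clouds are in high demand in applications such as autonomous driving and augmented reality, but large data volumes and bandwidth constraints make high-quality services hard to deploy. Existing learning-based methods do not support progressive decoding.

Contribution: Proposes ProDAT, which uses density information to guide adaptive decoding, enabling progressive decoding at multiple bitrates with a single model and achieving better coding efficiency than existing methods.

Method: Density information serves as a guidance signal for adaptively decoding latent features and coordinates, enabling progressive decoding.

Result: BD-rate improvements for PSNR-D2 of over 28.6% on SemanticKITTI and over 18.15% on ShapeNet.

Insight: Density information is an effective guidance signal for point cloud coding, and a single model can achieve efficient progressive decoding.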

Abstract: Three-dimensional (3D) point clouds are becoming increasingly vital in applications such as autonomous driving, augmented reality, and immersive communication, demanding real-time processing and low latency. However, their large data volumes and bandwidth constraints hinder the deployment of high-quality services in resource-limited environments. Progressive coding, which allows for decoding at varying levels of detail, provides an alternative by allowing initial partial decoding with subsequent refinement. Although recent learning-based point cloud geometry coding methods have achieved notable success, their fixed latent representation does not support progressive decoding. To bridge this gap, we propose ProDAT, a novel density-aware tail-drop mechanism for progressive point cloud coding. By leveraging density information as a guidance signal, latent features and coordinates are decoded adaptively based on their significance, therefore achieving progressive decoding at multiple bitrates using one single model. Experimental results on benchmark datasets show that the proposed ProDAT not only enables progressive coding but also achieves superior coding efficiency compared to state-of-the-art learning-based coding techniques, with over 28.6% BD-rate improvement for PSNR-D2 on SemanticKITTI and over 18.15% for ShapeNet.
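
As a rough illustration of the tail-drop idea, the sketch below truncates a latent vector at several quality levels. It assumes the channels are already ordered from most to least significant; how ProDAT derives that ordering from density is not specified in the abstract, so the density-aware part is omitted and all names are hypothetical.

```python
def tail_drop(latent, keep):
    """Keep the first `keep` channels and zero out the tail."""
    if not 0 <= keep <= len(latent):
        raise ValueError("keep must be between 0 and len(latent)")
    return latent[:keep] + [0.0] * (len(latent) - keep)

def progressive_versions(latent, keep_levels):
    """One truncated latent per quality level, from coarse to fine."""
    return [tail_drop(latent, k) for k in sorted(keep_levels)]

# Toy latent (assumed data), most significant channel first.
latent = [0.9, -0.4, 0.25, 0.1, -0.05, 0.02]
levels = progressive_versions(latent, keep_levels=[2, 4, 6])
```

Each entry in `levels` has the same length as the original latent, so a single decoder could in principle consume any of them; the coarse versions simply carry fewer nonzero channels.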

[145] Towards a Generalizable Fusion Architecture for Multimodal Object Detection

Jad Berjawi,Yoann Dupas,Christophe Cérin

Main category: cs.CV

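TL;DR: The paper proposes FMCAF, a generalizable fusion architecture for multimodal object detection that uses frequency-domain filtering and a cross-attention fusion module to improve the fusion of RGB and infrared data, outperforming traditional fusion on multiple datasets.

Details Motivation: Multimodal object detection relies on complementary information from different sensors to improve robustness under challenging conditions, but existing methods are usually designed for specific datasets and lack generality.

Contribution: Proposes the Filtered Multi-Modal Cross Attention Fusion (FMCAF) architecture, which combines frequency-domain filtering with a cross-attention module for efficient multimodal feature fusion.

Method: FMCAF consists of a frequency-domain filtering block (Freq-Filter) that suppresses redundant spectral features and a cross-attention fusion module (MCAF) that improves intermodal feature sharing.

Result: Gains of +1.1% mAP@50 on LLVIP and +13.9% mAP@50 on VEDAI over traditional concatenation fusion.

Insight: FMCAF shows the potential of a generalizable fusion architecture that improves multimodal detection without dataset-specific tuning.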

Abstract: Multimodal object detection improves robustness in challenging conditions by leveraging complementary cues from multiple sensor modalities. We introduce Filtered Multi-Modal Cross Attention Fusion (FMCAF), a preprocessing architecture designed to enhance the fusion of RGB and infrared (IR) inputs. FMCAF combines a frequency-domain filtering block (Freq-Filter) to suppress redundant spectral features with a cross-attention-based fusion module (MCAF) to improve intermodal feature sharing. Unlike approaches tailored to specific datasets, FMCAF aims for generalizability, improving performance across different multimodal challenges without requiring dataset-specific tuning. On LLVIP (low-light pedestrian detection) and VEDAI (aerial vehicle detection), FMCAF outperforms traditional fusion (concatenation), achieving +13.9% mAP@50 on VEDAI and +1.1% on LLVIP. These results support the potential of FMCAF as a flexible foundation for robust multimodal fusion in future detection pipelines.

[146] Boosting Fidelity for Pre-Trained-Diffusion-Based Low-Light Image Enhancement via Condition Refinement

Xiaogang Xu,Jian Wang,Yunfan Lu,Ruihang Chu,Ruixing Wang,Jiafei Wu,Bei Yu,Liang Lin

Main category: cs.CV

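TL;DR: The paper proposes a method for optimizing the conditioning of pre-trained diffusion models. Condition refinement improves content fidelity in low-light image enhancement while preserving realism and aesthetics.

Details Motivation: Current Pre-Trained Diffusion-Based (PTDB) methods lose content fidelity in low-light image enhancement because they lack suitable conditional latent modeling and a bidirectional interaction mechanism.

Contribution: 1. Proposes a condition refinement strategy that improves the fidelity of PTDB methods; 2. Introduces a latent refinement pipeline that recovers spatial details lost during VAE encoding; 3. Implements dynamic interaction between the conditional latent and the noisy latent, improving low-light restoration.

Method: 1. A latent refinement pipeline incorporating generative priors recovers lost spatial details; 2. A dynamic bidirectional interaction mechanism optimizes the interaction between the conditional latent and the noisy latent.

Result: Experiments show that the method significantly improves the fidelity of PTDB methods in low-light image enhancement.

Insight: In diffusion models, conditional latent modeling and dynamic interaction are crucial to task performance, especially in difficult scenarios such as low light.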

Abstract: Diffusion-based methods, leveraging pre-trained large models like Stable Diffusion via ControlNet, have achieved remarkable performance in several low-level vision tasks. However, Pre-Trained Diffusion-Based (PTDB) methods often sacrifice content fidelity to attain higher perceptual realism. This issue is exacerbated in low-light scenarios, where severely degraded information caused by the darkness limits effective control. We identify two primary causes of fidelity loss: the absence of suitable conditional latent modeling and the lack of bidirectional interaction between the conditional latent and noisy latent in the diffusion process. To address this, we propose a novel optimization strategy for conditioning in pre-trained diffusion models, enhancing fidelity while preserving realism and aesthetics. Our method introduces a mechanism to recover spatial details lost during VAE encoding, i.e., a latent refinement pipeline incorporating generative priors. Additionally, the refined latent condition interacts dynamically with the noisy latent, leading to improved restoration performance. Our approach is plug-and-play, seamlessly integrating into existing diffusion networks to provide more effective control. Extensive experiments demonstrate significant fidelity improvements in PTDB methods.

[147] Towards Imperceptible Watermarking Via Environment Illumination for Consumer Cameras

Hodaka Kawachi,Tomoya Nakamura,Hiroaki Santo,SaiKiran Kumar Tedla,Trevor Dalton Canham,Yasushi Yagi,Michael S. Brown

Main category: cs.CV

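TL;DR: This paper proposes a method that uses LED-based environmental lighting to produce visually imperceptible watermarks, designed for consumer cameras.

Details Motivation: Conventional watermarking can be visually intrusive for users. This method aims to use environmental lighting to produce watermarks that are nearly invisible to the human eye yet highly detectable by cameras.

Contribution: The main contribution is a spectral-modulation-based method for optimizing an LED light source that enables effective watermark detection by consumer cameras without visually disturbing users.

Method: The method jointly considers the human visual system’s sensitivity to visible spectra, the spectral sensitivity of consumer camera sensors, and the ability of narrowband LEDs to generate broadband spectra. Spectral modulation is used instead of intensity modulation to ensure imperceptibility.

Result: The method supports watermark extraction at standard low frame rates (30-60 fps) with a modest information transfer rate (128 bits embedded in a 10-second video), which suits applications such as privacy protection and content verification.

Insight: Embedding watermarks through environmental lighting rather than direct image modification offers a new technical route for privacy protection and content verification.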

Abstract: This paper introduces a method for using LED-based environmental lighting to produce visually imperceptible watermarks for consumer cameras. Our approach optimizes an LED light source’s spectral profile to be minimally visible to the human eye while remaining highly detectable by typical consumer cameras. The method jointly considers the human visual system’s sensitivity to visible spectra, modern consumer camera sensors’ spectral sensitivity, and narrowband LEDs’ ability to generate broadband spectra perceived as “white light” (specifically, D65 illumination). To ensure imperceptibility, we employ spectral modulation rather than intensity modulation. Unlike conventional visible light communication, our approach enables watermark extraction at standard low frame rates (30-60 fps). While the information transfer rate is modest (embedding 128 bits within a 10-second video clip), this capacity is sufficient for essential metadata supporting privacy protection and content verification.

[148] GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection

Xin Gao,Jiyao Liu,Guanghao Li,Yueming Lyu,Jianxiong Gao,Weichen Yu,Ningsheng Xu,Liang Wang,Caifeng Shan,Ziwei Liu,Chenyang Si

Main category: cs.CV

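TL;DR: GOOD is a training-free method that uses an off-the-shelf in-distribution classifier to guide diffusion sampling trajectories toward out-of-distribution (OOD) samples, with dual image-level and feature-level guidance improving OOD detection.

Details Motivation: Existing OOD sample generation methods that perturb text-conditioned embeddings suffer from semantic instability and insufficient diversity, which limits generalization to realistic OOD scenarios.

Contribution: Proposes the GOOD framework, which generates more diverse and controllable OOD samples through dual image-level and feature-level guidance, and designs a unified OOD score.

Method: Combines image-level guidance, based on the gradient of the log partition to reduce input likelihood, with feature-level guidance, based on k-NN distance to promote sampling in sparse regions, to achieve dual-guided diffusion sampling.

Result: Experiments show that training with samples generated by GOOD notably improves OOD detection performance.

Insight: The dual-guidance design gives OOD sample generation more controllability and diversity, and the unified OOD score makes detection more robust.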

Abstract: Recent advancements have explored text-to-image diffusion models for synthesizing out-of-distribution (OOD) samples, substantially enhancing the performance of OOD detection. However, existing approaches typically rely on perturbing text-conditioned embeddings, resulting in semantic instability and insufficient shift diversity, which limit generalization to realistic OOD. To address these challenges, we propose GOOD, a novel and flexible framework that directly guides diffusion sampling trajectories towards OOD regions using off-the-shelf in-distribution (ID) classifiers. GOOD incorporates dual-level guidance: (1) Image-level guidance, based on the gradient of the log partition to reduce input likelihood, drives samples toward low-density regions in pixel space. (2) Feature-level guidance, derived from k-NN distance in the classifier’s latent space, promotes sampling in feature-sparse regions. This dual-guidance design enables more controllable and diverse OOD sample generation. Additionally, we introduce a unified OOD score that adaptively combines image and feature discrepancies, enhancing detection robustness. We perform thorough quantitative and qualitative analyses to evaluate the effectiveness of GOOD, demonstrating that training with samples generated by GOOD can notably enhance OOD detection performance.
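
The feature-level signal can be sketched as a plain k-NN distance in feature space. This is a generic illustration under stated assumptions, not GOOD's guidance term: the feature bank, the choice of Euclidean distance, and the function name are all hypothetical.

```python
import math

def knn_distance(query, bank, k):
    """Euclidean distance from `query` to its k-th nearest neighbour in `bank`."""
    if not 1 <= k <= len(bank):
        raise ValueError("k must be between 1 and len(bank)")
    dists = sorted(math.dist(query, feature) for feature in bank)
    return dists[k - 1]

# Toy in-distribution feature bank (assumed data): four points on a unit square.
bank = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

near_score = knn_distance((0.5, 0.5), bank, k=2)  # inside the dense region
far_score = knn_distance((5.0, 5.0), bank, k=2)   # in a feature-sparse region
```

A larger k-NN distance marks a sparser region of feature space, so a sampler that pushes this quantity up moves away from the in-distribution features.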

[149] KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation

WenBo Xu,Liu Liu,Li Zhang,Ran Zhang,Hao Wu,Dan Guo,Meng Wang

Main category: cs.CV

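TL;DR: KineDiff3D is a unified framework for reconstructing diverse articulated objects and estimating their pose from single-view input. It combines a VAE, diffusion models, and an iterative optimization module to achieve accurate reconstruction and kinematic parameter estimation.

Details Motivation: Articulated objects (such as laptops and drawers) are hard to reconstruct and pose-estimate because of their multi-part geometries and variable joint configurations. Traditional methods struggle with this structural diversity, which motivates the new approach.

Contribution: 1. Proposes a Kinematic-Aware VAE (KA-VAE) that encodes geometry, joint angles, and part segmentation into a structured latent space. 2. Employs two conditional diffusion models, one for regressing global pose and joint parameters and one for generating the latent code. 3. Designs an iterative optimization module that bidirectionally refines reconstruction accuracy and kinematic parameters.

Method: 1. KA-VAE encodes complete geometry, joint angles, and part segmentation. 2. Two conditional diffusion models handle global pose regression and latent code generation, respectively. 3. An iterative optimization module based on Chamfer distance preserves kinematic constraints.

Result: Experiments on synthetic, semi-synthetic, and real-world datasets show that KineDiff3D accurately reconstructs articulated objects and estimates their kinematic properties.

Insight: Combining a VAE and diffusion models with iterative optimization can significantly improve reconstruction accuracy and kinematic parameter estimation for articulated objects.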

Abstract: Articulated objects, such as laptops and drawers, pose significant challenges for 3D reconstruction and pose estimation due to their multi-part geometries and variable joint configurations, which introduce structural diversity across different states. To address these challenges, we propose KineDiff3D: Kinematic-Aware Diffusion for Category-Level Articulated Object Shape Reconstruction and Generation, a unified framework for reconstructing diverse articulated instances and estimating their pose from single-view input. Specifically, we first encode complete geometry (SDFs), joint angles, and part segmentation into a structured latent space via a novel Kinematic-Aware VAE (KA-VAE). In addition, we employ two conditional diffusion models: one for regressing global pose (SE(3)) and joint parameters, and another for generating the kinematic-aware latent code from partial observations. Finally, we introduce an iterative optimization module that bidirectionally refines reconstruction accuracy and kinematic parameters via Chamfer-distance minimization while preserving articulation constraints. Experimental results on synthetic, semi-synthetic, and real-world datasets demonstrate the effectiveness of our approach in accurately reconstructing articulated objects and estimating their kinematic properties.
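
The Chamfer distance that the refinement step minimizes can be written in a few lines. The sketch below uses one common convention (mean squared nearest-neighbour distance, summed over both directions); the abstract does not say which variant KineDiff3D uses, so treat the exact form as an assumption.

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets."""
    def one_way(src, dst):
        # Mean squared distance from each source point to its nearest target point.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

# Toy point sets (assumed data): `shifted` is `cloud` moved by one unit along x.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in cloud]

same = chamfer_distance(cloud, cloud)
moved = chamfer_distance(cloud, shifted)
```

The distance is zero for identical sets and grows as the sets drift apart, which is what makes it usable as an objective when refining pose and joint parameters against an observed partial point cloud.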

[150] GACO-CAD: Geometry-Augmented and Conciseness-Optimized CAD Model Generation from Single Image

Yinghui Wang,Xinyu Zhang,Peng Du

Main category: cs.CV

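TL;DR: The paper proposes GACO-CAD, a two-stage training framework that uses geometric augmentation and conciseness optimization to generate more accurate CAD models from a single image.

Details Motivation: Current MLLMs struggle to infer 3D geometry from 2D images, so spatial cues and an improved modeling process are needed.

Contribution: Proposes geometry-augmented input combining depth and surface normal maps, and designs a reinforcement learning reward that improves modeling conciseness.

Method: A two-stage framework: 1) supervised fine-tuning that incorporates geometric priors; 2) reinforcement learning with a group length reward that optimizes the modeling procedure.

Result: State-of-the-art performance on the DeepCAD and Fusion360 datasets, balancing geometric accuracy and modeling conciseness.

Insight: Combining geometric priors with a conciseness reward significantly improves the quality of generated CAD models.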

Abstract: Generating editable, parametric CAD models from a single image holds great potential to lower the barriers of industrial concept design. However, current multi-modal large language models (MLLMs) still struggle with accurately inferring 3D geometry from 2D images due to limited spatial reasoning capabilities. We address this limitation by introducing GACO-CAD, a novel two-stage post-training framework. It is designed to achieve a joint objective: simultaneously improving the geometric accuracy of the generated CAD models and encouraging the use of more concise modeling procedures. First, during supervised fine-tuning, we leverage depth and surface normal maps as dense geometric priors, combining them with the RGB image to form a multi-channel input. In the context of single-view reconstruction, these priors provide complementary spatial cues that help the MLLM more reliably recover 3D geometry from 2D observations. Second, during reinforcement learning, we introduce a group length reward that, while preserving high geometric fidelity, promotes the generation of more compact and less redundant parametric modeling sequences. A simple dynamic weighting strategy is adopted to stabilize training. Experiments on the DeepCAD and Fusion360 datasets show that GACO-CAD achieves state-of-the-art performance under the same MLLM backbone, consistently outperforming existing methods in terms of code validity, geometric accuracy, and modeling conciseness.

[151] Investigating Adversarial Robustness against Preprocessing used in Blackbox Face Recognition

Roland Croft,Brian Du,Darcy Joseph,Sharath Kumar

Main category: cs.CV

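TL;DR: The paper studies how preprocessing affects the transferability of adversarial attacks on face recognition (FR) systems in blackbox settings, and proposes a preprocessing-invariant method that improves attack transferability.

Details Motivation: FR systems are vulnerable to adversarial examples, but the role of preprocessing in blackbox settings is often overlooked. The paper investigates how preprocessing affects adversarial attacks and proposes an improvement.

Contribution: 1. Shows that preprocessing strongly affects attack success rate, driven mainly by the choice of face detection model, while the interpolation method has relatively minimal impact; 2. Proposes a preprocessing-invariant method that improves attack transferability by up to 27%.

Method: Studies the effect of several preprocessing techniques on adversarial attacks and proposes a preprocessing-invariant method based on input transformations to improve transferability in blackbox settings.

Result: The choice of face detection model can degrade the attack success rate by up to 78%, while the input-transformation method improves attack transferability by up to 27%.

Insight: Preprocessing is a key component of FR systems, and adversarial attack design must account for it to improve the generalisation of adversarial examples.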

Abstract: Face Recognition (FR) models have been shown to be vulnerable to adversarial examples that subtly alter benign facial images, exposing blind spots in these systems, as well as protecting user privacy. End-to-end FR systems first obtain preprocessed faces from diverse facial imagery prior to computing the similarity of the deep feature embeddings. Whilst face preprocessing is a critical component of FR systems, and hence adversarial attacks against them, we observe that this preprocessing is often overlooked in blackbox settings. Our study seeks to investigate the transferability of several out-of-the-box state-of-the-art adversarial attacks against FR when applied against different preprocessing techniques used in a blackbox setting. We observe that the choice of face detection model can degrade the attack success rate by up to 78%, whereas choice of interpolation method during downsampling has relatively minimal impacts. Furthermore, we find that the requirement for facial preprocessing even degrades attack strength in a whitebox setting, due to the unintended interaction of produced noise vectors against face detection models. Based on these findings, we propose a preprocessing-invariant method using input transformations that improves the transferability of the studied attacks by up to 27%. Our findings highlight the importance of preprocessing in FR systems, and the need for its consideration towards improving the adversarial generalisation of facial adversarial examples.

[152] Benchmarking Out-of-Distribution Detection for Plankton Recognition: A Systematic Evaluation of Advanced Methods in Marine Ecological Monitoring

Yingzi Han,Jiakai He,Chuanlong Xie,Jianping Li

Main category: cs.CV

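TL;DR: The paper systematically evaluates 22 OoD detection methods for plankton recognition, building benchmarks with several distribution-shift scenarios on the DYB-PlanktonNet dataset. ViM performs best in Far-OoD scenarios, providing a useful reference for the field.

Details Motivation: Plankton recognition models face distribution shift in real-world deployment, which can lead to prediction errors. Existing work lacks a systematic integration of recent computer vision methods and a large-scale benchmark.

Contribution: 1. The first large-scale, systematic evaluation of OoD detection for plankton recognition; 2. Benchmarks covering several distribution-shift scenarios; 3. The finding that ViM stands out in Far-OoD scenarios.

Method: Builds OoD benchmarks on the DYB-PlanktonNet dataset and systematically evaluates 22 OoD detection methods, with a focus on comparing performance in Far-OoD scenarios.

Result: ViM significantly outperforms other methods in Far-OoD scenarios, with substantial improvements in key metrics.

Insight: OoD detection in plankton recognition requires dedicated methods; ViM performs well on this task, and future work can build on this benchmark.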

Abstract: Automated plankton recognition models face significant challenges during real-world deployment due to distribution shifts (Out-of-Distribution, OoD) between training and test data. This stems from plankton’s complex morphologies, vast species diversity, and the continuous discovery of novel species, which leads to unpredictable errors during inference. Despite rapid advancements in OoD detection methods in recent years, the field of plankton recognition still lacks a systematic integration of the latest computer vision developments and a unified benchmark for large-scale evaluation. To address this, this paper meticulously designed a series of OoD benchmarks simulating various distribution shift scenarios based on the DYB-PlanktonNet dataset, and systematically evaluated twenty-two OoD detection methods. Extensive experimental results demonstrate that the ViM method significantly outperforms other approaches in our constructed benchmarks, particularly excelling in Far-OoD scenarios with substantial improvements in key metrics. This comprehensive evaluation not only provides a reliable reference for algorithm selection in automated plankton recognition but also lays a solid foundation for future research in plankton OoD detection. To our knowledge, this study marks the first large-scale, systematic evaluation and analysis of Out-of-Distribution data detection methods in plankton recognition. Code is available at https://github.com/BlackJack0083/PlanktonOoD.

[153] Capturing Head Avatar with Hand Contacts from a Monocular Video

Haonan He,Yufeng Zheng,Jie Song

Main category: cs.CV

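TL;DR: This paper presents a novel framework for capturing head avatars with hand contacts from monocular video, addressing the non-rigid deformations caused by hand-face interactions.

Details Motivation: Existing 3D head avatar methods focus on the facial region and ignore natural hand-face interactions (such as a hand resting on the chin or fingers touching the cheek), which convey cognitive states such as pondering. This work fills the gap by jointly learning head avatars and hand-induced non-rigid deformations.

Contribution: The main contributions are: 1) a pose tracking method that combines a depth order loss with contact regularization; 2) a learned PCA basis for hand-induced facial deformations; 3) a contact loss that improves physical plausibility.

Method: The method has two parts: 1) a depth order loss and contact regularization ensure correct spatial relationships between hand and face; 2) a PCA basis is learned from a face-hand interaction dataset, so that PCA parameters are estimated instead of a full deformation field. In addition, inspired by physics-based simulation, a contact loss reduces interpenetration artifacts.

Result: Evaluated on RGB(D) videos captured by an iPhone and on a synthetic dataset, the method achieves better facial appearance and more accurate deforming geometry than existing surface reconstruction methods.

Insight: Hand-face interaction is an important factor in avatar realism, and physical constraints together with a PCA basis significantly improve realism and geometric accuracy.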

Abstract: Photorealistic 3D head avatars are vital for telepresence, gaming, and VR. However, most methods focus solely on facial regions, ignoring natural hand-face interactions, such as a hand resting on the chin or fingers gently touching the cheek, which convey cognitive states like pondering. In this work, we present a novel framework that jointly learns detailed head avatars and the non-rigid deformations induced by hand-face interactions. There are two principal challenges in this task. First, naively tracking hand and face separately fails to capture their relative poses. To overcome this, we propose to combine depth order loss with contact regularization during pose tracking, ensuring correct spatial relationships between the face and hand. Second, no publicly available priors exist for hand-induced deformations, making them non-trivial to learn from monocular videos. To address this, we learn a PCA basis specific to hand-induced facial deformations from a face-hand interaction dataset. This reduces the problem to estimating a compact set of PCA parameters rather than a full spatial deformation field. Furthermore, inspired by physics-based simulation, we incorporate a contact loss that provides additional supervision, significantly reducing interpenetration artifacts and enhancing the physical plausibility of the results. We evaluate our approach on RGB(D) videos captured by an iPhone. Additionally, to better evaluate the reconstructed geometry, we construct a synthetic dataset of avatars with various types of hand interactions. We show that our method can capture better appearance and more accurate deforming geometry of the face than SOTA surface reconstruction methods.

[154] HIDISC: A Hyperbolic Framework for Domain Generalization with Generalized Category Discovery

Vaibhav Rathore,Divyam Gupta,Biplab Banerjee

Main category: cs.CV

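TL;DR: HIDISC proposes a hyperbolic framework for domain generalization with generalized category discovery, using GPT-guided diffusion augmentation and tangent-space interpolation to achieve efficient, state-of-the-art results.

Details Motivation: Existing Generalized Category Discovery (GCD) methods usually assume that labeled and unlabeled training data come from the same domain, which limits their use under distribution shift. HIDISC aims to remove this limitation while improving computational efficiency and performance.

Contribution: 1. Proposes HIDISC, a hyperbolic representation learning framework that requires no episodic domain simulation; 2. Introduces GPT-guided diffusion augmentation to generate diverse domain variations; 3. Designs Tangent CutMix, a curvature-aware interpolation method; 4. Proposes a unified loss combining Busemann alignment and contrastive regularization.

Method: HIDISC generates diverse samples with GPT-guided diffusion augmentation, synthesizes pseudo-novel samples in tangent space with Tangent CutMix, and structures the representation space with a loss combining Busemann alignment, hyperbolic contrastive regularization, and adaptive outlier repulsion.

Result: On PACS, Office-Home, and DomainNet, HIDISC outperforms existing Euclidean and hyperbolic GCD methods, achieving state-of-the-art performance.

Insight: HIDISC shows the potential of hyperbolic geometry for domain generalization and generalized category discovery, and highlights the importance of data augmentation and a structured representation space.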

Abstract: Generalized Category Discovery (GCD) aims to classify test-time samples into either seen categories (available during training) or novel ones, without relying on label supervision. Most existing GCD methods assume simultaneous access to labeled and unlabeled data during training, with both arising from the same domain, limiting applicability in open-world scenarios involving distribution shifts. Domain Generalization with GCD (DG-GCD) lifts this constraint by requiring models to generalize to unseen domains containing novel categories, without accessing target-domain data during training. The only prior DG-GCD method, DG2CD-Net, relies on episodic training with multiple synthetic domains and task vector aggregation, incurring high computational cost and error accumulation. We propose HIDISC, a hyperbolic representation learning framework that achieves domain and category-level generalization without episodic simulation. To expose the model to minimal but diverse domain variations, we augment the source domain using GPT-guided diffusion, avoiding overfitting while maintaining efficiency. To structure the representation space, we introduce Tangent CutMix, a curvature-aware interpolation that synthesizes pseudo-novel samples in tangent space, preserving manifold consistency. A unified loss, combining penalized Busemann alignment, hybrid hyperbolic contrastive regularization, and adaptive outlier repulsion, facilitates compact, semantically structured embeddings. A learnable curvature parameter further adapts the geometry to dataset complexity. HIDISC achieves state-of-the-art results on PACS, Office-Home, and DomainNet, consistently outperforming the existing Euclidean and hyperbolic (DG)-GCD baselines.

[155] ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models

Pu Zhang,Yuwei Li,Xingyuan Xian,Guoming Tang

Main category: cs.CV

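TL;DR: ZSPAPrune proposes a zero-shot, prompt-aware visual token pruning method that balances task relevance and information diversity, significantly reducing the inference cost of vision-language models while maintaining performance.

Details Motivation: Vision-language models produce many redundant visual tokens when processing large inputs, leading to high inference costs. Existing methods neglect the guidance of the text prompt and fail to prioritize task relevance.

Contribution: 1. Proposes a zero-shot, prompt-aware token pruning method; 2. Models token pruning as a balance between task relevance and information diversity; 3. Uses a hierarchical approach that selects core task-relevant tokens and supplements them with diversity tokens.

Method: A hierarchical strategy first selects core task-relevant tokens and then adds diversity tokens to preserve context. The method requires no additional training and is applied directly at inference.

Result: Across multiple models and benchmarks, the method matches or surpasses the state of the art with only minimal accuracy loss even when pruning up to 90% of the tokens, while significantly reducing GPU memory footprint and inference latency.

Insight: The text prompt is key to selecting task-relevant tokens, and the hierarchical approach preserves core information while retaining broader context.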

Abstract: As the capabilities of Vision-Language Models (VLMs) advance, they can process increasingly large inputs, which, unlike in LLMs, generates significant visual token redundancy and leads to prohibitive inference costs. While many methods aim to reduce these costs by pruning visual tokens, existing approaches, whether based on attention or diversity, typically neglect the guidance of the text prompt and thus fail to prioritize task relevance. In this work, we propose a novel, zero-shot method that reframes the problem by introducing a prompt-aware perspective, explicitly modeling visual token pruning as a balance between task relevance and information diversity. Our hierarchical approach first selects a core set of task-relevant visual tokens and then supplements them with diversity tokens to preserve broader context. Experiments across multiple models and benchmarks show that our method achieves performance that matches or surpasses the state-of-the-art with only minimal accuracy loss, even when pruning up to 90% of the tokens. Furthermore, these gains are accompanied by significant reductions in GPU memory footprint and inference latency.
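
The two-stage selection described above can be sketched as follows. The scoring functions are assumptions, since the abstract does not specify them: relevance is taken to be cosine similarity to a single prompt embedding, and diversity is approximated by greedily adding the token least similar to those already kept. All names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def prune_tokens(tokens, prompt, n_core, n_diverse):
    """Return sorted indices of kept tokens: relevance first, then diversity."""
    # Stage 1: the core set is the tokens most similar to the prompt embedding.
    order = sorted(range(len(tokens)),
                   key=lambda i: cosine(tokens[i], prompt), reverse=True)
    kept = order[:n_core]
    remaining = order[n_core:]
    # Stage 2: greedily add the token least similar to anything already kept.
    for _ in range(min(n_diverse, len(remaining))):
        best = min(remaining,
                   key=lambda i: max((cosine(tokens[i], tokens[j]) for j in kept),
                                     default=0.0))
        kept.append(best)
        remaining.remove(best)
    return sorted(kept)

# Toy 2-D "visual tokens" and prompt embedding (assumed data).
tokens = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (-1.0, 0.0), (0.7, 0.7)]
prompt = (1.0, 0.0)
kept = prune_tokens(tokens, prompt, n_core=2, n_diverse=1)
```

Here the two tokens aligned with the prompt form the core set, and the diversity stage adds the token pointing in the opposite direction, since it is the least redundant with what was already kept.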

[156] From Pixels to People: Satellite-Based Mapping and Quantification of Riverbank Erosion and Lost Villages in Bangladesh

M Saifuzzaman Rafat,Mohd Ruhul Ameen,Akif Islam,Abu Saleh Musa Miah,Jungpil Shin

Main category: cs.CV

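TL;DR: The paper uses the Segment Anything Model (SAM) and a manually annotated dataset to develop a high-accuracy satellite image analysis method for monitoring riverbank erosion and vanished villages in Bangladesh, giving policymakers a new tool.

Details Motivation: River erosion in Bangladesh destroys villages and farmland. Traditional manual monitoring is inefficient and imprecise, so an automated, high-accuracy method is needed.

Contribution: 1. The first manually annotated dataset of vanished settlements in Bangladesh; 2. A SAM model fine-tuned for riverbank erosion; 3. A method for quantifying land loss with visual evidence.

Method: Combines a simple color-channel analysis with fine-tuning of SAM’s mask decoder to recognize the subtle signatures of riverbank erosion.

Result: The model achieves a mean IoU of 86.30% and a Dice score of 92.60%, significantly outperforming traditional methods and off-the-shelf deep learning models.

Insight: AI models have great potential in environmental monitoring, especially for disaster prediction and policy intervention, where they provide efficient and accurate data support.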

Abstract: The great rivers of Bangladesh, arteries of commerce and sustenance, are also agents of relentless destruction. Each year, they swallow whole villages and vast tracts of farmland, erasing communities from the map and displacing thousands of families. To track this slow-motion catastrophe has, until now, been a Herculean task for human analysts. Here we show how a powerful general-purpose vision model, the Segment Anything Model (SAM), can be adapted to this task with remarkable precision. To do this, we assembled a new dataset - a digital chronicle of loss compiled from historical Google Earth imagery of Bangladesh’s most vulnerable regions, including Mokterer Char Union, Kedarpur Union, Balchipara village, and Chowhali Upazila, from 2003 to 2025. Crucially, this dataset is the first to include manually annotated data on the settlements that have vanished beneath the water. Our method first uses a simple color-channel analysis to provide a rough segmentation of land and water, and then fine-tunes SAM’s mask decoder to recognize the subtle signatures of riverbank erosion. The resulting model demonstrates a keen eye for this destructive process, achieving a mean Intersection over Union of 86.30% and a Dice score of 92.60% - a performance that significantly surpasses traditional methods and off-the-shelf deep learning models. This work delivers three key contributions: the first annotated dataset of disappeared settlements in Bangladesh due to river erosion; a specialized AI model fine-tuned for this critical task; and a method for quantifying land loss with compelling visual evidence. Together, these tools provide a powerful new lens through which policymakers and disaster management agencies can monitor erosion, anticipate its trajectory, and ultimately protect the vulnerable communities in its path.
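
For reference, the two reported metrics can be computed from binary masks as in the sketch below. The masks are toy data, and the function is a generic illustration rather than the authors' evaluation code.

```python
def iou_and_dice(pred, truth):
    """IoU and Dice for two binary masks given as equal-length 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_area = sum(1 for p in pred if p)
    truth_area = sum(1 for t in truth if t)
    union = pred_area + truth_area - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement
    return inter / union, 2 * inter / (pred_area + truth_area)

# Flattened toy masks (assumed data): 1 = land lost to erosion, 0 = unchanged.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]
iou, dice = iou_and_dice(pred, truth)
```

For a single pair of masks the two metrics are tied by Dice = 2·IoU / (1 + IoU). Plugging in an IoU of 0.8630 gives about 0.926, in line with the reported Dice score, although dataset-level means need not satisfy the identity exactly.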

[157] Round Outcome Prediction in VALORANT Using Tactical Features from Video Analysis

Nirai Hayakawa,Kazumasa Shimari,Kazuma Yamasaki,Hirotatsu Hoshikawa,Rikuto Tsuchida,Kenichi Matsumoto

Main category: cs.CV

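TL;DR: The study analyzes minimap information in VALORANT match footage, extracts tactical features (such as character positions and in-game events), and combines them with the TimeSformer video recognition model, reaching about 81% round-outcome prediction accuracy, with the clearest gains from the middle phases of a round onward.

Details Motivation: Current research on predicting esports match outcomes relies mostly on match logs and statistics, overlooking complex tactical information. This study analyzes footage of the FPS game VALORANT to extract tactical features and improve prediction models.

Contribution: Proposes a video-analysis-based round outcome prediction method that extracts tactical features from minimap information, significantly improving prediction accuracy and demonstrating the importance of tactical features for outcome prediction in FPS games.

Method: Based on the TimeSformer video recognition model, the method analyzes minimap information in match footage and extracts tactical features such as character positions and in-game events to train a prediction model. Experiments compare a dataset with minimap information only against a dataset augmented with tactical event labels.

Result: The model trained on the dataset augmented with tactical event labels reaches about 81% prediction accuracy and significantly outperforms the minimap-only model, especially from the middle phases of a round onward.

Insight: Tactical features extracted from match footage are valuable for predicting FPS game outcomes, and future work could explore further tactical features.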

Abstract: Recently, research on predicting match outcomes in esports has been actively conducted, but much of it is based on match log data and statistical information. This research targets the FPS game VALORANT, which requires complex strategies, and aims to build a round outcome prediction model by analyzing minimap information in match footage. Specifically, based on the video recognition model TimeSformer, we attempt to improve prediction accuracy by incorporating detailed tactical features extracted from minimap information, such as character position information and other in-game events. This paper reports preliminary results showing that a model trained on a dataset augmented with such tactical event labels achieved approximately 81% prediction accuracy, especially from the middle phases of a round onward, significantly outperforming a model trained on a dataset with the minimap information itself. This suggests that leveraging tactical features from match footage is highly effective for predicting round outcomes in VALORANT.

[158] $\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs

Yingqi Fan,Anhao Zhao,Jinlan Fu,Junlong Tong,Hui Su,Yijie Pan,Wei Zhang,Xiaoyu Shen

Main category: cs.CV

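TL;DR: The paper uncovers a three-stage cross-modal interaction process in MLLMs and proposes VisiPruner, a training-free pruning framework that significantly reduces computational overhead.

Details Motivation: Existing MLLMs incur heavy computational overhead on multimodal tasks, and how they process multimodal information is poorly understood.

Contribution: 1. Uncovers the three-stage cross-modal interaction process in MLLMs; 2. Proposes the VisiPruner framework, which efficiently reduces computational overhead.

Method: Based on a systematic analysis of cross-modal interaction in MLLMs, the paper designs VisiPruner, a training-free pruning framework that dynamically reduces computation on visual tokens.

Result: On LLaVA-v1.5 7B, VisiPruner reduces up to 99% of vision-related attention computations and 53.9% of FLOPs, outperforming existing token pruning methods.

Insight: Cross-modal interaction in MLLMs proceeds in stages, and efficient designs should align with these intrinsic layer-wise processing dynamics.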

Abstract: Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, they lack a fundamental understanding of how MLLMs process and fuse multimodal information. Through systematic analysis, we uncover a three-stage cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose VisiPruner, a training-free pruning framework that reduces up to 99% of vision-related attention computations and 53.9% of FLOPs on LLaVA-v1.5 7B. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics. Our code is available at: https://github.com/EIT-NLP/VisiPruner.

[159] When One Moment Isn’t Enough: Multi-Moment Retrieval with Cross-Moment Interactions

Zhuo Cao,Heming Du,Bingqing Zhang,Xin Yu,Xue Li,Sen Wang

Main category: cs.CV

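TL;DR: The paper addresses multi-moment retrieval (MMR) to overcome the limitations of existing single-moment retrieval (SMR) methods in real-world applications. The authors release a high-quality dataset, QV-M$^2$, along with new evaluation metrics, and propose the FlashMMR framework, whose Multi-moment Post-verification module refines moment boundaries and significantly improves performance.

Details Motivation: Existing moment retrieval methods focus on single-moment retrieval (SMR), but in real-world applications one query can correspond to multiple relevant moments. The paper targets this gap with a multi-moment retrieval benchmark and method.

Contribution: 1. Releases the high-quality multi-moment retrieval dataset QV-M$^2$ and new evaluation metrics; 2. Proposes the FlashMMR framework, which refines retrieval results with a Multi-moment Post-verification module.

Method: FlashMMR includes a Multi-moment Post-verification module that applies constrained temporal adjustment and then uses a verification module to re-evaluate candidate segments, pruning low-confidence proposals to achieve robust multi-moment alignment.

Result: On QV-M$^2$, FlashMMR surpasses the prior SOTA method by 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3.

Insight: Multi-moment retrieval is closer to real-world needs; QV-M$^2$ is an effective benchmark for training and evaluating MMR models, and FlashMMR provides a strong baseline for future work.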

Abstract: Existing moment retrieval (MR) methods focus on Single-Moment Retrieval (SMR). However, one query can correspond to multiple relevant moments in real-world applications. This makes the existing datasets and methods insufficient for video temporal grounding. By revisiting the gap between current MR tasks and real-world applications, we introduce a high-quality dataset called QVHighlights Multi-Moment Dataset (QV-M$^2$), along with new evaluation metrics tailored for multi-moment retrieval (MMR). QV-M$^2$ consists of 2,212 annotations covering 6,384 video segments. Building on existing efforts in MMR, we propose a framework called FlashMMR. Specifically, we propose a Multi-moment Post-verification module to refine the moment boundaries. We introduce constrained temporal adjustment and subsequently leverage a verification module to re-evaluate the candidate segments. Through this sophisticated filtering pipeline, low-confidence proposals are pruned, and robust multi-moment alignment is achieved. We retrain and evaluate 6 existing MR methods on QV-M$^2$ and QVHighlights under both SMR and MMR settings. Results show that QV-M$^2$ serves as an effective benchmark for training and evaluating MMR models, while FlashMMR provides a strong baseline. Specifically, on QV-M$^2$, it achieves improvements over the prior SOTA method of 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. The proposed benchmark and method establish a foundation for advancing research in more realistic and challenging video temporal grounding scenarios. Code is released at https://github.com/Zhuo-Cao/QV-M2.
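
To make the multi-moment setting concrete, the sketch below computes temporal IoU between segments and greedily matches ranked predictions to several ground-truth moments. It is a generic illustration with an assumed IoU threshold of 0.5; it is not the paper's G-mAP, mAP@3+tgt, or mR@3 metric, and the function names are hypothetical.

```python
def temporal_iou(a, b):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_moments(preds, truths, threshold=0.5):
    """Greedily match ranked predictions to still-unmatched ground-truth moments."""
    unmatched = list(range(len(truths)))
    matches = []
    for pi, pred in enumerate(preds):
        best = max(unmatched,
                   key=lambda ti: temporal_iou(pred, truths[ti]), default=None)
        if best is not None and temporal_iou(pred, truths[best]) >= threshold:
            matches.append((pi, best))
            unmatched.remove(best)
    recall = len(matches) / len(truths) if truths else 0.0
    return matches, recall

# Toy query with two relevant moments (assumed data), times in seconds.
truths = [(10.0, 20.0), (40.0, 55.0)]
preds = [(11.0, 19.0), (30.0, 35.0), (42.0, 56.0)]  # ranked by confidence
matches, recall = match_moments(preds, truths)
```

A single-moment evaluation would stop after the first hit; counting recall over every ground-truth moment is what separates the MMR setting from SMR.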

[160] Fair and Interpretable Deepfake Detection in Videos

Akihito Yoshii,Ryosuke Sonoda,Ramya Srinivasan

Main category: cs.CV

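TL;DR: This paper proposes a fair and interpretable deepfake detection framework that improves fairness, reliability, and interpretability through temporal feature learning and demographic-aware data augmentation.

Details Motivation: Existing deepfake detection methods are biased, lack transparency, and fail to capture temporal information, which leads to unfair decisions and unreliable results across demographic groups.

Contribution: The main contributions are: 1) a fairness-aware detection framework that combines temporal feature learning with demographic-aware data augmentation; 2) sequence-based clustering and concept extraction to improve reliability and interpretability; 3) a data augmentation method that balances demographic groups and uses frequency-domain transformations to preserve forgery artifacts.

Method: The method comprises: 1) sequence-based clustering for temporal modeling of deepfake videos; 2) concept extraction for interpretability; 3) demographic-aware data augmentation that balances group distributions and preserves forgery features through frequency-domain transformations.

Result: Experiments on FaceForensics++, DFD, Celeb-DF, and DFDC show that the method achieves the best tradeoff between fairness and accuracy compared with existing methods.

Insight: Building fairness and interpretability into the design can markedly improve the practical usefulness of deepfake detection, particularly its generalization across demographic groups.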

Abstract: Existing deepfake detection methods often exhibit bias, lack transparency, and fail to capture temporal information, leading to biased decisions and unreliable results across different demographic groups. In this paper, we propose a fairness-aware deepfake detection framework that integrates temporal feature learning and demographic-aware data augmentation to enhance fairness and interpretability. Our method leverages sequence-based clustering for temporal modeling of deepfake videos and concept extraction to improve detection reliability while also facilitating interpretable decisions for non-expert users. Additionally, we introduce a demography-aware data augmentation method that balances underrepresented groups and applies frequency-domain transformations to preserve deepfake artifacts, thereby mitigating bias and improving generalization. Extensive experiments on FaceForensics++, DFD, Celeb-DF, and DFDC datasets using state-of-the-art (SoTA) architectures (Xception, ResNet) demonstrate the efficacy of the proposed method in obtaining the best tradeoff between fairness and accuracy when compared to SoTA.

[161] FineVision: Open Data Is All You Need

Luis Wiedmann,Orr Zohar,Amir Mahla,Xiaohan Wang,Rui Li,Thibaud Frere,Leandro von Werra,Aritra Roy Gosthipaty,Andrés Marafioti

Main category: cs.CV

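TL;DR: FineVision is a carefully collected, curated, and unified dataset for vision-language models (VLMs) containing 24 million samples, the largest open resource of its kind. A semi-automated, human-in-the-loop pipeline unifies more than 200 sources, and rigorous de-duplication and decontamination lead to better model performance.

Details Motivation: Existing VLM datasets are fragmented, inconsistent, and contaminated, which limits model performance and research progress. FineVision aims to provide a high-quality, unified, large-scale dataset that addresses these problems.

Contribution: 1) Introduces FineVision, the largest open vision-language dataset to date, with 24 million samples; 2) designs a semi-automated, human-in-the-loop pipeline that ensures data quality and consistency; 3) improves dataset cleanliness through de-duplication and decontamination.

Method: A semi-automated, human-in-the-loop pipeline: automation handles bulk ingestion and schema mapping, while human reviewers verify mapping accuracy, diversity, safety, and formatting and apply targeted fixes. Rigorous de-duplication and decontamination are also applied.

Result: Models trained on FineVision consistently outperform models trained on existing open datasets across a broad evaluation suite, demonstrating the value of scale, data quality, and human oversight.

Insight: 1) Data quality and consistency are critical to VLM performance; 2) combining human oversight with automation effectively improves dataset quality; 3) open datasets can greatly accelerate research.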

Abstract: The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.

[162] Enhanced Motion Forecasting with Plug-and-Play Multimodal Large Language Models

Katie Luo,Jingwei Ji,Tong He,Runsheng Xu,Yichen Xie,Dragomir Anguelov,Mingxing Tan

Main category: cs.CV

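TL;DR: PnF is a plug-and-play method that combines multimodal large language models (MLLMs) with existing motion forecasting models to improve motion prediction in complex scenarios.

Details Motivation: Motion forecasting models in current autonomous driving systems perform reliably under standard conditions but struggle to generalize cost-effectively to diverse real-world scenarios. PnF uses natural language to describe complex scenarios and adapt quickly to targeted behaviors.

Contribution: Proposes the PnF framework, which uses the zero-shot reasoning capabilities of MLLMs to augment motion forecasting models and significantly improves performance without fine-tuning.

Method: Prompts are designed to extract structured scene understanding from MLLMs, and this information is distilled into learnable embeddings that augment existing behavior prediction models.

Result: Experiments on the Waymo Open Motion Dataset and the nuScenes dataset show performance improvements on both benchmarks.

Insight: Natural language offers a more effective way to model complex scenarios, and the zero-shot capabilities of MLLMs can be integrated seamlessly into downstream tasks.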

Abstract: Current autonomous driving systems rely on specialized models for perceiving and predicting motion, which demonstrate reliable performance in standard conditions. However, generalizing cost-effectively to diverse real-world scenarios remains a significant challenge. To address this, we propose Plug-and-Forecast (PnF), a plug-and-play approach that augments existing motion forecasting models with multimodal large language models (MLLMs). PnF builds on the insight that natural language provides a more effective way to describe and handle complex scenarios, enabling quick adaptation to targeted behaviors. We design prompts to extract structured scene understanding from MLLMs and distill this information into learnable embeddings to augment existing behavior prediction models. Our method leverages the zero-shot reasoning capabilities of MLLMs to achieve significant improvements in motion prediction performance, while requiring no fine-tuning – making it practical to adopt. We validate our approach on two state-of-the-art motion forecasting models using the Waymo Open Motion Dataset and the nuScenes Dataset, demonstrating consistent performance improvements across both benchmarks.

[163] SG-CLDFF: A Novel Framework for Automated White Blood Cell Classification and Segmentation

Mehdi Zekriyapanah Gashti,Mostafa Mohammadpour,Ghasem Farjamnia

Main category: cs.CV

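TL;DR: The paper proposes SG-CLDFF, a new framework that improves the robustness and interpretability of white blood cell classification and segmentation through saliency guidance and multi-scale feature fusion.

Details Motivation: Classifying and segmenting white blood cells in microscopic images is essential for diagnosing hematological disorders, but remains challenging because of staining variability, complex backgrounds, and class imbalance.

Contribution: 1) A saliency-guided preprocessing method; 2) a lightweight hybrid backbone (EfficientSwin-style) and a multi-resolution feature fusion module (ResNeXt-CC-style); 3) multi-task learning and loss design that address class imbalance; 4) interpretability analysis of the model (e.g., Grad-CAM).

Method: SG-CLDFF combines saliency-driven preprocessing with multi-scale deep feature fusion. Segmentation and classification are optimized jointly through multi-task learning, and training uses class-aware weighted losses and saliency-alignment regularization.

Result: Experiments on standard datasets (BCCD, LISC, ALL-IDB) show that SG-CLDFF outperforms CNN and transformer baselines in IoU, F1, and classification accuracy. Ablation studies confirm the contributions of saliency preprocessing and the feature fusion module.

Insight: 1) Saliency-guided feature fusion helps improve model robustness; 2) multi-task learning and loss design can effectively mitigate class imbalance; 3) interpretability tools such as Grad-CAM make the model more practical in clinical settings.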

Abstract: Accurate segmentation and classification of white blood cells (WBCs) in microscopic images are essential for diagnosis and monitoring of many hematological disorders, yet remain challenging due to staining variability, complex backgrounds, and class imbalance. In this paper, we introduce a novel Saliency-Guided Cross-Layer Deep Feature Fusion framework (SG-CLDFF) that tightly integrates saliency-driven preprocessing with multi-scale deep feature aggregation to improve both robustness and interpretability for WBC analysis. SG-CLDFF first computes saliency priors to highlight candidate WBC regions and guide subsequent feature extraction. A lightweight hybrid backbone (EfficientSwin-style) produces multi-resolution representations, which are fused by a ResNeXt-CC-inspired cross-layer fusion module to preserve complementary information from shallow and deep layers. The network is trained in a multi-task setup with concurrent segmentation and cell-type classification heads, using class-aware weighted losses and saliency-alignment regularization to mitigate imbalance and suppress background activation. Interpretability is enforced through Grad-CAM visualizations and saliency consistency checks, allowing model decisions to be inspected at the regional level. We validate the framework on standard public benchmarks (BCCD, LISC, ALL-IDB), reporting consistent gains in IoU, F1, and classification accuracy compared to strong CNN and transformer baselines. An ablation study also demonstrates the individual contributions of saliency preprocessing and cross-layer fusion. SG-CLDFF offers a practical and explainable path toward more reliable automated WBC analysis in clinical workflows.

[164] Machine Vision-Based Surgical Lighting System: Design and Implementation

Amir Gharghabi,Mahdi Hakiminezhad,Maryam Shafaei,Shaghayegh Gharghabi

Main category: cs.CV

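TL;DR: The paper proposes a machine vision-based surgical lighting system that uses the YOLOv11 algorithm to automatically detect a surgical marker and adjusts the light source position with servomotors, reducing surgeon fatigue and improving illumination consistency.

Details Motivation: Traditional surgical lighting systems rely on manual adjustment, which tends to cause surgeon fatigue and inconsistent illumination, affecting surgical precision and safety.

Contribution: 1. Designs an automated surgical lighting system based on the YOLOv11 algorithm; 2. positions the light source precisely using servomotors; 3. achieves 96.7% mAP@50 on the validation set.

Method: 1. YOLOv11 detects the surgical marker (a blue spherical marker); 2. two servomotors with tilt-pan brackets steer a high-power LED light source.

Result: On a validation set simulating surgical scenes, the detector reaches 96.7% mAP@50; the system is reported to improve illumination consistency and user experience.

Insight: Machine vision can efficiently address the manual adjustment problem in traditional surgical lighting systems and suggests new directions for making other medical devices intelligent.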

Abstract: Effortless and ergonomically designed surgical lighting is critical for precision and safety during procedures. However, traditional systems often rely on manual adjustments, leading to surgeon fatigue, neck strain, and inconsistent illumination due to drift and shadowing. To address these challenges, we propose a novel surgical lighting system that leverages the YOLOv11 object detection algorithm to identify a blue marker placed above the target surgical site. A high-power LED light source is then directed to the identified location using two servomotors equipped with tilt-pan brackets. The YOLO model achieves 96.7% mAP@50 on the validation set consisting of annotated images simulating surgical scenes with the blue spherical marker. By automating the lighting process, this machine vision-based solution reduces physical strain on surgeons, improves consistency in illumination, and supports improved surgical outcomes.
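
The abstract does not say how a detected marker position is turned into servo commands. One plausible mapping, shown here as a sketch only, converts the bounding-box center into pan and tilt angles with a pinhole camera model. The function name `pixel_to_pan_tilt`, the field-of-view parameters, and the assumption that the camera sits on the lamp's optical axis are all hypothetical and not taken from the paper.

```python
import math


def pixel_to_pan_tilt(cx, cy, width, height, hfov_deg, vfov_deg):
    """Map a detection center (pixels) to pan/tilt offsets in degrees.

    Assumes a pinhole camera aligned with the lamp; positive pan points right
    and positive tilt points down, following image coordinates.
    """
    # Focal lengths in pixels, derived from the fields of view.
    fx = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    fy = (height / 2) / math.tan(math.radians(vfov_deg) / 2)
    pan = math.degrees(math.atan((cx - width / 2) / fx))
    tilt = math.degrees(math.atan((cy - height / 2) / fy))
    return pan, tilt
```

A marker at the image center yields zero offsets, and a marker on the right edge yields a pan of half the horizontal field of view, which gives a quick sanity check for any real calibration.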

[165] Exploring Structural Degradation in Dense Representations for Self-supervised Learning

Siran Dai,Qianqian Xu,Peisong Wen,Yang Liu,Qingming Huang

Main category: cs.CV

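TL;DR: The paper studies a counterintuitive phenomenon in self-supervised learning (SSL): training for too long can harm performance on dense prediction tasks (e.g., semantic segmentation), termed Self-supervised Dense Degradation (SDD). The authors propose DSE, an annotation-free evaluation metric, and build model selection and regularization strategies on it that effectively mitigate SDD.

Details Motivation: In SSL it is commonly assumed that longer training yields better performance, yet dense prediction tasks show performance degradation. Addressing this requires an evaluation method that needs no annotations.

Contribution: 1. Identifies the SDD phenomenon and verifies that it is widespread; 2. proposes the DSE metric for annotation-free evaluation; 3. designs DSE-based model selection and regularization strategies.

Method: 1. DSE, which combines a class-relevance measure and an effective dimensionality measure, assesses the structural quality of dense representations; 2. DSE guides model selection and regularization.

Result: Across 16 SSL methods and 4 benchmarks, model selection improves mIoU by 3.0% on average, and the regularization method effectively mitigates SDD.

Insight: SSL training length needs to be tuned per task, since dense tasks may have a performance turning point; DSE is an efficient annotation-free evaluation tool.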

Abstract: In this work, we observe a counterintuitive phenomenon in self-supervised learning (SSL): longer training may impair the performance of dense prediction tasks (e.g., semantic segmentation). We refer to this phenomenon as Self-supervised Dense Degradation (SDD) and demonstrate its consistent presence across sixteen state-of-the-art SSL methods with various losses, architectures, and datasets. When the model performs suboptimally on dense tasks at the end of training, measuring the performance during training becomes essential. However, evaluating dense performance effectively without annotations remains an open challenge. To tackle this issue, we introduce a Dense representation Structure Estimator (DSE), composed of a class-relevance measure and an effective dimensionality measure. The proposed DSE is both theoretically grounded and empirically validated to be closely correlated with the downstream performance. Based on this metric, we introduce a straightforward yet effective model selection strategy and a DSE-based regularization method. Experiments on sixteen SSL methods across four benchmarks confirm that model selection improves mIoU by $3.0\%$ on average with negligible computational cost. Additionally, DSE regularization consistently mitigates the effects of dense degradation. Code is available at https://github.com/EldercatSAM/SSL-Degradation.
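
The abstract does not give the formula for DSE. As a rough illustration of what an effective dimensionality measure and score-based checkpoint selection can look like, the sketch below uses a common stand-in: the exponential of the Shannon entropy of the normalized eigenvalue spectrum (often called effective rank). The functions `effective_dimensionality` and `select_checkpoint` are hypothetical and may differ from the paper's definitions.

```python
import math


def effective_dimensionality(eigenvalues):
    """Effective rank: exp of the entropy of the normalized eigenvalue spectrum.

    `eigenvalues` are non-negative eigenvalues of a feature covariance matrix,
    assumed to be computed elsewhere.
    """
    total = sum(eigenvalues)
    probs = [v / total for v in eigenvalues if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def select_checkpoint(scores):
    """Return the checkpoint key whose annotation-free score is highest."""
    return max(scores, key=scores.get)
```

A flat spectrum over four directions gives a value near 4, while a spectrum concentrated in one direction gives 1, so the measure drops when dense features collapse onto few dimensions.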

[166] LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding

ZhaoYang Han,Qihan Lin,Hao Liang,Bowen Chen,Zhou Liu,Wentao Zhang

Main category: cs.CV

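TL;DR: LongInsightBench is the first benchmark focused on long-video understanding. It integrates visual, audio, and text modalities, covers language, viewpoints, actions, and other content, and is built around six task scenarios and a rigorous data quality assurance pipeline. Experiments show that omni-modal models still struggle with temporal localization and long-range causal inference.

Details Motivation: Existing benchmarks mainly target short videos or single-modality tasks and lack evaluation of long-video, multimodal understanding. LongInsightBench fills this gap and aims to advance research on omni-modal models in complex scenarios.

Contribution: 1) The first long-video multimodal benchmark; 2) six task scenarios and a data quality assurance pipeline; 3) evidence of the shortcomings of omni-modal models in temporal localization and causal inference.

Method: 1) Selects approximately 1,000 long videos from the FineVideo dataset; 2) designs six task scenarios, covering Intra-Event and Inter-Event tasks; 3) applies a three-step, semi-automated data quality assurance pipeline.

Result: Omni-modal models perform poorly on temporal localization (T-Loc) and long-range causal inference (CE-Caus), and multimodal fusion exhibits information loss and processing bias.

Insight: Long-video understanding needs finer temporal modeling and better multimodal fusion; current omni-modal models still require improvement to handle complex scenarios.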

Abstract: We introduce LongInsightBench, the first benchmark designed to assess models’ ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating visual, audio, and text modalities. Our benchmark excels in three key areas: a) Long-Duration, Information-Dense Videos: We carefully select approximately 1,000 videos from the open-source dataset FineVideo based on duration limits and the information density of both visual and audio modalities, focusing on content like lectures, interviews, and vlogs, which contain rich language elements. b) Diverse and Challenging Task Scenarios: We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. c) Rigorous and Comprehensive Quality Assurance Pipelines: We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results show that omni-modal models (OLMs) still face challenges in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal information loss and processing bias in the multi-modal fusion of OLMs. Our dataset and code are available at https://anonymous.4open.science/r/LongInsightBench-910F/.

[167] iDETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA

Zhaoran Zhao,Xinli Yue,Jianhui Sun,Yuhao Xie,Tao Shao,Liangchao Yao,Fan Xia,Yuetang Deng

Main category: cs.CV

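TL;DR: iDETEX is a multimodal large language model for detailed, explainable image quality assessment (IQA) that unifies three subtasks (quality grounding, perception, and description) in a single evaluation paradigm.

Details Motivation: Image quality assessment has progressed from simple scalar prediction to more interpretable, human-aligned evaluation paradigms. iDETEX targets the emerging challenge of detailed and explainable IQA.

Contribution: Proposes iDETEX, a unified multimodal large language model that performs quality grounding, perception, and description simultaneously and achieves state-of-the-art results on the ViDA-UGC benchmark.

Method: Task-specific offline augmentation modules and a data mixing strategy are complemented by online enhancement strategies to make full use of multi-source supervision.

Result: iDETEX ranks first in the ICCV MIPI 2025 Detailed Image Quality Assessment Challenge, demonstrating its effectiveness in delivering accurate and interpretable quality assessments.

Insight: A unified multi-task framework combined with efficient data augmentation offers a new solution for complex IQA tasks.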

Abstract: Image Quality Assessment (IQA) has progressed from scalar quality prediction to more interpretable, human-aligned evaluation paradigms. In this work, we address the emerging challenge of detailed and explainable IQA by proposing iDETEX, a unified multimodal large language model (MLLM) capable of simultaneously performing three key tasks: quality grounding, perception, and description. To facilitate efficient and generalizable training across these heterogeneous subtasks, we design a suite of task-specific offline augmentation modules and a data mixing strategy. These are further complemented by online enhancement strategies to fully exploit multi-sourced supervision. We validate our approach on the large-scale ViDA-UGC benchmark, where iDETEX achieves state-of-the-art performance across all subtasks. Our model ranks first in the ICCV MIPI 2025 Detailed Image Quality Assessment Challenge, demonstrating its effectiveness and robustness in delivering accurate and interpretable quality assessments.

[168] Nearest-Class Mean and Logits Agreement for Wildlife Open-Set Recognition

Jiahao Huo,Mufhumudzi Muthivhi,Terence L. van Zyl,Fredrik Gustafsson

Main category: cs.CV

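TL;DR: The paper proposes a post-processing open-set recognition method that measures the agreement between an input's features and its predicted logits. It compares a Nearest Class Mean (NCM) based distribution with the softmax probabilities to reject unknown classes, and performs consistently on two datasets.

Details Motivation: Current wildlife classification models tend to make overconfident predictions on unknown-class samples in open-set settings, and existing open-set recognition methods usually require retraining the model. The paper aims for a post-processing method that needs no retraining.

Contribution: Proposes a post-processing open-set recognition method based on the agreement between NCM and logits, which requires no retraining and performs consistently on two datasets.

Method: The distance from an input to its nearest class mean is used to build a probability distribution, which is compared with the softmax probabilities to measure agreement between the feature space and the logit space.

Result: The method ranks within the top three on both datasets, with an AUROC of 93.41 (African animals) and 95.35 (Swedish animals); by contrast, current state-of-the-art methods excel on only one of the two datasets.

Insight: Agreement between the feature space and the logit space can serve as a useful signal for open-set recognition and gives consistent performance without additional training.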

Abstract: Current state-of-the-art wildlife classification models are trained under the closed-world setting. When exposed to unknown classes, they remain overconfident in their predictions. Open-set Recognition (OSR) aims to classify known classes while rejecting unknown samples. Several OSR methods have been proposed to model the closed-set distribution by observing the feature, logit, or softmax probability space. A significant drawback of many existing approaches is the requirement to retrain the pre-trained classification model with the OSR-specific strategy. This study contributes a post-processing OSR method that measures the agreement between the model’s features and predicted logits. We propose a probability distribution based on an input’s distance to its Nearest Class Mean (NCM). The NCM-based distribution is then compared with the softmax probabilities from the logit space to measure agreement between the NCM and the classification head. Our proposed strategy ranks within the top three on two evaluated datasets, showing consistent performance across the two datasets. In contrast, current state-of-the-art methods excel on a single dataset. We achieve an AUROC of 93.41 and 95.35 for African and Swedish animals, respectively. The code can be found at https://github.com/Applied-Representation-Learning-Lab/OSR.
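
To make the feature/logit agreement idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' code: the NCM distribution is taken to be a softmax over negative Euclidean distances to the class means, and agreement is measured by histogram intersection. Both choices, and the names `ncm_distribution` and `agreement_score`, are hypothetical, since the abstract does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ncm_distribution(feature, class_means):
    """Turn distances to the class means into a probability distribution."""
    distances = [math.dist(feature, mean) for mean in class_means]
    return softmax([-d for d in distances])


def agreement_score(feature, logits, class_means):
    """Overlap between the NCM distribution and the classifier's softmax.

    Returns a value in (0, 1]; low values suggest the feature space and the
    classification head disagree, which is treated as a sign of an unknown.
    """
    p_ncm = ncm_distribution(feature, class_means)
    p_head = softmax(logits)
    return sum(min(a, b) for a, b in zip(p_ncm, p_head))
```

A sample close to one class mean with a confident matching logit scores near 1, while a sample equidistant from all means but with an overconfident head scores much lower, so thresholding the score rejects it without retraining the classifier.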

[169] Exploring The Missing Semantics In Event Modality

Jingqian Wu,Shengpeng Xu,Yunbo Jia,Edmund Y. Lam

Main category: cs.CV

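TL;DR: The paper proposes the Semantic-E2VID framework, which introduces a cross-modal feature alignment module and a semantic-aware feature fusion block and draws on semantic knowledge from SAM to significantly improve event-to-video reconstruction quality.

Details Motivation: Event cameras offer low latency, high dynamic range, and efficient motion capture, but the lack of semantic information in the event modality degrades event-to-video (E2V) reconstruction. Existing methods largely overlook the importance of semantic information.

Contribution: 1. Proposes the Semantic-E2VID framework with a cross-modal feature alignment (CFA) module and a semantic-aware feature fusion (SFF) block; 2. uses semantic information from SAM to enrich event-modality representations; 3. proposes a new semantic-aware E2V supervision scheme.

Method: 1. The CFA module aligns SAM's semantic information with the event encoder; 2. the SFF block fuses semantic features with event representations; 3. SAM-generated categorical labels provide semantic-aware supervision.

Result: Across multiple benchmarks, Semantic-E2VID significantly improves reconstructed frame quality and outperforms existing E2V methods.

Insight: Semantic information is crucial to event-to-video reconstruction, and cross-modal semantic alignment and fusion can effectively compensate for the semantics missing from the event modality.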

Abstract: Event cameras offer distinct advantages such as low latency, high dynamic range, and efficient motion capture. However, event-to-video reconstruction (E2V), a fundamental event-based vision task, remains challenging, particularly for reconstructing and recovering semantic information. This is primarily due to the nature of the event camera, as it only captures intensity changes, ignoring static objects and backgrounds, resulting in a lack of semantic information in captured event modality. Further, semantic information plays a crucial role in video and frame reconstruction, yet is often overlooked by existing E2V approaches. To bridge this gap, we propose Semantic-E2VID, an E2V framework that explores the missing visual semantic knowledge in event modality and leverages it to enhance event-to-video reconstruction. Specifically, Semantic-E2VID introduces a cross-modal feature alignment (CFA) module to transfer the robust visual semantics from a frame-based vision foundation model, the Segment Anything Model (SAM), to the event encoder, while aligning the high-level features from distinct modalities. To better utilize the learned semantic feature, we further propose a semantic-aware feature fusion (SFF) block to integrate learned semantics in frame modality to form event representations with rich semantics that can be decoded by the event decoder. Further, to facilitate the reconstruction of semantic information, we propose a novel Semantic Perceptual E2V Supervision that helps the model to reconstruct semantic details by leveraging SAM-generated categorical labels. Extensive experiments demonstrate that Semantic-E2VID significantly enhances frame quality, outperforming state-of-the-art E2V methods across multiple benchmarks. The sample code is included in the supplementary material.

[170] M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception

U. V. B. L Udugama,George Vosselman,Francesco Nex

Main category: cs.CV

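TL;DR: M2H is a lightweight multi-task learning framework built on window-based cross-task attention for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. It improves cross-task consistency while remaining computationally efficient.

Details Motivation: Real-world edge devices need efficient multi-task models for real-time spatial perception that exploit complementary information across tasks while keeping computational overhead low.

Contribution: Proposes the M2H framework, which introduces a Window-Based Cross-Task Attention Module and combines it with a lightweight ViT backbone to improve multi-task consistency and efficiency.

Method: A Window-Based Cross-Task Attention Module structures feature exchange between tasks while preserving task-specific details; the backbone is a lightweight ViT-based DINOv2.

Result: M2H outperforms state-of-the-art multi-task models and single-task baselines on NYUDv2, Hypersim, and Cityscapes, and runs efficiently on laptop hardware.

Insight: Windowed attention can balance feature sharing and task specificity in multi-task learning, and a lightweight backbone suits edge deployment.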

Abstract: Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.

[171] Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs

Vaggelis Dorovatas,Soroush Seifi,Gunshi Gupta,Rahaf Aljundi

Main category: cs.CV

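TL;DR: Proposes a training-free method compatible with standard video large language models (Video-LLMs) that combines attention-based visual token selection, recurrent processing of past tokens, and caption-based question answering to improve the efficiency and effectiveness of streaming video processing.

Details Motivation: Existing Video-LLMs need access to the full video, but streaming scenarios require processing long videos online and answering promptly, where conventional approaches are inefficient.

Contribution: 1) LLM attention-based visual token selection that can discard up to about 95% of unimportant tokens; 2) recurrent processing of past selected tokens to build temporally coherent understanding; 3) lightweight caption-based question answering.

Method: 1) Select visual tokens by attention; 2) recurrently process past selected tokens; 3) answer questions from captions.

Result: Achieves state-of-the-art performance on streaming video benchmarks, balancing efficiency and accuracy.

Insight: Focusing on the visual tokens the model actually uses can greatly reduce computation without significantly hurting performance.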

Abstract: Video Large Language Models (Video-LLMs) excel at understanding videos in-context, provided they have full access to the video when answering queries. However, these models face challenges in streaming scenarios where hour-long videos must be processed online, and questions need timely responses. In this work, we propose a training-free approach compatible with standard Video-LLMs, leveraging three key concepts: 1) LLM-informed selection of visual tokens to identify those that the LLM has attended to and contributed to its understanding of each short clip. Our attention-based selection allows us to discard up to ~95% of unimportant visual tokens with minimal performance loss; 2) Recurrent processing of past selected tokens to generate temporally coherent understanding of each processed clip; 3) Caption-based question answering for lightweight and accurate responses. Our method achieves state-of-the-art performance on streaming video benchmarks, striking a balance between efficiency and effectiveness.
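
The token selection step can be sketched in a few lines of Python. This is a simplified illustration, not the authors' method: it assumes one aggregated attention score per visual token is already available (how scores are pooled over heads and layers is not stated in the abstract), and the names `select_tokens` and `stream_select` are hypothetical. The real approach also feeds the kept tokens back into the model for the next clip, whereas this sketch only accumulates them.

```python
def select_tokens(attention, keep_ratio=0.05):
    """Return indices of the highest-attention tokens, in their original order."""
    n_keep = max(1, round(len(attention) * keep_ratio))
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return sorted(ranked[:n_keep])


def stream_select(clips, keep_ratio=0.05):
    """Process clips one at a time and accumulate the selected tokens.

    `clips` is an iterable of (tokens, attention) pairs, one per short clip.
    """
    memory = []
    for tokens, attention in clips:
        kept = select_tokens(attention, keep_ratio)
        memory.extend(tokens[i] for i in kept)
    return memory
```

With `keep_ratio=0.05`, a clip of 40 visual tokens contributes only 2 tokens to memory, which mirrors the roughly 95% reduction mentioned in the abstract.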

[172] Closed-Loop Transfer for Weakly-supervised Affordance Grounding

Jiajin Tang,Zhengxuan Wei,Ge Zheng,Sibei Yang

Main category: cs.CV

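TL;DR: The paper proposes LoopTrans, a closed-loop framework that transfers affordance knowledge in both directions between exocentric and egocentric images, significantly improving affordance grounding performance.

Details Motivation: Humans can learn how to interact with new objects by watching others interact with them. Previous work transfers affordance knowledge only one way, from exocentric to egocentric images, which limits its applicability in complex interaction scenarios.

Contribution: The main contribution is the LoopTrans framework, which enables bidirectional knowledge transfer between exocentric and egocentric images and bridges the domain gap with unified cross-modal localization and denoising knowledge distillation.

Method: LoopTrans uses a closed loop that transfers knowledge from exocentric to egocentric images and then transfers it back to enhance exocentric knowledge extraction, together with cross-modal localization and denoising knowledge distillation.

Result: Experiments show that LoopTrans achieves consistent improvements on image and video benchmarks and handles challenging cases in which the interaction region is fully occluded by the human body.

Insight: Closed-loop bidirectional knowledge transfer can significantly improve the robustness and generalization of affordance grounding, especially in complex interaction scenarios.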

Abstract: Humans can perform previously unexperienced interactions with novel objects simply by observing others engage with them. Weakly-supervised affordance grounding mimics this process by learning to locate object regions that enable actions on egocentric images, using exocentric interaction images with image-level annotations. However, extracting affordance knowledge solely from exocentric images and transferring it one-way to egocentric images limits the applicability of previous works in complex interaction scenarios. Instead, this study introduces LoopTrans, a novel closed-loop framework that not only transfers knowledge from exocentric to egocentric but also transfers back to enhance exocentric knowledge extraction. Within LoopTrans, several innovative mechanisms are introduced, including unified cross-modal localization and denoising knowledge distillation, to bridge domain gaps between object-centered egocentric and interaction-centered exocentric images while enhancing knowledge transfer. Experiments show that LoopTrans achieves consistent improvements across all metrics on image and video benchmarks, even handling challenging scenarios where object interaction regions are fully occluded by the human body.

[173] Monitoring Horses in Stalls: From Object to Event Detection

Dmitrii Galimzianov,Viacheslav Vyshegorodtsev,Ivan Nezhivykh

Main category: cs.CV

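TL;DR: The study presents a prototype vision-based monitoring system that combines YOLOv11 and BoT-SORT to automatically detect and track horses and people in stalls and infers event states from trajectories and spatial relations.

Details Motivation: Monitoring the behavior of stalled horses is essential for early detection of health problems, but it currently relies on manual work that is labor-intensive and time-consuming.

Contribution: Presents an automated monitoring system that combines object detection and multi-object tracking, and builds a custom dataset for it.

Method: YOLOv11 performs object detection, BoT-SORT performs multi-object tracking, and event types are inferred from trajectories and spatial relations.

Result: The system reliably detects horse-related events, but detection of people is limited by data scarcity.

Insight: The system lays a foundation for real-time behavioral monitoring, with potential applications in animal welfare and stable management.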

Abstract: Monitoring the behavior of stalled horses is essential for early detection of health and welfare issues but remains labor-intensive and time-consuming. In this study, we present a prototype vision-based monitoring system that automates the detection and tracking of horses and people inside stables using object detection and multi-object tracking techniques. The system leverages YOLOv11 and BoT-SORT for detection and tracking, while event states are inferred based on object trajectories and spatial relations within the stall. To support development, we constructed a custom dataset annotated with assistance from foundation models CLIP and GroundingDINO. The system distinguishes between five event types and accounts for the camera’s blind spots. Qualitative evaluation demonstrated reliable performance for horse-related events, while highlighting limitations in detecting people due to data scarcity. This work provides a foundation for real-time behavioral monitoring in equine facilities, with implications for animal welfare and stable management.

[174] DeepDetect: Learning All-in-One Dense Keypoints

Shaharyar Ahmed Khan Tareen,Filza Khan Tareen

Main category: cs.CV

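TL;DR: DeepDetect is an intelligent, all-in-one dense keypoint detector that uses deep learning to combine the strengths of classical detectors, addressing the sensitivity to photometric changes, low keypoint density, and limited adaptability of existing methods.

Details Motivation: Both traditional keypoint detectors and learning-based methods are limited with respect to photometric changes, keypoint density, repeatability, and semantic understanding. DeepDetect aims to provide a more robust, high-density keypoint detector by combining the diverse visual cues of classical detectors with the strengths of deep learning.

Contribution: 1. Proposes DeepDetect, an all-in-one dense keypoint detector; 2. generates high-quality ground-truth masks by fusing the outputs of 7 keypoint detectors and 2 edge detectors; 3. trains the lightweight and efficient ESPNet model to achieve high-density keypoint detection.

Method: 1. Fuse the outputs of multiple keypoint and edge detectors into ground-truth masks; 2. train an ESPNet model on these masks; 3. detect dense keypoints under diverse and visually degraded conditions.

Result: On the Oxford Affine Covariant Regions dataset, DeepDetect outperforms other detectors in keypoint density (0.5143), repeatability (0.9582), and number of correct matches (59,003).

Insight: By combining the diversity of classical detectors with the semantic understanding of deep learning, DeepDetect shows the value of a unified approach to keypoint detection, especially its robustness under visually degraded conditions.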

Abstract: Keypoint detection is the foundation of many computer vision tasks, including image registration, structure-from-motion, 3D reconstruction, visual odometry, and SLAM. Traditional detectors (SIFT, SURF, ORB, BRISK, etc.) and learning-based methods (SuperPoint, R2D2, LF-Net, D2-Net, etc.) have shown strong performance yet suffer from key limitations: sensitivity to photometric changes, low keypoint density and repeatability, limited adaptability to challenging scenes, and lack of semantic understanding, often failing to prioritize visually important regions. We present DeepDetect, an intelligent, all-in-one, dense keypoint detector that unifies the strengths of classical detectors using deep learning. First, we create ground-truth masks by fusing outputs of 7 keypoint and 2 edge detectors, extracting diverse visual cues from corners and blobs to prominent edges and textures in the images. Afterwards, a lightweight and efficient model, ESPNet, is trained using these masks as labels, enabling DeepDetect to focus semantically on images while producing highly dense keypoints that are adaptable to diverse and visually degraded conditions. Evaluations on the Oxford Affine Covariant Regions dataset demonstrate that DeepDetect surpasses other detectors in keypoint density, repeatability, and the number of correct matches, achieving maximum values of 0.5143 (average keypoint density), 0.9582 (average repeatability), and 59,003 (correct matches).
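
The label-construction step can be illustrated with a small Python sketch. It assumes the fusion is a pixel-wise union of binary masks, which the abstract does not confirm, and the helper names `keypoints_to_mask`, `fuse_masks`, and `mask_coverage` are hypothetical. In practice the keypoints and edge maps would come from real detectors; here they are plain lists so the example stays self-contained.

```python
def keypoints_to_mask(keypoints, height, width):
    """Rasterize (x, y) keypoints into a binary mask of the given size."""
    mask = [[0] * width for _ in range(height)]
    for x, y in keypoints:
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width:
            mask[row][col] = 1
    return mask


def fuse_masks(masks):
    """Pixel-wise OR over equally sized binary masks (assumed fusion rule)."""
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [int(any(m[r][c] for m in masks)) for c in range(width)]
        for r in range(height)
    ]


def mask_coverage(mask):
    """Fraction of pixels labeled 1 (illustrative, not the paper's density metric)."""
    flat = [v for row in mask for v in row]
    return sum(flat) / len(flat)
```

Because a union keeps every pixel that any detector fires on, the fused mask is at least as dense as the densest single detector, which fits the stated goal of highly dense keypoints.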

[175] Leveraging AV1 motion vectors for Fast and Dense Feature Matching

Julien Zouein,Hossein Javidnia,François Pitié,Anil Kokaram

Main category: cs.CV

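TL;DR: The paper uses AV1 motion vectors for fast, dense feature matching. The compressed-domain front end runs comparably to SIFT while using far less CPU and producing denser matches.

Details Motivation: Traditional feature matching methods such as SIFT are expensive in compute and time, which limits their practicality in large-scale applications. The authors therefore use compressed-domain motion vectors as a more efficient alternative.

Contribution: The main contribution is a method that derives dense sub-pixel correspondences and short tracks, filtered by cosine consistency, from AV1 motion vectors, achieving efficient feature matching in the compressed domain with lower compute requirements.

Method: Motion vectors are extracted from AV1-encoded video to produce dense sub-pixel correspondences, and noise is filtered by cosine consistency. The method runs as a compressed-domain front end, avoiding the cost of conventional feature extraction.

Result: On short videos the method performs comparably to sequential SIFT with far lower CPU usage. In an SfM experiment on a 117-frame clip, it registers all images and reconstructs 0.46-0.62M 3D points with a reprojection error of 0.51-0.53 px.

Insight: Compressed-domain motion vector matching is a practical and resource-efficient alternative with clear paths to scaling in full pipelines.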

Abstract: We repurpose AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency. On short videos, this compressed-domain front end runs comparably to sequential SIFT while using far less CPU, and yields denser matches with competitive pairwise geometry. As a small SfM demo on a 117-frame clip, MV matches register all images and reconstruct 0.46-0.62M points at 0.51-0.53 px reprojection error; bundle adjustment (BA) time grows with match density. These results show compressed-domain correspondences are a practical, resource-efficient front end with clear paths to scaling in full pipelines.
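
The abstract does not define "cosine consistency" precisely. One plausible reading, sketched below, chains per-frame motion vectors into a short track and keeps the track only if consecutive displacements point in nearly the same direction. The names `cosine`, `is_consistent`, and `track_positions`, the 0.9 threshold, and the handling of zero-length vectors are all assumptions; decoding motion vectors from an AV1 bitstream is outside the scope of this sketch.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two 2D displacement vectors."""
    norm_u, norm_v = math.hypot(*u), math.hypot(*v)
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: a static block counts as consistent
    return (u[0] * v[0] + u[1] * v[1]) / (norm_u * norm_v)


def is_consistent(track_mvs, min_cos=0.9):
    """True if every pair of consecutive motion vectors is directionally aligned."""
    return all(cosine(a, b) >= min_cos for a, b in zip(track_mvs, track_mvs[1:]))


def track_positions(start, track_mvs):
    """Accumulate sub-pixel motion vectors into a list of track positions."""
    points = [start]
    for dx, dy in track_mvs:
        x, y = points[-1]
        points.append((x + dx, y + dy))
    return points
```

Tracks that pass `is_consistent` give one sub-pixel correspondence per frame pair, which is the kind of input a structure-from-motion back end expects.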

[176] Initialize to Generalize: A Stronger Initialization Pipeline for Sparse-View 3DGS

Feng Zhou,Wenkai Guo,Pu Cao,Zhicheng Zhang,Jianqin Yin

Main category: cs.CV

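TL;DR: The paper proposes a stronger initialization pipeline for sparse-view 3D Gaussian Splatting (3DGS) that improves the initial point cloud to reduce overfitting and raise novel-view rendering quality.

Details Motivation: Sparse-view 3DGS tends to overfit the training views, causing artifacts such as blurring in novel views. Existing methods mostly add training-time constraints, but initialization is the decisive factor.

Contribution: The paper contributes a new initialization pipeline, consisting of frequency-aware SfM, 3DGS self-initialization, and point-cloud regularization, that improves sparse-view 3DGS performance.

Method: The pipeline includes: (1) frequency-aware SfM, which uses low-frequency view augmentation and relaxed multi-view correspondences; (2) 3DGS self-initialization, which uses photometric supervision to generate additional points; (3) point-cloud regularization, which enforces multi-view consistency and uniform coverage through geometric and visibility priors.

Result: Experiments on LLFF and Mip-NeRF360 show consistent gains in sparse-view settings, validating the approach as a stronger initialization strategy.

Insight: Initialization plays the decisive role in sparse-view 3DGS, and improving it is more effective than adding training-time constraints.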

Abstract: Sparse-view 3D Gaussian Splatting (3DGS) often overfits to the training views, leading to artifacts like blurring in novel view rendering. Prior work addresses it either by enhancing the initialization (i.e., the point cloud from Structure-from-Motion (SfM)) or by adding training-time constraints (regularization) to the 3DGS optimization. Yet our controlled ablations reveal that initialization is the decisive factor: it determines the attainable performance band in sparse-view 3DGS, while training-time constraints yield only modest within-band improvements at extra cost. Given initialization’s primacy, we focus our design there. Although SfM performs poorly under sparse views due to its reliance on feature matching, it still provides reliable seed points. Thus, building on SfM, our effort aims to supplement the regions it fails to cover as comprehensively as possible. Specifically, we design: (i) frequency-aware SfM that improves low-texture coverage via low-frequency view augmentation and relaxed multi-view correspondences; (ii) 3DGS self-initialization that lifts photometric supervision into additional points, compensating SfM-sparse regions with learned Gaussian centers; and (iii) point-cloud regularization that enforces multi-view consistency and uniform spatial coverage through simple geometric/visibility priors, yielding a clean and reliable point cloud. Our experiments on LLFF and Mip-NeRF360 demonstrate consistent gains in sparse-view settings, establishing our approach as a stronger initialization strategy. Code is available at https://github.com/zss171999645/ItG-GS.

[177] Split-Fuse-Transport: Annotation-Free Saliency via Dual Clustering and Optimal Transport Alignment

Muhammad Umer Ramzan,Ali Zia,Abdelwahed Khamis,Noman Ali,Usman Ali,Wei Xiang

Main category: cs.CV

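TL;DR: The paper proposes POTNet, an unsupervised saliency detection method that generates high-quality pseudo-masks through dual clustering and optimal-transport alignment, substantially improving the performance of unsupervised approaches.

Details Motivation: The authors argue that salient object detection can approach supervised accuracy, but only when reliable pseudo-masks are available. Existing prototype-based methods underuse global consistency and suffer from weak prototype quality, which motivates the improvement.

Contribution: POTNet uses entropy-guided dual clustering (spectral clustering for high-entropy pixels, k-means for low-entropy pixels) together with optimal-transport alignment to produce sharper pseudo-masks. The AutoSOD pipeline built on it trains end to end without supervision, with no handcrafted priors or offline voting.

Method: 1. An entropy-guided dual-clustering head: spectral clustering for high-entropy pixels, k-means for low-entropy pixels;
2. The two prototype sets are aligned by optimal transport;
3. The resulting pseudo-masks supervise a MaskFormer-style encoder-decoder.

Result: On five benchmarks, AutoSOD outperforms unsupervised methods by up to 26% and weakly supervised methods by up to 36% in F-measure, narrowing the gap to fully supervised models.

Insight: Boundary pixels and interior pixels have different geometry; combining dual clustering with optimal transport yields high-quality pseudo-masks and substantially improves unsupervised salient object detection.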

Abstract: Salient object detection (SOD) aims to segment visually prominent regions in images and serves as a foundational task for various computer vision applications. We posit that SOD can now reach near-supervised accuracy without a single pixel-level label, but only when reliable pseudo-masks are available. We revisit the prototype-based line of work and make two key observations. First, boundary pixels and interior pixels obey markedly different geometry; second, the global consistency enforced by optimal transport (OT) is underutilized if prototype quality is weak. To address this, we introduce POTNet, an adaptation of Prototypical Optimal Transport that replaces POT’s single k-means step with an entropy-guided dual-clustering head: high-entropy pixels are organized by spectral clustering, low-entropy pixels by k-means, and the two prototype sets are subsequently aligned by OT. This split-fuse-transport design yields sharper, part-aware pseudo-masks in a single forward pass, without handcrafted priors. Those masks supervise a standard MaskFormer-style encoder-decoder, giving rise to AutoSOD, an end-to-end unsupervised SOD pipeline that eliminates SelfMask’s offline voting yet improves both accuracy and training efficiency. Extensive experiments on five benchmarks show that AutoSOD outperforms unsupervised methods by up to 26% and weakly supervised methods by up to 36% in F-measure, further narrowing the gap to fully supervised models.
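The split and transport steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the entropy threshold, the uniform marginals, the Sinkhorn solver, and all function names are assumptions chosen for illustration, and the spectral-clustering and k-means steps that would produce the two prototype sets are omitted.

```python
import math

def pixel_entropy(probs):
    """Shannon entropy (in nats) of one pixel's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def split_by_entropy(pixel_probs, threshold):
    """Partition pixel indices into (high_entropy, low_entropy) lists.

    In the paper's design the high-entropy group would go to spectral
    clustering and the low-entropy group to k-means; the threshold here
    is a hypothetical parameter.
    """
    high, low = [], []
    for idx, probs in enumerate(pixel_probs):
        (high if pixel_entropy(probs) > threshold else low).append(idx)
    return high, low

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic-OT transport plan between two prototype sets.

    cost[i][j] is the cost of matching prototype i of the first set to
    prototype j of the second. Uniform marginals are an assumption.
    """
    n, m = len(cost), len(cost[0])
    a, b = 1.0 / n, 1.0 / m
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternate scaling so that column sums match b, then row sums match a.
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

For example, `split_by_entropy([[0.5, 0.5], [0.99, 0.01]], threshold=0.3)` places the first (ambiguous) pixel in the high-entropy group and the second in the low-entropy group, and `sinkhorn_plan([[0.0, 1.0], [1.0, 0.0]])` puts most of the transport mass on the diagonal, pairing each prototype with its closest counterpart.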

[178] Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization

Yuanli Wu,Long Zhang,Yue Du,Bin Li

Main category: cs.CV

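TL;DR: The paper proposes a context-aware pseudo-label scoring method for zero-shot video summarization. It turns a small amount of annotated data into high-quality pseudo labels and uses contextual prompts to stabilize large language model (LLM) scoring, outperforming unsupervised and zero-shot baselines on SumMe and TVSum.

Details Motivation: Supervised methods rely on dense annotations, which are costly and generalize poorly; unsupervised methods struggle to capture high-level semantics; zero-shot methods are sensitive to prompt templates and need dataset-specific normalization.

Contribution: 1. A zero-shot video summarization framework based on pseudo-label scoring; 2. context-aware prompting that balances local salience and global coherence; 3. performance approaching supervised methods on SumMe and TVSum.

Method: 1. Generate high-confidence pseudo labels from a small set of ground-truth annotations; 2. build dataset-adaptive, structured scoring rubrics; 3. at inference, use contextual information to assess each scene's narrative progression and redundancy.

Result: F1 scores of 57.58 on SumMe and 63.05 on TVSum, surpassing unsupervised and zero-shot baselines and approaching supervised methods.

Insight: Pseudo-label scoring stabilizes LLM outputs and offers a general, interpretable framework for zero-shot video summarization.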

Abstract: With the rapid proliferation of video content across social media, surveillance, and education platforms, efficiently summarizing long videos into concise yet semantically faithful surrogates has become increasingly vital. Existing supervised methods achieve strong in-domain accuracy by learning from dense annotations but suffer from high labeling costs and limited cross-dataset generalization, while unsupervised approaches, though label-free, often fail to capture high-level human semantics and fine-grained narrative cues. More recently, zero-shot prompting pipelines have leveraged large language models (LLMs) for training-free video summarization, yet remain highly sensitive to handcrafted prompt templates and dataset-specific score normalization. To overcome these limitations, we introduce a rubric-guided, pseudo-labeled prompting framework that transforms a small subset of ground-truth annotations into high-confidence pseudo labels, which are aggregated into structured, dataset-adaptive scoring rubrics guiding interpretable scene evaluation. During inference, first and last segments are scored based solely on their descriptions, whereas intermediate ones incorporate brief contextual summaries of adjacent scenes to assess narrative progression and redundancy. This contextual prompting enables the LLM to balance local salience and global coherence without parameter tuning. On SumMe and TVSum, our method achieves F1 scores of 57.58 and 63.05, surpassing unsupervised and prior zero-shot baselines while approaching supervised performance. The results demonstrate that rubric-guided pseudo labeling effectively stabilizes LLM-based scoring and establishes a general, interpretable zero-shot paradigm for video summarization.
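The inference-time rule in the abstract (first and last segments scored from their own descriptions, intermediate segments also given summaries of adjacent scenes) can be sketched as a prompt builder. The prompt wording, the `rubric` string, and the function name are hypothetical; the abstract does not specify the actual templates.

```python
def build_scene_prompts(descriptions, summaries, rubric):
    """Return one scoring prompt per scene.

    descriptions[i] is the full description of scene i and summaries[i]
    is a brief summary of the same scene. First and last scenes use only
    their own description; intermediate scenes also receive the
    summaries of the two adjacent scenes.
    """
    prompts = []
    last = len(descriptions) - 1
    for i, desc in enumerate(descriptions):
        parts = [f"Rubric: {rubric}", f"Scene: {desc}"]
        if 0 < i < last:
            # Context lets the scorer judge narrative progression and redundancy.
            parts.append(f"Previous scene summary: {summaries[i - 1]}")
            parts.append(f"Next scene summary: {summaries[i + 1]}")
        prompts.append("\n".join(parts))
    return prompts
```

Each returned string would then be sent to the LLM for scoring; that call is left out because it depends on the model and API in use.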

[179] MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models

Yongshun Zhang,Zhongyi Fan,Yonghang Zhang,Zhangzikang Li,Weifeng Chen,Zhongwei Feng,Chaoyue Wang,Peng Hou,Anxiang Zeng

Main category: cs.CV

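TL;DR: MUG-V 10B presents a framework for efficiently training large-scale video generation models. By optimizing data processing, model architecture, training strategy, and infrastructure, it improves both efficiency and performance, and it surpasses baseline models on e-commerce video generation tasks.

Details Motivation: Training large-scale video generation models faces challenges such as text-video alignment, long sequences, and complex spatiotemporal dependencies, which call for efficient solutions.

Contribution: 1. An optimized training framework for video generation models; 2. open-sourced training code and model weights; 3. strong results on e-commerce video tasks.

Method: Optimize data processing, model architecture, training strategy, and infrastructure, using Megatron-Core for efficient training and near-linear multi-node scaling.

Result: MUG-V 10B matches recent state-of-the-art models overall and outperforms open-source baselines on e-commerce video generation tasks.

Insight: Efficient training frameworks and open-source tooling are important for advancing video generation.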

Abstract: In recent years, large-scale generative models for visual content (e.g., images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling. Details are available at https://github.com/Shopee-MUG/MUG-V.

[180] MambaX-Net: Dual-Input Mamba-Enhanced Cross-Attention Network for Longitudinal MRI Segmentation

Yovin Yahathugoda,Davide Prezzi,Piyalitt Ittichaiwong,Vicky Goh,Sebastien Ourselin,Michela Antonelli

Main category: cs.CV

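TL;DR: MambaX-Net is a semi-supervised, dual-input 3D segmentation architecture. By combining a Mamba-enhanced Cross-Attention Module with a Shape Extractor Module, it addresses temporal dependencies and annotation scarcity in longitudinal MRI segmentation.

Details Motivation: Monitoring prostate cancer under longitudinal Active Surveillance (AS) requires MRI segmentation across multiple time points, but existing models are usually trained on single-time-point, expert-annotated data and are ill-suited to longitudinal analysis. MambaX-Net aims to overcome these challenges.

Contribution: 1. A Mamba-enhanced Cross-Attention Module that captures temporal evolution and spatial dependencies; 2. a Shape Extractor Module that refines zone delineation; 3. a semi-supervised self-training strategy that reduces reliance on expert annotations.

Method: 1. A dual-input architecture (the current MRI and the previous segmentation mask); 2. the Mamba-enhanced Cross-Attention Module, which integrates Mamba blocks; 3. the Shape Extractor Module, which encodes anatomical information; 4. a semi-supervised self-training strategy.

Result: On a longitudinal AS dataset, MambaX-Net significantly outperforms U-Net and Transformer-based models, even with limited and noisy data.

Insight: By combining the efficiency of Mamba blocks with semi-supervised learning, MambaX-Net offers a new approach to longitudinal medical image segmentation.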

Abstract: Active Surveillance (AS) is a treatment option for managing low and intermediate-risk prostate cancer (PCa), aiming to avoid overtreatment while monitoring disease progression through serial MRI and clinical follow-up. Accurate prostate segmentation is an important preliminary step for automating this process, enabling automated detection and diagnosis of PCa. However, existing deep-learning segmentation models are often trained on single-time-point and expertly annotated datasets, making them unsuitable for longitudinal AS analysis, where multiple time points and a scarcity of expert labels hinder their effective fine-tuning. To address these challenges, we propose MambaX-Net, a novel semi-supervised, dual-scan 3D segmentation architecture that computes the segmentation for time point t by leveraging the MRI and the corresponding segmentation mask from the previous time point. We introduce two new components: (i) a Mamba-enhanced Cross-Attention Module, which integrates the Mamba block into cross attention to efficiently capture temporal evolution and long-range spatial dependencies, and (ii) a Shape Extractor Module that encodes the previous segmentation mask into a latent anatomical representation for refined zone delineation. Moreover, we introduce a semi-supervised self-training strategy that leverages pseudo-labels generated from a pre-trained nnU-Net, enabling effective learning without expert annotations. MambaX-Net was evaluated on a longitudinal AS dataset, and results showed that it significantly outperforms state-of-the-art U-Net and Transformer-based models, achieving superior prostate zone segmentation even when trained on limited and noisy data.

[181] UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

Yuhao Yang,Zhen Yang,Zi-Yi Dou,Anh Nguyen,Keen You,Omar Attia,Andrew Szot,Michael Feng,Ram Ramrakhya,Alexander Toshev,Chao Huang,Yinfei Yang,Zhe Gan

Main category: cs.CV

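TL;DR: UltraCUA is a hybrid-action foundation model that combines low-level GUI operations with high-level programmatic tool calls, improving the performance and efficiency of computer-use agents.

Details Motivation: Existing computer-use agents (CUAs) rely solely on low-level GUI actions (such as click, type, and scroll), leading to long execution chains that propagate errors. UltraCUA addresses this with a hybrid action mechanism that combines GUI actions with high-level programmatic tool calls.

Contribution: 1) An automated pipeline that scales programmatic tools from software documentation and open-source code; 2) a synthetic data engine producing 17,000+ verifiable tasks; 3) a large-scale, high-quality collection of hybrid-action trajectories; 4) a two-stage training pipeline combining supervised fine-tuning with reinforcement learning.

Method: A hybrid action mechanism combines low-level GUI actions with high-level tool calls. The dataset is built with the automated pipeline, the synthetic data engine, and large-scale trajectory collection, and the model is trained in two stages: supervised fine-tuning followed by online reinforcement learning.

Result: On OSWorld, UltraCUA achieves an average 22% relative improvement over base models while being 11% faster in terms of steps; in out-of-domain evaluation on WindowsAgentArena it reaches a 21.7% success rate, outperforming baselines trained on Windows data.

Insight: The hybrid action mechanism reduces error propagation while maintaining execution efficiency, pointing to a new direction for improving computer-use agents.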

Abstract: Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action – seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.

[182] Glyph: Scaling Context Windows via Visual-Text Compression

Jiale Cheng,Yusen Liu,Xinyu Zhang,Yulin Fei,Wenyi Hong,Ruiliang Lyu,Weihan Wang,Zhe Su,Xiaotao Gu,Xiao Liu,Yushi Bai,Jie Tang,Hongning Wang,Minlie Huang

Main category: cs.CV

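TL;DR: Glyph scales context windows through visual-text compression: long texts are rendered into images and processed by a vision-language model, achieving 3-4x token compression while maintaining performance.

Details Motivation: Conventional LLMs incur sharply increasing compute and memory costs on long-context tasks; Glyph addresses this by compressing text visually.

Contribution: 1. The Glyph framework, which uses vision-language models to compress long text; 2. an LLM-driven genetic search that optimizes the visual rendering configuration; 3. demonstrated efficiency gains and application potential.

Method: Render long texts into images, process them with a vision-language model, and use a genetic search to tune the rendering configuration to balance accuracy and compression.

Result: 3-4x token compression, around 4x faster prefilling and decoding, approximately 2x faster SFT training, and, under extreme compression, the ability for a 128K-context VLM to handle 1M-token-level text tasks.

Insight: Visual compression offers an efficient, practical approach to long-context tasks and also shows potential for multimodal tasks.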

Abstract: Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
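As a back-of-the-envelope check on the figures in the abstract, the sketch below relates context window, compression ratio, and effective text length. It assumes that 128K means 128,000 tokens and that the compression ratio is uniform across the input; the function names are illustrative and are not taken from the Glyph code release.

```python
def effective_text_tokens(context_window, compression_ratio):
    """Text tokens that fit when each visual token stands in for
    `compression_ratio` text tokens on average."""
    return int(context_window * compression_ratio)

def required_ratio(target_text_tokens, context_window):
    """Compression ratio needed to fit the target text into the window."""
    return target_text_tokens / context_window

# At the reported 3-4x compression, a 128K window covers roughly
# 384K-512K text tokens.
print(effective_text_tokens(128_000, 3))   # 384000
print(effective_text_tokens(128_000, 4))   # 512000

# Reaching 1M text tokens needs close to 8x, consistent with the
# abstract describing that regime as "extreme compression".
print(required_ratio(1_000_000, 128_000))  # 7.8125
```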

[183] PAGE-4D: Disentangled Pose and Geometry Estimation for 4D Perception

Kaichen Zhou,Yuhan Wang,Grace Chen,Xinhai Chang,Gaspard Beaudouin,Fangneng Zhan,Paul Pu Liang,Mengyu Wang

Main category: cs.CV

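TL;DR: PAGE-4D is a feed-forward model that extends VGGT to dynamic scenes for camera pose estimation, depth prediction, and point cloud reconstruction. A dynamics-aware mask disentangles static and dynamic information, resolving the conflict between tasks.

Details Motivation: Existing 3D feed-forward models perform well on static scenes but struggle in real-world scenes with dynamic elements. PAGE-4D targets this problem, in particular the conflict between camera pose estimation and geometry reconstruction in dynamic regions.

Contribution: The PAGE-4D model, which disentangles static and dynamic information and uses a dynamics-aware mask for multi-task optimization, substantially improving performance in dynamic scenes.

Method: A dynamics-aware aggregator predicts a dynamics-aware mask that suppresses dynamic information for pose estimation while amplifying it for geometry reconstruction.

Result: In dynamic scenes, PAGE-4D outperforms VGGT on camera pose estimation, depth prediction, and point cloud reconstruction.

Insight: Disentangling static and dynamic information resolves the multi-task conflict, and the dynamics-aware mask is the key tool for doing so.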

Abstract: Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feedforward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction – all without post-processing. A central challenge in multi-task 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask – suppressing motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction.
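The opposing use of one mask for two tasks can be illustrated with a toy sketch. The multiplicative forms `1 - m` and `1 + m` are assumptions made for illustration only; the abstract states that motion cues are suppressed for pose estimation and amplified for geometry reconstruction but does not give the functional form, and the function name is hypothetical.

```python
def modulate_features(features, dynamics_mask):
    """Split one feature vector into pose-oriented and geometry-oriented views.

    features: per-token feature values (floats).
    dynamics_mask: per-token values in [0, 1], where 1 means "dynamic".
    Dynamic tokens are damped for pose estimation and boosted for
    geometry reconstruction (assumed multiplicative gating).
    """
    pose = [f * (1.0 - m) for f, m in zip(features, dynamics_mask)]
    geometry = [f * (1.0 + m) for f, m in zip(features, dynamics_mask)]
    return pose, geometry

# A static token (mask 0) passes through unchanged to both branches;
# a fully dynamic token (mask 1) is zeroed for pose and doubled for geometry.
pose, geometry = modulate_features([2.0, 2.0], [0.0, 1.0])
print(pose)      # [2.0, 0.0]
print(geometry)  # [2.0, 4.0]
```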

[184] Expose Camouflage in the Water: Underwater Camouflaged Instance Segmentation and Dataset

Chuhong Wang,Hua Li,Chongyi Li,Huazhong Liu,Xiongxin Tang,Sam Kwong

Main category: cs.CV

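TL;DR: The paper introduces UCIS4K, the first underwater camouflaged instance segmentation dataset, and UCIS-SAM, a new network based on the Segment Anything Model with three key modules that improve segmentation of underwater camouflaged objects.

Details Motivation: Underwater environments suffer from color distortion, low contrast, and blurring, and conventional terrestrial camouflaged instance segmentation methods perform poorly there, so dedicated datasets and methods are needed.

Contribution: 1. UCIS4K, the first underwater camouflaged instance segmentation dataset; 2. the UCIS-SAM network, whose three modules (CBOM, FDTIM, and MFFAM) improve segmentation accuracy for underwater camouflaged objects.

Method: UCIS-SAM has three modules: CBOM enhances channel characteristics; FDTIM emphasizes intrinsic object features and reduces interference from camouflage patterns; MFFAM strengthens the boundaries of low-contrast objects by aggregating features across multiple frequency bands.

Result: Experiments on UCIS4K and public benchmarks show that UCIS-SAM outperforms state-of-the-art approaches.

Insight: Segmenting underwater camouflaged objects calls for dedicated datasets and network design, with particular attention to channel balance, frequency-domain features, and multi-scale feature fusion.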

Abstract: With the development of underwater exploration and marine protection, underwater vision tasks are widespread. Due to the degraded underwater environment, characterized by color distortion, low contrast, and blurring, camouflaged instance segmentation (CIS) faces greater challenges in accurately segmenting objects that blend closely with their surroundings. Traditional camouflaged instance segmentation methods, trained on terrestrial-dominated datasets with limited underwater samples, may exhibit inadequate performance in underwater scenes. To address these issues, we introduce the first underwater camouflaged instance segmentation (UCIS) dataset, abbreviated as UCIS4K, which comprises 3,953 images of camouflaged marine organisms with instance-level annotations. In addition, we propose an Underwater Camouflaged Instance Segmentation network based on Segment Anything Model (UCIS-SAM). Our UCIS-SAM includes three key modules. First, the Channel Balance Optimization Module (CBOM) enhances channel characteristics to improve underwater feature learning, effectively addressing the model’s limited understanding of underwater environments. Second, the Frequency Domain True Integration Module (FDTIM) is proposed to emphasize intrinsic object features and reduce interference from camouflage patterns, enhancing the segmentation performance of camouflaged objects blending with their surroundings. Finally, the Multi-scale Feature Frequency Aggregation Module (MFFAM) is designed to strengthen the boundaries of low-contrast camouflaged instances across multiple frequency bands, improving the model’s ability to achieve more precise segmentation of camouflaged objects. Extensive experiments on the proposed UCIS4K and public benchmarks show that our UCIS-SAM outperforms state-of-the-art approaches.

[185] Integrating BIM and UAV-based photogrammetry for Automated 3D Structure Model Segmentation

Siqi Chen,Shanyue Guan

Main category: cs.CV

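TL;DR: The paper proposes a machine learning framework that combines BIM with UAV-based photogrammetry to automatically segment components in 3D structural models, addressing the inefficiency of manual labeling.

Details Motivation: UAVs and photogrammetry can efficiently capture high-resolution scans and reconstruct 3D infrastructure models, but manually segmenting structural components is time-consuming and error-prone.

Contribution: 1. A machine learning-based automated segmentation framework; 2. a combination of real UAV-scanned point clouds with BIM-generated synthetic data that reduces reliance on manual labeling; 3. validation of high segmentation accuracy on a railroad track dataset.

Method: Train a machine learning model on real UAV-captured point clouds complemented by BIM-generated synthetic data to segment 3D point clouds automatically.

Result: The model reached high segmentation accuracy on a railroad track dataset, and supplementing smaller-scale datasets with BIM data significantly reduced training time.

Insight: The framework shows the potential of combining UAV and BIM technologies, raising the level of automation in 3D infrastructure model segmentation and providing an efficient tool for structural health monitoring and infrastructure management.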

Abstract: The advancement of UAV technology has enabled efficient, non-contact structural health monitoring. Combined with photogrammetry, UAVs can capture high-resolution scans and reconstruct detailed 3D models of infrastructure. However, a key challenge remains in segmenting specific structural components from these models, a process traditionally reliant on time-consuming and error-prone manual labeling. To address this issue, we propose a machine learning-based framework for automated segmentation of 3D point clouds. Our approach uses the complementary strengths of real-world UAV-scanned point clouds and synthetic data generated from Building Information Modeling (BIM) to overcome the limitations associated with manual labeling. Validation on a railroad track dataset demonstrated high accuracy in identifying and segmenting major components such as rails and crossties. Moreover, by using smaller-scale datasets supplemented with BIM data, the framework significantly reduced training time while maintaining reasonable segmentation accuracy. This automated approach improves the precision and efficiency of 3D infrastructure model segmentation and advances the integration of UAV and BIM technologies in structural health monitoring and infrastructure management.

[186] One Dinomaly2 Detect Them All: A Unified Framework for Full-Spectrum Unsupervised Anomaly Detection

Jia Guo,Shuai Lu,Lei Fan,Zelin Li,Donglin Di,Yang Song,Weihang Zhang,Wenbing Zhu,Hong Yan,Fang Chen,Huiqi Li,Hongen Liao

Main category: cs.CV

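TL;DR: Dinomaly2 is the first unified framework for full-spectrum image unsupervised anomaly detection (UAD); it closes the performance gap for multi-class models and extends to diverse data modalities and task settings.

Details Motivation: Existing multi-class UAD models lag well behind single-class models, and the field is fragmented into specialized methods, so a unified solution is needed to simplify deployment.

Contribution: Dinomaly2, the first unified framework for full-spectrum image UAD, which performs strongly on multi-class tasks and extends to multiple modalities and task settings.

Method: Following a "less is more" philosophy, it achieves high performance by orchestrating five simple elements within a standard reconstruction-based framework, and extends to diverse tasks without modification.

Result: Strong results on 12 UAD benchmarks; for example, the multi-class model achieves 99.9% and 99.3% I-AUROC on MVTec-AD and VisA, respectively.

Insight: Methodological simplicity is the foundation of true universality; Dinomaly2 demonstrates the potential of a unified framework across tasks and modalities.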

Abstract: Unsupervised anomaly detection (UAD) has evolved from building specialized single-class models to unified multi-class models, yet existing multi-class models significantly underperform the most advanced one-for-one counterparts. Moreover, the field has fragmented into specialized methods tailored to specific scenarios (multi-class, 3D, few-shot, etc.), creating deployment barriers and highlighting the need for a unified solution. In this paper, we present Dinomaly2, the first unified framework for full-spectrum image UAD, which bridges the performance gap in multi-class models while seamlessly extending across diverse data modalities and task settings. Guided by the “less is more” philosophy, we demonstrate that the orchestration of five simple elements achieves superior performance in a standard reconstruction-based framework. This methodological minimalism enables natural extension across diverse tasks without modification, establishing that simplicity is the foundation of true universality. Extensive experiments on 12 UAD benchmarks demonstrate Dinomaly2’s full-spectrum superiority across multiple modalities (2D, multi-view, RGB-3D, RGB-IR), task settings (single-class, multi-class, inference-unified multi-class, few-shot) and application domains (industrial, biological, outdoor). For example, our multi-class model achieves unprecedented 99.9% and 99.3% image-level (I-) AUROC on MVTec-AD and VisA respectively. For multi-view and multi-modal inspection, Dinomaly2 demonstrates state-of-the-art performance with minimum adaptations. Moreover, using only 8 normal examples per class, our method surpasses previous full-shot models, achieving 98.7% and 97.4% I-AUROC on MVTec-AD and VisA. The combination of minimalistic design, computational scalability, and universal applicability positions Dinomaly2 as a unified solution for the full spectrum of real-world anomaly detection applications.
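For readers unfamiliar with the headline metric, image-level AUROC is the probability that a randomly chosen anomalous image receives a higher anomaly score than a randomly chosen normal image, with ties counted as half. The sketch below computes it by direct pairwise comparison; the function name and the toy scores are illustrative and are not taken from the paper or its evaluation code.

```python
def image_level_auroc(scores, labels):
    """AUROC from one anomaly score per image.

    labels: 1 for anomalous images, 0 for normal images.
    Uses the pairwise (Mann-Whitney) formulation, which is quadratic in
    the number of images but easy to verify by hand.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal image")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Three of the four (anomalous, normal) pairs are ranked correctly.
print(image_level_auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```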

[187] Self-supervised Pre-training for Mapping of Archaeological Stone Wall in Historic Landscapes Using High-Resolution DEM Derivatives

Zexian Huang,Mashnoon Islam,Brian Armstrong,Kourosh Khoshelham,Martin Tomko

Main category: cs.CV

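TL;DR: The paper proposes DINO-CV, a framework that uses a self-supervised pre-training strategy and high-resolution DEM derivatives to address the difficulty of identifying dry-stone walls under vegetation and the scarcity of labeled data, enabling automated dry-stone wall segmentation.

Details Motivation: Dry-stone walls have significant heritage and environmental value, but vegetation occlusion and scarce labeled data make them hard to identify efficiently with conventional methods.

Contribution: 1) The DINO-CV framework, which uses DEM derivatives to overcome vegetation occlusion; 2) a self-supervised cross-view pre-training strategy that mitigates data scarcity; 3) support for transfer learning with various vision backbones.

Method: 1) Use high-resolution LiDAR-derived DEMs to capture terrain structure; 2) perform self-supervised cross-view pre-training via knowledge distillation; 3) pair the approach with backbones including ResNet, Wide ResNet, and Vision Transformers.

Result: On test areas in the Budj Bim cultural landscape, DINO-CV achieves 68.6% mIoU and still maintains 63.8% mIoU when fine-tuned with only 10% of the labeled data.

Insight: High-resolution DEM derivatives and self-supervised learning effectively address vegetation occlusion and data scarcity, offering a new approach to automated segmentation in heritage environments.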

Abstract: Dry-stone walls hold significant heritage and environmental value. Mapping these structures is essential for ecosystem preservation and wildfire management in Australia. Yet, many walls remain unidentified due to their inaccessibility and the high cost of manual mapping. Deep learning-based segmentation offers a scalable solution, but two major challenges persist: (1) visual occlusion of low-lying walls by dense vegetation, and (2) limited labeled data for supervised training. We propose DINO-CV, a segmentation framework for automatic mapping of low-lying dry-stone walls using high-resolution Airborne LiDAR-derived digital elevation models (DEMs). DEMs overcome visual occlusion by capturing terrain structures hidden beneath vegetation, enabling analysis of structural rather than spectral cues. DINO-CV introduces a self-supervised cross-view pre-training strategy based on knowledge distillation to mitigate data scarcity. It learns invariant visual and geometric representations across multiple DEM derivatives, supporting various vision backbones including ResNet, Wide ResNet, and Vision Transformers. Applied to the UNESCO World Heritage cultural landscape of Budj Bim, Victoria, the method identifies one of Australia’s densest collections of colonial dry-stone walls beyond Indigenous heritage contexts. DINO-CV achieves a mean Intersection over Union (mIoU) of 68.6% on test areas and maintains 63.8% mIoU when fine-tuned with only 10% labeled data. These results demonstrate the potential of self-supervised learning on high-resolution DEM derivatives for automated dry-stone wall mapping in vegetated and heritage-rich environments with scarce annotations.
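The reported metric, mean Intersection over Union (mIoU), averages the per-class ratio of correctly labeled pixels to all pixels assigned to that class by either the prediction or the ground truth. The sketch below works on flat label lists; skipping classes absent from both prediction and ground truth is one common convention and an assumption here, since the abstract does not state how the authors handle that case.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two equal-length label sequences."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither sequence
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example with two classes (0 = background, 1 = wall):
# class 0 has IoU 1/2 and class 1 has IoU 2/3, so mIoU = 7/12.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```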

[188] Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs

Sébastien Thuau,Siba Haidar,Ayush Bajracharya,Rachid Chelouah

Main category: cs.CV

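TL;DR: The paper compares two frugal federated learning approaches to violence detection (LoRA-tuned vision-language models and a personalized 3D CNN), emphasizes energy efficiency and environmental metrics, and proposes a hybrid model.

Details Motivation: The study explores energy-efficient federated learning approaches that cope with non-IID data and energy consumption in violence detection.

Contribution: The first comparison of LoRA-tuned vision-language models and personalized CNNs for federated learning, with energy use and carbon emissions quantified and a hybrid model proposed.

Method: Compare two approaches: (i) zero-shot and federated fine-tuning of a vision-language model (LLaVA-7B), and (ii) personalized training of a 3D CNN (65.8M parameters). Accuracy, calibration, and energy usage are evaluated.

Result: Both approaches exceed 90% accuracy. The CNN is slightly better in ROC AUC and log loss while using less energy; the VLM remains favorable for contextual reasoning and multimodal inference.

Insight: Lightweight CNNs suit routine classification, and VLMs suit complex scenarios; a hybrid scheme is more sustainable for resource-aware AI.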

Abstract: We examine frugal federated learning approaches to violence detection by comparing two complementary strategies: (i) zero-shot and federated fine-tuning of vision-language models (VLMs), and (ii) personalized training of a compact 3D convolutional neural network (CNN3D). Using LLaVA-7B and a 65.8M parameter CNN3D as representative cases, we evaluate accuracy, calibration, and energy usage under realistic non-IID settings. Both approaches exceed 90% accuracy. CNN3D slightly outperforms Low-Rank Adaptation (LoRA)-tuned VLMs in ROC AUC and log loss, while using less energy. VLMs remain favorable for contextual reasoning and multimodal inference. We quantify energy and CO$_2$ emissions across training and inference, and analyze sustainability trade-offs for deployment. To our knowledge, this is the first comparative study of LoRA-tuned vision-language models and personalized CNNs for federated violence detection, with an emphasis on energy efficiency and environmental metrics. These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios. The resulting framework offers a reproducible baseline for responsible, resource-aware AI in video surveillance, with extensions toward real-time, multimodal, and lifecycle-aware systems.
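One way to read the hybrid model in the abstract is as a router that lets the lightweight CNN decide routine clips and escalates only uncertain ones to the VLM. The sketch below gates on the CNN's predicted probability; this criterion, the threshold values, and the function name are assumptions for illustration, since the abstract does not say how VLM activation is selected.

```python
def choose_model(cnn_violence_prob, low=0.2, high=0.8):
    """Return 'cnn' when the CNN is confident either way, else 'vlm'.

    cnn_violence_prob: the CNN's predicted probability that a clip is
    violent. Probabilities strictly between `low` and `high` are treated
    as uncertain and escalated to the more expensive VLM.
    """
    if cnn_violence_prob <= low or cnn_violence_prob >= high:
        return "cnn"
    return "vlm"

print(choose_model(0.95))  # cnn
print(choose_model(0.50))  # vlm
print(choose_model(0.05))  # cnn
```

Widening the band between `low` and `high` sends more clips to the VLM, which trades energy for the contextual reasoning the abstract attributes to VLMs.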

[189] PICABench: How Far Are We from Physically Realistic Image Editing?

Yuandong Pu,Le Zhuo,Songhao Han,Jinbo Xing,Kaiwen Zhu,Shuo Cao,Bin Fu,Si Liu,Hongsheng Li,Yu Qiao,Wenlong Zhang,Xi Chen,Yihao Liu

Main category: cs.CV

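TL;DR: PICABench is a new benchmark that systematically evaluates the physical realism of image editing across sub-dimensions spanning optics, mechanics, and state transitions, accompanied by PICAEval, a reliable evaluation protocol. The authors also explore learning physical effects from videos and build a training dataset, PICA-100K.

Details Motivation: Existing image editing models and benchmarks focus mainly on instruction completion and overlook the realism of physical effects (such as shadows and reflections). PICABench is proposed to fill this gap and push image editing from simple content manipulation toward physically consistent realism.

Contribution: 1. The PICABench benchmark, which systematically evaluates the physical realism of image editing; 2. the PICAEval protocol, which combines VLM-as-a-judge with human annotations; 3. an exploration of learning physical effects from videos, together with the PICA-100K dataset.

Method: 1. Define eight sub-dimensions (optics, mechanics, etc.) for evaluating physical realism; 2. evaluate with a VLM-as-a-judge protocol supported by human annotations; 3. learn physical effects from videos and construct the PICA-100K dataset.

Result: The evaluation shows that existing models still have substantial room for improvement in physical realism, especially in handling complex physical effects.

Insight: Physical effects are key to realistic image editing, yet current techniques fall short; learning physical effects from videos is a potentially effective solution.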

Abstract: Image editing has achieved remarkable progress recently. Modern editing models can already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are the key to the generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimensions (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with much room to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.

[190] Intelligent Communication Mixture-of-Experts Boosted-Medical Image Segmentation Foundation Model

Xinwei Zhang,Hu Chen,Zhe Yuan,Sukun Tian,Peng Feng

Main category: cs.CV

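TL;DR: The paper proposes IC-MoE, which uses a mixture-of-experts (MoE) mechanism and semantic-guided contrastive learning to address insufficient high-level feature representation and the disrupted structural integrity of pretrained weights in medical image segmentation foundation models. Experiments show it outperforms existing SOTA models.

Details Motivation: Existing fine-tuning methods for medical image segmentation foundation models represent high-level features insufficiently and disrupt the structural integrity of pretrained weights, which limits performance.

Contribution: 1. A combination of an MoE mechanism with an adaptive voting strategy that strengthens high-level feature representation; 2. a semantic-guided contrastive learning method that addresses weak supervision in conventional contrastive learning.

Method: 1. Construct basic experts, semantic experts, and adaptive experts, with a pixel probability adaptive voting strategy for expert selection and fusion; 2. apply semantic-guided contrastive learning to further improve feature representation.

Result: Experiments on three public medical image segmentation datasets show that IC-MoE outperforms other SOTA models and generalizes well.

Insight: By combining an MoE mechanism with contrastive learning, IC-MoE strengthens feature representation while preserving the structural integrity of pretrained weights, offering a more effective approach to medical image segmentation.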

Abstract: Foundation models for medical image segmentation have achieved remarkable performance. Adaptive fine-tuning of natural image segmentation foundation models is crucial for medical image segmentation tasks. However, some limitations exist in existing fine-tuning methods: 1) insufficient representation of high-level features and 2) the fine-tuning process disrupts the structural integrity of pretrained weights. Inspired by these critical problems, we propose an intelligent communication mixture-of-experts boosted-medical image segmentation foundation model, named IC-MoE, with twofold ideas: 1) We construct basic experts, semantic experts, and adaptive experts. Moreover, we implement a pixel probability adaptive voting strategy, which enables expert selection and fusion through label consistency and load balancing. This approach preliminarily enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. 2) We propose a semantic-guided contrastive learning method to address the issue of weak supervision in contrastive learning. This method further enhances the representation capability of high-level features while preserving the structural integrity of pretrained weights. Extensive experiments across three public medical image segmentation datasets demonstrate that the IC-MoE outperforms other SOTA models. Consequently, the proposed IC-MoE effectively supplements foundational medical image segmentation models with high-level features and pretrained structural integrity. We also validate the superior generalizability of the IC-MoE across diverse medical image segmentation scenarios.

[191] Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning

Min Cao,Xinyu Zhou,Ding Jiang,Bo Du,Mang Ye,Min Zhang

Main category: cs.CV

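TL;DR: The paper introduces a multilingual text-to-image person retrieval (TIPR) task and builds a multilingual TIPR benchmark. Its Bidirectional Implicit Relation Reasoning and Aligning framework (Bi-IRRA) tackles modality heterogeneity and language diversity, achieving new SOTA results.

Details Motivation: Existing TIPR methods are English-centric and neglect multilingual needs; global methods overlook fine-grained differences, while local methods depend on prior information. The paper aims to address these issues.

Contribution: 1. A multilingual TIPR task and benchmark dataset; 2. the Bi-IRRA framework, which aligns languages and modalities through a bidirectional implicit relation reasoning module and a multi-dimensional global alignment module.

Method: Bi-IRRA comprises a bidirectional implicit relation reasoning module (which strengthens local relation modeling through bidirectional masked prediction) and a multi-dimensional global alignment module (which bridges modality heterogeneity).

Result: New SOTA results on all multilingual TIPR datasets.

Insight: Introducing the multilingual TIPR task broadens applicability; bidirectional implicit relation reasoning avoids reliance on prior information and models cross-modal relations more flexibly.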

Abstract: Text-to-image person retrieval (TIPR) aims to identify a target person from textual descriptions and faces the challenge of modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, and a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets. Data and code are available at https://github.com/Flame-Chasers/Bi-IRRA.

[192] Towards 3D Objectness Learning in an Open World

Taichi Liu,Zhenyu Wang,Ruofeng Liu,Guang Wang,Desheng Zhang

Main category: cs.CV

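TL;DR: The paper proposes OP3Det, an open-world 3D detector that needs no hand-crafted text prompts. By combining 2D semantic priors with 3D geometric priors it detects novel objects and significantly outperforms existing methods.

Details Motivation: Existing 3D object detectors generalize poorly to open-world scenes, and traditional methods struggle to detect novel objects. The paper aims to learn generalized 3D objectness so that all objects can be detected, including novel ones unseen during training.

Contribution: Proposes OP3Det, presented as the first open-world 3D detector that requires no text prompts; brings in the generalization ability of 2D foundation models and combines it with dynamic routing of multi-modal features to improve 3D object discovery.

Method: Generates class-agnostic object proposals from 2D semantic and 3D geometric priors; dynamically fuses point-cloud and RGB image features in a cross-modal mixture of experts to learn generalized 3D objectness.

Result: OP3Det surpasses existing open-world 3D detectors by up to 16.0% in AR and improves on closed-world 3D detectors by 13.5%.

Insight: Combining the generalization ability of 2D foundation models with multi-modal feature fusion is an effective route to better open-world 3D object detection.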

Abstract: 3D object detection and novel category detection have made significant progress recently, yet research on learning generalized 3D objectness remains insufficient. In this paper, we delve into learning open-world 3D objectness, which focuses on detecting all objects in a 3D scene, including novel objects unseen during training. Traditional closed-set 3D detectors struggle to generalize to open-world scenarios, while directly incorporating 3D open-vocabulary models for open-world ability struggles with vocabulary expansion and semantic overlap. To achieve generalized 3D object discovery, we propose OP3Det, a class-agnostic Open-World Prompt-free 3D Detector that detects any object within 3D scenes without relying on hand-crafted text prompts. We introduce the strong generalization and zero-shot capabilities of 2D foundation models, utilizing both 2D semantic priors and 3D geometric priors for class-agnostic proposals to broaden 3D object discovery. Then, by integrating complementary information from point cloud and RGB image in the cross-modal mixture of experts, OP3Det dynamically routes uni-modal and multi-modal features to learn generalized 3D objectness. Extensive experiments demonstrate the extraordinary performance of OP3Det, which significantly surpasses existing open-world 3D detectors by up to 16.0% in AR and achieves a 13.5% improvement compared to closed-world 3D detectors.

[193] Elastic ViTs from Pretrained Models without Retraining

Walter Simoncini,Michael Dorkenwald,Tijmen Blankevoort,Cees G. M. Snoek,Yuki M. Asano

Main category: cs.CV

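TL;DR: SnapViT is a retraining-free pruning method for Vision Transformers that adapts to any compute budget. It combines gradient information with cross-network structure correlations, needs neither labeled data nor retraining, and outperforms existing methods.

Details Motivation: Existing vision foundation models come in fixed sizes and cannot flexibly fit real-world compute constraints, which calls for an elastic pruning method that requires no retraining.

Contribution: 1. An efficient pruning strategy; 2. A novel evolutionary approximation of Hessian off-diagonal structures; 3. A self-supervised importance scoring mechanism.

Method: Combines gradient information with cross-network structure correlations, approximating the Hessian structure with an evolutionary algorithm, without labeled data or retraining.

Result: Performs strongly on DINO, SigLIPv2, and other models; generating elastic models takes less than five minutes on a single A100 GPU.

Insight: High-performing pruning is achievable without retraining or labeled data, which significantly reduces deployment cost.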

Abstract: Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers, a new post-pretraining structured pruning method that enables elastic inference across a continuum of compute budgets. Our approach efficiently combines gradient information with cross-network structure correlations, approximated via an evolutionary algorithm, does not require labeled data, generalizes to models without a classification head, and is retraining-free. Experiments on DINO, SigLIPv2, DeIT, and AugReg models demonstrate superior performance over state-of-the-art methods across various sparsities, requiring less than five minutes on a single A100 GPU to generate elastic models that can be adjusted to any computational budget. Our key contributions include an efficient pruning strategy for pretrained Vision Transformers, a novel evolutionary approximation of Hessian off-diagonal structures, and a self-supervised importance scoring mechanism that maintains strong performance without requiring retraining or labels. Code and pruned models are available at: https://elastic.ashita.nl/

[194] Automatic Classification of Circulating Blood Cell Clusters based on Multi-channel Flow Cytometry Imaging

Suqiang Ma,Subhadeep Sengupta,Yao Lee,Beikang Gu,Xianyan Chen,Xianqiao Wang,Yang Liu,Mengjia Xu,Galit H. Frydman,He Li

Main category: cs.CV

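TL;DR: The paper proposes a framework that automatically classifies circulating blood cell clusters (CCCs) from multi-channel flow cytometry images, using a two-step strategy to achieve accurate classification and phenotype identification.

Details Motivation: Circulating blood cell clusters (CCCs) are important biomarkers for conditions such as thrombosis, infection, and inflammation. Tools for automatically analyzing CCC images are lacking, and the clusters' irregular shapes and heterogeneous cell types make the analysis harder.

Contribution: Proposes a new computational framework that combines a YOLOv11 model with multi-channel fluorescence staining to classify CCC images automatically and identify cell types.

Method: 1. Classify images as cluster or non-cluster with a YOLOv11 model; 2. Identify cell types by overlaying cluster contours with regions from multi-channel fluorescence stains.

Result: Achieves over 95% accuracy in both CCC classification and phenotype identification.

Insight: Beyond blood cell analysis, the framework could extend to immune and tumor cell clusters, supporting cellular research across a range of diseases.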

Abstract: Circulating blood cell clusters (CCCs) containing red blood cells (RBCs), white blood cells (WBCs), and platelets are significant biomarkers linked to conditions like thrombosis, infection, and inflammation. Flow cytometry, paired with fluorescence staining, is commonly used to analyze these cell clusters, revealing cell morphology and protein profiles. While computational approaches based on machine learning have advanced the automatic analysis of single-cell flow cytometry images, there is a lack of effort to build tools to automatically analyze images containing CCCs. Unlike single cells, cell clusters often exhibit irregular shapes and sizes. In addition, these cell clusters often consist of heterogeneous cell types, which require multi-channel staining to identify the specific cell types within the clusters. This study introduces a new computational framework for analyzing CCC images and identifying cell types within clusters. Our framework uses a two-step analysis strategy. First, it categorizes images into cell cluster and non-cluster groups by fine-tuning the You Only Look Once (YOLOv11) model, which outperforms traditional convolutional neural networks (CNNs) and Vision Transformers (ViT). Then, it identifies cell types by overlaying cluster contours with regions from multi-channel fluorescence stains, enhancing accuracy despite cell debris and staining artifacts. This approach achieved over 95% accuracy in both cluster classification and phenotype identification. In summary, our automated framework effectively analyzes CCC images from flow cytometry, leveraging both bright-field and fluorescence data. Initially tested on blood cells, it holds potential for broader applications, such as analyzing immune and tumor cell clusters, supporting cellular research across various diseases.
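The second step described above, overlaying cluster contours with fluorescence-stained regions, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `identify_cell_types`, the binary-mask representation, the channel labels, and the `min_overlap` threshold are all assumptions made for illustration, and the abstract does not say how overlap is actually scored.

```python
def identify_cell_types(cluster_mask, channel_masks, min_overlap=0.2):
    """Return the cell types whose stain overlaps the cluster region.

    cluster_mask: 2D list of 0/1 marking pixels inside the cluster contour.
    channel_masks: dict mapping a cell-type label to a 2D 0/1 mask of
        thresholded fluorescence signal (same shape as cluster_mask).
    min_overlap: hypothetical minimum fraction of cluster pixels that must
        be stained; smaller specks are treated as debris or artifacts.
    """
    # Collect the coordinates of every pixel inside the cluster contour.
    cluster_pixels = {
        (r, c)
        for r, row in enumerate(cluster_mask)
        for c, value in enumerate(row)
        if value
    }
    if not cluster_pixels:
        return []

    found = []
    for label, mask in channel_masks.items():
        # Count stained pixels that fall inside the cluster region only.
        stained = sum(1 for (r, c) in cluster_pixels if mask[r][c])
        if stained / len(cluster_pixels) >= min_overlap:
            found.append(label)
    return sorted(found)


# Toy 4x4 example: a 6-pixel cluster and three hypothetical stain channels.
cluster = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
channels = {
    "platelet": [[0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    "rbc":      [[0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 0]],
    "wbc":      [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]],
}
print(identify_cell_types(cluster, channels))
```

In this toy input the "wbc" signal lies entirely outside the contour, so it is ignored. Restricting the count to pixels inside the contour is what lets stray stain elsewhere in the image be disregarded.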

[195] MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

Yaning Pan,Zekun Wang,Qianqian Xie,Yongqian Wen,Yuanxing Zhang,Guohui Zhang,Haoxuan Hu,Zhiyu Pan,Yibing Huang,Zhidong Gan,Yonghong Lin,An Ping,Tianhao Peng,Jiaheng Liu

Main category: cs.CV

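TL;DR: MT-Video-Bench is a new video understanding benchmark for evaluating multimodal large language models (MLLMs) in multi-turn dialogues, filling the gap left by existing benchmarks that cover only single-turn question answering.

Details Motivation: Existing benchmarks focus on single-turn question answering and overlook the complexity of multi-turn dialogue in real-world scenarios. A comprehensive benchmark is therefore needed to evaluate the video understanding ability of MLLMs in multi-turn dialogues.

Contribution: Proposes MT-Video-Bench, a video understanding benchmark focused on multi-turn dialogue, comprising 987 carefully curated multi-turn dialogues that assess six core competencies centered on perceptivity and interactivity.

Method: Collects video data from diverse domains and designs multi-turn dialogue tasks to evaluate the perceptivity and interactivity of MLLMs.

Result: Evaluates a range of open-source and closed-source MLLMs, revealing their performance gaps and limitations in handling multi-turn video dialogues.

Insight: Multi-turn dialogue evaluation places higher demands on MLLMs, and MT-Video-Bench offers an important reference for future research.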

Abstract: The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI’s ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.

[196] Can Image-To-Video Models Simulate Pedestrian Dynamics?

Aaron Appelle,Jerome P. Lynch

Main category: cs.CV

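TL;DR: The paper investigates whether image-to-video (I2V) models based on the diffusion transformer (DiT) can simulate realistic pedestrian dynamics, by evaluating the videos they generate.

Details Motivation: Recent DiT-based I2V models trained on large video datasets have shown strong world-modeling capabilities. The authors ask whether these models can generate realistic pedestrian movement patterns in crowded public scenes.

Contribution: Proposes a framework that conditions I2V models on keyframes from pedestrian trajectory benchmarks to evaluate how well they simulate pedestrian dynamics.

Method: Feeds keyframes from pedestrian trajectory benchmarks to I2V models as conditioning input, then evaluates the trajectory prediction performance of the generated videos using quantitative measures of pedestrian dynamics.

Result: The results indicate that I2V models show some promise for simulating pedestrian dynamics but still need further improvement.

Insight: I2V models are useful beyond video generation and can exhibit world-modeling ability in specific settings such as pedestrian dynamics simulation.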

Abstract: Recent high-performing image-to-video (I2V) models based on variants of the diffusion transformer (DiT) have displayed remarkable inherent world-modeling capabilities by virtue of training on large scale video datasets. We investigate whether these models can generate realistic pedestrian movement patterns in crowded public scenes. Our framework conditions I2V models on keyframes extracted from pedestrian trajectory benchmarks, then evaluates their trajectory prediction performance using quantitative measures of pedestrian dynamics.

[197] Towards Explainable Skin Cancer Classification: A Dual-Network Attention Model with Lesion Segmentation and Clinical Metadata Fusion

Md. Enamul Atiq,Shaikh Anowarul Fattah

Main category: cs.CV

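TL;DR: The paper proposes a dual-encoder attention-based framework that combines lesion segmentation with clinical metadata to improve the accuracy and interpretability of skin cancer classification. Using a Deep-UNet architecture and attention mechanisms, the model performs well on both segmentation and classification, and heatmaps are used to validate its reliability.

Details Motivation: Early detection of skin cancer is critical to patient outcomes, but many existing deep learning models are "black boxes" that lack interpretability. High intra-class variability and subtle inter-class differences between lesions make classification harder still.

Contribution: 1. Proposes a dual-encoder attention framework that combines lesion segmentation with clinical metadata. 2. Designs a Deep-UNet architecture with Dual Attention Gates (DAG) and Atrous Spatial Pyramid Pooling (ASPP) for lesion segmentation. 3. Fuses features through multi-head cross-attention and incorporates patient metadata. 4. Shows strong results on the HAM10000 and ISIC datasets and validates interpretability with Grad-CAM.

Method: 1. Segment lesions with Deep-UNet, using DAG and ASPP. 2. Classify with two DenseNet201 encoders whose features are fused via multi-head cross-attention. 3. Incorporate clinical metadata through a transformer-based module.

Result: On the HAM10000, ISIC 2018, and ISIC 2019 datasets, the model reaches state-of-the-art segmentation performance and significantly improves classification accuracy and average AUC over baseline models.

Insight: 1. Fusing lesion segmentation with clinical data improves both interpretability and accuracy. 2. Attention mechanisms and multimodal fusion help the model focus on the lesion region and avoid background interference.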

Abstract: Skin cancer is a life-threatening disease where early detection significantly improves patient outcomes. Automated diagnosis from dermoscopic images is challenging due to high intra-class variability and subtle inter-class differences. Many deep learning models operate as “black boxes,” limiting clinical trust. In this work, we propose a dual-encoder attention-based framework that leverages both segmented lesions and clinical metadata to enhance skin lesion classification in terms of both accuracy and interpretability. A novel Deep-UNet architecture with Dual Attention Gates (DAG) and Atrous Spatial Pyramid Pooling (ASPP) is first employed to segment lesions. The classification stage uses two DenseNet201 encoders, one on the original image and another on the segmented lesion, whose features are fused via multi-head cross-attention. This dual-input design guides the model to focus on salient pathological regions. In addition, a transformer-based module incorporates patient metadata (age, sex, lesion site) into the prediction. We evaluate our approach on the HAM10000 dataset and the ISIC 2018 and 2019 challenges. The proposed method achieves state-of-the-art segmentation performance and significantly improves classification accuracy and average AUC compared to baseline models. To validate our model’s reliability, we use Gradient-weighted Class Activation Mapping (Grad-CAM) to generate heatmaps. These visualizations confirm that our model’s predictions are based on the lesion area, unlike models that rely on spurious background features. These results demonstrate that integrating precise lesion segmentation and clinical data with attention-based fusion leads to a more accurate and interpretable skin cancer classification model.

[198] SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

Samir Khaki,Junxian Guo,Jiaming Tang,Shang Yang,Yukang Chen,Konstantinos N. Plataniotis,Yao Lu,Song Han,Zhijian Liu

Main category: cs.CV

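TL;DR: SparseVILA is a new paradigm for efficient vision language model (VLM) inference. By decoupling how visual sparsity is handled in the prefilling and decoding stages, it substantially speeds up inference while preserving multi-turn accuracy.

Details Motivation: When existing VLMs process high-resolution images, long videos, and multi-turn conversations, the number of visual tokens grows sharply and inference latency rises with it, limiting scalability.

Contribution: Proposes SparseVILA, a training-free, architecture-agnostic framework that decouples visual sparsity and significantly accelerates VLM inference without sacrificing capability.

Method: SparseVILA prunes redundant visual tokens during prefill and retrieves only query-relevant tokens during decoding, decoupling sparsity across the two stages. Combined with an AWQ-optimized inference pipeline, this yields efficient multimodal inference.

Result: On long-context video tasks, prefilling is up to 4.0 times faster, decoding up to 2.5 times faster, and end-to-end inference up to 2.6 times faster, while accuracy improves on document-understanding and reasoning tasks.

Insight: By decoupling query-agnostic pruning from query-aware retrieval, SparseVILA opens a new direction for efficient multimodal inference that applies to existing large VLMs without retraining.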

Abstract: Vision Language Models (VLMs) have rapidly advanced in integrating visual and textual reasoning, powering applications across high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens that dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity by retaining most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks – while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning and query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.

[199] ConsistEdit: Highly Consistent and Precise Training-free Visual Editing

Zixin Yin,Ling-Hao Chen,Lionel Ni,Xili Dai

Main category: cs.CV

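TL;DR: The paper proposes ConsistEdit, a training-free visual editing technique for MM-DiT models. It resolves the trade-off between editing strength and consistency in existing methods, supports multi-round and multi-region editing, and is shown experimentally to perform better.

Details Motivation: Existing training-free attention control methods for text-guided editing struggle to balance editing strength against consistency. Errors accumulate in multi-round and video editing in particular, and the methods cannot modify fine-grained attributes individually.

Contribution: 1. Proposes ConsistEdit, an attention control mechanism designed specifically for MM-DiT; 2. Performs editing across all inference steps and attention layers without handcrafting, reportedly for the first time; 3. Supports progressive adjustment of structural consistency and multi-region editing.

Method: 1. Vision-only attention control; 2. Mask-guided pre-attention fusion; 3. Differentiated manipulation of the query, key, and value tokens.

Result: Experiments show that ConsistEdit achieves state-of-the-art performance across a range of image and video editing tasks and significantly improves reliability and consistency.

Insight: 1. The architectural changes in MM-DiT open new possibilities for attention control; 2. Control across all layers and progressive adjustment are key to high-quality editing.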

Abstract: Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing capabilities for existing generation models. However, current approaches struggle to simultaneously deliver strong editing strength while preserving consistency with the source. This limitation becomes particularly critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating text and vision modalities. These advancements pave the way for overcoming challenges that previous methods failed to resolve. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanisms. Building on these, we propose ConsistEdit, a novel attention control method specifically tailored for MM-DiT. ConsistEdit incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without handcrafting, significantly enhancing reliability and consistency, which enables robust multi-round and multi-region editing. Furthermore, it supports progressive adjustment of structural consistency, enabling finer control.

q-fin.ST [Back]

[200] Comparing LLMs for Sentiment Analysis in Financial Market News

Lucas Eduardo Pereira Teles,Carlos M. S. Figueiredo

Main category: q-fin.ST

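TL;DR: The paper compares large language models (LLMs) with classical approaches on sentiment analysis of financial market news and finds that LLMs outperform classical models in the vast majority of cases.

Details Motivation: Sentiment analysis of financial news matters for market decisions; the study aims to quantify the performance difference between LLMs and classical approaches on this task.

Contribution: Compares LLMs with classical approaches on financial news sentiment analysis and demonstrates the advantage of LLMs.

Method: A comparative study that quantifies the performance of different models on financial news sentiment analysis.

Result: LLMs outperform classical models on financial news sentiment analysis in the vast majority of cases.

Insight: LLMs show strong potential for natural language processing tasks in specific domains such as finance and merit further study.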

Abstract: This article presents a comparative study of large language models (LLMs) in the task of sentiment analysis of financial market news. This work aims to analyze the performance difference of these models in this important natural language processing task within the context of finance. LLMs are compared with classical approaches, allowing for the quantification of the benefits of each tested model or approach. Results show that large language models outperform classical models in the vast majority of cases.

cs.AI [Back]

[201] ScholarEval: Research Idea Evaluation Grounded in Literature

Hanane Nour Moussa,Patrick Queiroz Da Silva,Daniel Adu-Ampratwum,Alyson East,Zitong Lu,Nikki Puccetti,Mingyi Xue,Huan Sun,Bodhisattwa Prasad Majumder,Sachin Kumar

Main category: cs.AI

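TL;DR: ScholarEval is a literature-grounded, retrieval-based framework for evaluating research ideas against two core criteria, soundness and contribution. It outperforms baseline models and shows higher coverage and usefulness on the expert-annotated ScholarIdeas dataset.

Details Motivation: As AI tools become common in research ideation, a reliable evaluation method is needed to check the validity and usefulness of generated research ideas.

Contribution: Proposes the ScholarEval framework and releases ScholarIdeas, the first expert-annotated multi-domain dataset of research ideas.

Method: Uses retrieval augmentation to assess research ideas for soundness (empirical validity based on existing literature) and contribution (degree of advancement relative to prior research).

Result: ScholarEval achieves significantly higher coverage than all baselines and is consistently preferred over the strongest baseline, OpenAI's o4-mini-deep-research.

Insight: By combining literature retrieval with multi-dimensional evaluation, ScholarEval offers a more comprehensive and reliable approach to automatic evaluation of research ideas.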

Abstract: As AI tools become increasingly common for research ideation, robust evaluation is critical to ensure the validity and usefulness of generated ideas. We introduce ScholarEval, a retrieval-augmented evaluation framework that assesses research ideas based on two fundamental criteria: soundness (the empirical validity of proposed methods based on existing literature) and contribution (the degree of advancement made by the idea across different dimensions relative to prior research). To evaluate ScholarEval, we introduce ScholarIdeas, the first expert-annotated dataset of multi-domain research ideas and reviews, comprising 117 ideas across four disciplines: artificial intelligence, neuroscience, biochemistry, and ecology. Our evaluation shows that ScholarEval achieves significantly higher coverage of points mentioned in the human expert annotated rubrics in ScholarIdeas compared to all baselines. Furthermore, ScholarEval is consistently preferred over our strongest baseline o4-mini-deep-research, a reasoning and search-enabled agentic system by OpenAI, in terms of evaluation actionability, depth, and evidence support. Our large-scale user study also shows that ScholarEval significantly outperforms deep research in literature engagement, idea refinement, and usefulness. We openly release our code, dataset, and ScholarEval tool for the community to use and build on.

[202] A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications

Minhua Lin,Zongyu Wu,Zhichao Xu,Hui Liu,Xianfeng Tang,Qi He,Charu Aggarwal,Hui Liu,Xiang Zhang,Suhang Wang

Main category: cs.AI

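TL;DR: The paper presents itself as the first comprehensive survey of reinforcement learning (RL)-based agentic search, focusing on RL's roles relative to retrieval-augmented generation (RAG), its optimization strategies, and its scope of application.

Details Motivation: Large language models (LLMs) suffer from static knowledge, factual hallucinations, and an inability to access real-time information, and traditional RAG pipelines lack adaptive control over retrieval and reasoning. Agentic search combined with RL offers a way forward.

Contribution: Provides the first systematic survey of RL-based agentic search, organized along three dimensions (functional roles, optimization strategies, and scope of application), and summarizes representative methods, evaluation protocols, and applications.

Method: Covers RL-based agentic search frameworks in which models plan, retrieve, and reflect through multi-step interaction, with RL's adaptive learning used to optimize search behavior.

Result: Maps the current use of RL in agentic search, summarizes evaluation protocols and applications, and outlines open challenges and future research directions.

Insight: RL gives agentic search the potential to adapt and self-improve; a key future direction is building reliable and scalable RL-driven agentic search systems.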

Abstract: The advent of large language models (LLMs) has transformed information access and reasoning through open-ended natural language interaction. However, LLMs remain limited by static knowledge, factual hallucinations, and the inability to retrieve real-time or domain-specific information. Retrieval-Augmented Generation (RAG) mitigates these issues by grounding model outputs in external evidence, but traditional RAG pipelines are often single-turn and heuristic, lacking adaptive control over retrieval and reasoning. Recent advances in agentic search address these limitations by enabling LLMs to plan, retrieve, and reflect through multi-step interaction with search environments. Within this paradigm, reinforcement learning (RL) offers a powerful mechanism for adaptive and self-improving search behavior. This survey provides the first comprehensive overview of RL-based agentic search, organizing the emerging field along three complementary dimensions: (i) What RL is for (functional roles), (ii) How RL is used (optimization strategies), and (iii) Where RL is applied (scope of optimization). We summarize representative methods, evaluation protocols, and applications, and discuss open challenges and future directions toward building reliable and scalable RL-driven agentic search systems. We hope this survey will inspire future research on the integration of RL and agentic search. Our repository is available at https://github.com/ventr1c/Awesome-RL-based-Agentic-Search-Papers.

[203] End-to-end Listen, Look, Speak and Act

Siyin Wang,Wenyi Yu,Xianzhao Chen,Xiaohai Tian,Jun Zhang,Lu Lu,Chao Zhang

Main category: cs.AI

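TL;DR: ELLSA is presented as the first full-duplex, end-to-end multimodal model. Using an SA-MoE architecture, it perceives and generates across vision, text, speech, and action simultaneously, enabling more natural human-machine interaction.

Details Motivation: Human interaction is multimodal and full-duplex. Existing models struggle to reproduce this, so a unified architecture is needed for natural cross-modal interaction.

Contribution: Proposes the SA-MoE architecture, which enables joint multimodal perception and concurrent generation and supports advanced interactive behaviors such as dialogue turn-taking and action barge-ins.

Method: The SA-MoE architecture routes each modality to specialized experts and fuses them through a unified self-attention backbone, mitigating modality interference.

Result: On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines while supporting a range of advanced interactive behaviors.

Insight: A unified cross-modal architecture is a meaningful step toward artificial general intelligence and offers a scalable solution for natural interaction.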

Abstract: Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling interaction patterns previously out of reach, yielding more natural, human-like behaviors. At its core is a novel SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines, while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code and model checkpoints will be released upon acceptance.

[204] See or Say Graphs: Agent-Driven Scalable Graph Understanding with Vision-Language Models

Shuo Han,Yukun Cao,Zezhong Ding,Zengyi Gao,S Kevin Zhou,Xike Xie

Main category: cs.AI

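TL;DR: GraphVista is a unified framework that organizes graph information hierarchically and introduces a planning agent, addressing the scalability and modality-coordination problems of vision-language models in graph understanding and significantly outperforming existing methods.

Details Motivation: In graph understanding, vision-language models (VLMs) are limited by input-token constraints, face scalability bottlenecks, and coordinate the textual and visual modalities poorly. GraphVista aims to address these challenges.

Contribution: 1) Proposes the GraphVista framework, which improves both scalability and modality coordination in graph understanding; 2) Introduces a lightweight GraphRAG base and a planning agent to improve task routing and information retrieval.

Method: 1) Organizes graph information hierarchically into a GraphRAG base and retrieves only task-relevant textual descriptions and high-resolution visual subgraphs; 2) A planning agent routes each task to the most suitable modality (text or visual).

Result: GraphVista scales to graphs up to 200 times larger than those in existing benchmarks and achieves up to a 4.4-fold quality improvement over state-of-the-art baselines.

Insight: By combining the complementary strengths of the textual and visual modalities, GraphVista demonstrates efficiency and flexibility in multimodal graph understanding.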

Abstract: Vision-language models (VLMs) have shown promise in graph understanding, but remain limited by input-token constraints, facing scalability bottlenecks and lacking effective mechanisms to coordinate textual and visual modalities. To address these challenges, we propose GraphVista, a unified framework that enhances both scalability and modality coordination in graph understanding. For scalability, GraphVista organizes graph information hierarchically into a lightweight GraphRAG base, which retrieves only task-relevant textual descriptions and high-resolution visual subgraphs, compressing redundant context while preserving key reasoning elements. For modality coordination, GraphVista introduces a planning agent that routes tasks to the most suitable modality, using the text modality for simple property reasoning and the visual modality for local and structurally complex reasoning grounded in explicit topology. Extensive experiments demonstrate that GraphVista scales to large graphs, up to 200× larger than those used in existing benchmarks, and consistently outperforms existing textual, visual, and fusion-based methods, achieving up to 4.4× quality improvement over the state-of-the-art baselines by fully exploiting the complementary strengths of both modalities.

[205] VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

Kangrui Wang,Pingyue Zhang,Zihan Wang,Yaning Gao,Linjie Li,Qineng Wang,Hanyang Chen,Chi Wan,Yiping Lu,Zhengyuan Yang,Lijuan Wang,Ranjay Krishna,Jiajun Wu,Li Fei-Fei,Yejin Choi,Manling Li

Main category: cs.AI

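TL;DR: The VAGEN framework uses reinforcement learning to strengthen world-model reasoning in vision-language model (VLM) agents, relying on state estimation and transition modeling for effective reasoning in multi-turn visual tasks.

Details Motivation: VLM agents face the challenge of moving from textual states to complex visual observations, which introduces partial observability and demands robust world modeling.

Contribution: 1. Architecturally enforces and rewards the agent's reasoning process via RL; 2. Shows that the best internal belief representation is task-dependent (natural language versus structured formats); 3. Designs a World Modeling Reward and the Bi-Level GAE method.

Method: Decomposes the agent's reasoning into state estimation and transition modeling and reinforces it with RL under a POMDP formulation; uses Bi-Level GAE for credit assignment.

Result: A 3B-parameter model scores 0.82 across five agent benchmarks, well above its untrained counterpart (0.21) and proprietary models such as GPT-5 (0.75).

Insight: Different tasks call for different internal belief representations; reinforcement learning and reasoning decomposition are key to multi-turn visual agent reasoning.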

Abstract: A key challenge in training Vision-Language Model (VLM) agents, compared to Language Model (LLM) agents, lies in the shift from textual states to complex visual observations. This transition introduces partial observability and demands robust world modeling. We ask: Can VLM agents construct internal world models through explicit visual state reasoning? To address this question, we architecturally enforce and reward the agent’s reasoning process via reinforcement learning (RL), formulating it as a Partially Observable Markov Decision Process (POMDP). We find that decomposing the agent’s reasoning into State Estimation (“what is the current state?”) and Transition Modeling (“what comes next?”) is critical for success, as demonstrated through five reasoning strategies. Our investigation into how agents represent internal beliefs reveals that the optimal representation is task-dependent: Natural Language excels at capturing semantic relationships in general tasks, while Structured formats are indispensable for precise manipulation and control. Building on these insights, we design a World Modeling Reward that provides dense, turn-level supervision for accurate state prediction, and introduce Bi-Level General Advantage Estimation (Bi-Level GAE) for turn-aware credit assignment. Through this form of visual state reasoning, a 3B-parameter model achieves a score of 0.82 across five diverse agent benchmarks, representing a 3× improvement over its untrained counterpart (0.21) and outperforming proprietary reasoning models such as GPT-5 (0.75), Gemini 2.5 Pro (0.67) and Claude 4.5 (0.62). All experiments are conducted within our VAGEN framework, a scalable system for training and analyzing multi-turn VLM agents in diverse visual environments. Code and data are publicly available at https://vagen-ai.github.io.
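For readers unfamiliar with the estimator that Bi-Level GAE builds on, the following is a minimal sketch of the standard, single-level estimator (usually called Generalized Advantage Estimation) over one trajectory. It is not the paper's Bi-Level variant, whose details the abstract does not give; the function name and the default `gamma` and `lam` values are illustrative assumptions.

```python
def generalized_advantage_estimates(rewards, values, gamma=0.99, lam=0.95):
    """Standard single-level GAE over one trajectory.

    rewards: r_0 .. r_{T-1}, one reward per step.
    values:  V(s_0) .. V(s_T), one more entry than rewards; the last
             entry is the bootstrap value of the final state.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next one.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Sparse reward at the end of a three-step episode, zero value estimates.
print(generalized_advantage_estimates([0.0, 0.0, 1.0], [0.0] * 4,
                                      gamma=0.5, lam=1.0))
```

With a reward only on the final step, earlier steps receive geometrically discounted credit. A turn-aware scheme such as the one the abstract names would presumably change how that credit is split across turns and tokens, but that is not specified here.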

[206] Offline Policy Evaluation of Multi-Turn LLM Health Coaching with Real Users

Melik Ozolcer,Sang Won Bae

Main category: cs.AI

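TL;DR: The paper studies an LLM health coach deployed with real users. Offline policy evaluation (OPE) shows that a uniform heavy-tool policy raises average value but harms certain subgroups, such as users with low health literacy and high self-efficacy. Simulation shows that an early information-gain bonus shortens trait identification and improves goal success. The authors recommend freezing the generator, learning subgroup-aware decision heads, and reporting metrics per subgroup so that averages do not mislead.

Details Motivation: The study evaluates a multi-turn LLM health coach with real users and explores how offline policy evaluation can improve decision policies while avoiding harm to specific user subgroups.

Contribution: 1. Evaluates a tool-augmented LLM health coach with real users; 2. Finds that a uniform policy affects user subgroups unevenly; 3. Proposes improving personalized decisions through simulation and per-subgroup metrics.

Method: 1. Uses offline policy evaluation (OPE) to analyze a heavy-tool policy; 2. Builds a lightweight simulator to test how an information-gain bonus affects trait identification and goal success.

Result: The uniform policy raises average value but harms specific subgroups; in simulation, the information-gain bonus shortens trait identification and improves goal success.

Insight: Personalized decision-making needs per-subgroup metrics so that averages do not mislead; an early information-gain bonus is an effective way to improve an LLM health coach.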

Abstract: We study a web-deployed, tool-augmented LLM health coach with real users. In a pilot with seven users (280 rated turns), offline policy evaluation (OPE) over factorized decision heads (Tool/Style) shows that a uniform heavy-tool policy raises average value on logs but harms specific subgroups, most notably low-health-literacy/high-self-efficacy users. A lightweight simulator with hidden archetypes further shows that adding a small early information-gain bonus reliably shortens trait identification and improves goal success and pass@3. Together, these early findings indicate an evaluation-first path to personalization: freeze the generator, learn subgroup-aware decision heads on typed rewards (objective tool outcomes and satisfaction), and always report per-archetype metrics to surface subgroup harms that averages obscure.
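The point that an average can hide subgroup harm can be illustrated with a minimal offline policy evaluation sketch. It uses inverse propensity scoring (IPS), a common OPE estimator; the abstract does not say which estimator the authors use, so this is a swapped-in technique. The log schema, group labels, rewards, and propensities below are invented purely for illustration.

```python
from collections import defaultdict


def ips_value_by_group(logs, target_policy):
    """Inverse-propensity-scored value of target_policy, overall and per group.

    logs: list of dicts with keys
        "group"      - subgroup label (for example a user archetype),
        "action"     - action the logging policy took,
        "propensity" - probability the logging policy gave that action,
        "reward"     - observed reward for the turn.
    target_policy: function (group, action) -> probability of that action
        under the policy being evaluated.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in logs:
        # Reweight each logged turn by how much more (or less) likely
        # the target policy is to take the logged action.
        weight = target_policy(row["group"], row["action"]) / row["propensity"]
        contribution = weight * row["reward"]
        for key in ("overall", row["group"]):
            totals[key] += contribution
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical logs from a uniform-random logging policy over two actions.
example_logs = [
    {"group": "A", "action": "tool", "propensity": 0.5, "reward": 1.0},
    {"group": "A", "action": "no_tool", "propensity": 0.5, "reward": 0.4},
    {"group": "A", "action": "tool", "propensity": 0.5, "reward": 1.0},
    {"group": "A", "action": "no_tool", "propensity": 0.5, "reward": 0.4},
    {"group": "B", "action": "tool", "propensity": 0.5, "reward": 0.2},
    {"group": "B", "action": "no_tool", "propensity": 0.5, "reward": 0.8},
]


def always_tool(group, action):
    """Uniform heavy-tool policy: always call the tool, for every group."""
    return 1.0 if action == "tool" else 0.0


print(ips_value_by_group(example_logs, always_tool))
```

In this toy data the logged mean reward is about 0.63 overall and 0.5 for group B. The always-tool policy is estimated at about 0.73 overall but only 0.2 for group B, so the average improves while one subgroup is worse off, which is the pattern per-archetype reporting is meant to expose.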

[207] MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning

Mir Nafis Sharear Shopnil,Sharad Duwal,Abhishek Tyagi,Adiba Mahbub Proma

Main category: cs.AI

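TL;DR: MIRAGE is a multimodal misinformation detection framework that significantly improves detection performance through stepwise reasoning and web retrieval.

Details Motivation: Misinformation spreads quickly through multimodal content. Traditional supervised models need domain-specific data and generalize poorly, so detection methods that do not require labeled data are needed.

Contribution: Proposes the MIRAGE framework, which decomposes multimodal verification into four modules and combines vision-language model reasoning with web retrieval to improve detection performance significantly.

Method: A stepwise modular pipeline: visual veracity assessment, cross-modal consistency analysis, retrieval-augmented factual checking, and calibrated judgment, using GPT-4o-mini together with web retrieval.

Result: On the MMFakeBench validation set, MIRAGE reaches 81.65% F1 and 75.1% accuracy, beating the strongest zero-shot baseline by 7.65 points; test-set results are similar (81.44% F1).

Insight: Stepwise reasoning with web retrieval can stand in for domain-specific training data and significantly improves multimodal misinformation detection.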

Abstract: Misinformation spreads across web platforms through billions of daily multimodal posts that combine text and images, overwhelming manual fact-checking capacity. Supervised detection models require domain-specific training data and fail to generalize across diverse manipulation tactics. We present MIRAGE, an inference-time, model-pluggable agentic framework that decomposes multimodal verification into four sequential modules: visual veracity assessment detects AI-generated images, cross-modal consistency analysis identifies out-of-context repurposing, retrieval-augmented factual checking grounds claims in web evidence through iterative question generation, and a calibrated judgment module integrates all signals. MIRAGE orchestrates vision-language model reasoning with targeted web retrieval and outputs structured, citation-linked rationales. On the MMFakeBench validation set (1,000 samples), MIRAGE with GPT-4o-mini achieves 81.65% F1 and 75.1% accuracy, outperforming the strongest zero-shot baseline (GPT-4V with MMD-Agent at 74.0% F1) by 7.65 points while maintaining a 34.3% false positive rate versus 97.3% for a judge-only baseline. Test set results (5,000 samples) confirm generalization with 81.44% F1 and 75.08% accuracy. Ablation studies show visual verification contributes 5.18 F1 points and retrieval-augmented reasoning contributes 2.97 points. Our results demonstrate that decomposed agentic reasoning with web retrieval can match supervised detector performance without domain-specific training, enabling misinformation detection across modalities where labeled data remains scarce.

[208] Reasoning Distillation and Structural Alignment for Improved Code Generation

Amir Jalilifard,Anderson de Rezende Rocha,Marcos Medeiros Raimundo

Main category: cs.AI

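TL;DR: This paper uses knowledge distillation to transfer the reasoning ability of a large language model to a smaller model and adds a structure-aligned loss, which significantly improves code generation performance.

Details Motivation: Code generation requires a model not only to understand the intent of the prompt but also to apply algorithmic reasoning to produce correct solutions. Large language models have this reasoning ability, but small models often lack it. The authors therefore aim to transfer it to a small model, improving performance while lowering cost.

Contribution: 1. Proposes a method that combines knowledge distillation with structural alignment to transfer the reasoning ability of a large language model to a small one; 2. Introduces structure-aware loss optimization, so that the model learns not only to produce correct token sequences but also to understand the overall structure of a solution.

Method: 1. Trains the small model by knowledge distillation to emulate the large model's reasoning and problem-solving abilities; 2. Designs a structure-aware loss function that establishes a structural correspondence between problem definitions and solutions.

Result: On the MBPP, MBPP Plus, and HumanEval benchmarks, the method significantly outperforms the baseline model on pass@1, average data flow, and average syntax match.

Insight: 1. With distillation and structural alignment, a small model can approach the reasoning ability of a large one; 2. Structure-aware loss optimization effectively improves the model's grasp of the overall structure of a solution.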

Abstract: Effective code generation with language models hinges on two critical factors: accurately understanding the intent of the prompt and generating code that applies algorithmic reasoning to produce correct solutions capable of passing diverse test cases while adhering to the syntax of the target programming language. Unlike other language tasks, code generation requires more than accurate token prediction; it demands comprehension of solution-level and structural relationships rather than merely generating the most likely tokens. Very large language models (VLLMs) are capable of generating detailed steps toward the correct solution of complex tasks where reasoning is crucial in solving the problem. Such reasoning capabilities may be absent in smaller language models. Therefore, in this work, we distill the reasoning capabilities of a VLLM into a smaller, more efficient model that is faster and cheaper to deploy. Our approach trains the model to emulate the reasoning and problem-solving abilities of the VLLM by learning to identify correct solution pathways and establishing a structural correspondence between problem definitions and potential solutions through a novel method of structure-aware loss optimization. This enables the model to transcend token-level generation and to deeply grasp the overarching structure of solutions for given problems. Experimental results show that our fine-tuned model, developed through a cheap and simple-to-implement process, significantly outperforms our baseline model in terms of pass@1, average data flow, and average syntax match metrics across the MBPP, MBPP Plus, and HumanEval benchmarks.
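
For readers unfamiliar with the pass@1 metric reported above, the sketch below shows the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled solutions of which c pass all tests. Whether this paper computes pass@1 with this estimator or with a single greedy sample is not stated in the abstract, so treat it as an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated solutions (c of them correct), passes all tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 the estimator reduces to the fraction of correct samples.
print(pass_at_k(n=10, c=3, k=1))
```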

[209] LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena

Qingchuan Yang,Simon Mahns,Sida Li,Anri Gu,Jibang Wu,Haifeng Xu

Main category: cs.AI

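TL;DR: This paper proposes a new paradigm, “LLM-as-a-Prophet”, and builds the Prophet Arena benchmark to systematically study how well LLMs forecast future events. LLMs perform well in some respects but also face key bottlenecks.

Details Motivation: Forecasting is an important topic in fields such as social science and finance. LLMs trained on massive data have the potential to forecast future events, and this paper explores that potential.

Contribution: Proposes the “LLM-as-a-Prophet” paradigm and builds the Prophet Arena benchmark, providing the first systematic evaluation of LLMs on real-world event forecasting.

Method: Prophet Arena collects live forecasting tasks and decomposes each task into multiple stages, supporting controlled experiments and large-scale evaluation.

Result: LLMs forecast well (e.g., small calibration errors and consistent prediction confidence) but show bottlenecks such as inaccurate event recall and misunderstanding of data sources.

Insight: LLMs show great potential for forecasting, but they need better information aggregation and data understanding before they can rival market forecasts.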

Abstract: Forecasting is not only a fundamental intellectual pursuit but is also of significant importance to societal systems such as finance and economics. The rapid advance of large language models (LLMs) trained on Internet-scale data raises the promise of employing LLMs to forecast real-world future events, an emerging paradigm we call “LLM-as-a-Prophet”. This paper systematically investigates such predictive intelligence of LLMs. To this end, we build Prophet Arena, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, in order to support our controlled and large-scale experimentation. Our comprehensive evaluation reveals that many LLMs already exhibit impressive forecasting capabilities, reflected in, e.g., their small calibration errors, consistent prediction confidence and promising market returns. However, we also uncover key bottlenecks towards achieving superior predictive intelligence via LLM-as-a-Prophet, such as LLMs’ inaccurate event recalls, misunderstanding of data sources and slower information aggregation compared to markets when resolution nears.
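
The abstract cites small calibration errors as evidence of forecasting ability. As a rough illustration of what such a number measures, the sketch below computes a binned expected calibration error (ECE) and a Brier score for binary event forecasts. The equal-width binning and the choice of these two metrics are assumptions; the abstract does not say which calibration measure Prophet Arena uses.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average gap between mean forecast and observed frequency,
    taken over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_forecast = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_forecast - observed_rate)
    return ece

# Hypothetical forecasts: five events predicted at 80%, four of which occurred.
well_calibrated = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])
# Hypothetical overconfident forecasts: two events predicted at 90%, none occurred.
overconfident = expected_calibration_error([0.9, 0.9], [0, 0])
print(well_calibrated, overconfident)
```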

[210] Contextual Attention Modulation: Towards Efficient Multi-Task Adaptation in Large Language Models

Dayan Pan,Zhaoyang Fu,Jingyuan Wang,Xiao Han,Yue Zhu,Xiangyu Zhao

Main category: cs.AI

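TL;DR: This paper proposes Contextual Attention Modulation (CAM), a mechanism that addresses multi-task adaptation in large language models (LLMs) by dynamically modulating the representations of self-attention modules. The HyCAM framework, which adds a dynamic routing strategy, delivers significant performance gains.

Details Motivation: LLMs struggle to balance knowledge retention with task specialization in multi-task adaptation. Conventional fine-tuning is prone to catastrophic forgetting and consumes substantial resources, and existing parameter-efficient methods underperform in complex multi-task scenarios.

Contribution: Proposes the Contextual Attention Modulation (CAM) mechanism and the HyCAM framework, which combines shared and specialized CAM modules with a dynamic routing strategy for efficient multi-task adaptation.

Method: CAM dynamically modulates the representations of self-attention modules. HyCAM combines a shared full-parameter CAM module with multiple lightweight specialized CAM modules and fuses knowledge adaptively through dynamic routing.

Result: Experiments on heterogeneous tasks, including question answering, code generation, and logical reasoning, show that HyCAM improves average performance by 3.65% and significantly outperforms existing methods.

Insight: By dynamically adjusting attention features, CAM balances task specialization against knowledge retention; HyCAM's multi-module structure and dynamic routing offer a new approach to multi-task adaptation.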

Abstract: Large Language Models (LLMs) possess remarkable generalization capabilities but struggle with multi-task adaptation, particularly in balancing knowledge retention with task-specific specialization. Conventional fine-tuning methods suffer from catastrophic forgetting and substantial resource consumption, while existing parameter-efficient methods perform suboptimally in complex multi-task scenarios. To address this, we propose Contextual Attention Modulation (CAM), a novel mechanism that dynamically modulates the representations of self-attention modules in LLMs. CAM enhances task-specific features while preserving general knowledge, thereby facilitating more effective and efficient adaptation. For effective multi-task adaptation, CAM is integrated into our Hybrid Contextual Attention Modulation (HyCAM) framework, which combines a shared, full-parameter CAM module with multiple specialized, lightweight CAM modules, enhanced by a dynamic routing strategy for adaptive knowledge fusion. Extensive experiments on heterogeneous tasks, including question answering, code generation, and logical reasoning, demonstrate that our approach significantly outperforms existing approaches, achieving an average performance improvement of 3.65%. The implemented code and data are available to ease reproducibility at https://github.com/Applied-Machine-Learning-Lab/HyCAM.

[211] Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs

Zhining Liu,Ziyi Chen,Hui Liu,Chen Luo,Xianfeng Tang,Suhang Wang,Joy Zeng,Zhenwei Dai,Zhan Shi,Tianxin Wei,Benoit Dumoulin,Hanghang Tong

Main category: cs.AI

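TL;DR: This paper examines the disconnect between perception and reasoning in vision-language models (VLMs) on visual question answering and proposes a training-free attention intervention that improves performance.

Details Motivation: Although VLMs perform well on multimodal tasks, they often give wrong answers even when the visual evidence is present. This paper investigates whether such errors stem from failing to perceive the evidence or from failing to use it.

Contribution: 1. Reveals the “seeing but not believing” phenomenon in VLMs; 2. Proposes an inference-time intervention based on selective attention masking; 3. Validates the method across several major VLM families.

Method: A layer-wise analysis of attention dynamics shows that shallow layers focus on text while deeper layers sparsely but reliably attend to visual evidence regions. Attention-based masking then highlights the regions attended by deep layers to explicitly reinforce evidence use.

Result: The method significantly improves accuracy on VLMs such as LLaVA and Qwen, indicating that the models encode reliable evidence internally but under-utilize it.

Insight: VLM failures stem more from under-using evidence during reasoning than from failing to perceive it; making the evidence explicit can bridge the gap between perception and reasoning.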

Abstract: Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence when outputting incorrect answers, a phenomenon we term “seeing but not believing” that widely exists in major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it; making such signals explicit can bridge the gap between perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.

cs.HC [Back]

[212] Detecting and Preventing Harmful Behaviors in AI Companions: Development and Evaluation of the SHIELD Supervisory System

Ziv Ben-Zion,Paul Raffelhüschen,Max Zettl,Antonia Lüönd,Achim Burrer,Philipp Homan,Tobias R Spiller

Main category: cs.HC

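TL;DR: The paper presents SHIELD, a supervisory system that detects and prevents harmful behaviors in AI companions. Built on an LLM, it substantially reduces concerning content while preserving appropriate interactions.

Details Motivation: AI companions are increasingly common in daily life, but existing safety systems tend to overlook early-stage harmful behaviors that can lead to unhealthy emotional dependence or social isolation. SHIELD aims to fill this gap.

Contribution: Develops SHIELD, which detects and mitigates five dimensions of harmful behavior, such as emotional over-attachment, and releases its materials as open source to support research and deployment.

Method: SHIELD is LLM-based, uses a specific system prompt to detect the five harmful behaviors, and is evaluated on a 100-item synthetic conversation benchmark.

Result: SHIELD reduces concerning content from 10-16% to 3-8%, with 59% sensitivity and 95% specificity, while preserving 95% of appropriate interactions.

Insight: Transparent, deployable supervisory systems can address emotional manipulation in AI companions, and the open-source tools support future research.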

Abstract: AI companions powered by large language models (LLMs) are increasingly integrated into users’ daily lives, offering emotional support and companionship. While existing safety systems focus on overt harms, they rarely address early-stage problematic behaviors that can foster unhealthy emotional dynamics, including over-attachment or reinforcement of social isolation. We developed SHIELD (Supervisory Helper for Identifying Emotional Limits and Dynamics), an LLM-based supervisory system with a specific system prompt that detects and mitigates risky emotional patterns before escalation. SHIELD targets five dimensions of concern: (1) emotional over-attachment, (2) consent and boundary violations, (3) ethical roleplay violations, (4) manipulative engagement, and (5) social isolation reinforcement. These dimensions were defined based on media reports, academic literature, existing AI risk frameworks, and clinical expertise in unhealthy relationship dynamics. To evaluate SHIELD, we created a 100-item synthetic conversation benchmark covering all five dimensions of concern. Testing across five prominent LLMs (GPT-4.1, Claude Sonnet 4, Gemma 3 1B, Kimi K2, Llama Scout 4 17B) showed that the baseline rate of concerning content (10-16%) was significantly reduced with SHIELD (to 3-8%), a 50-79% relative reduction, while preserving 95% of appropriate interactions. The system achieved 59% sensitivity and 95% specificity, with adaptable performance via prompt engineering. This proof-of-concept demonstrates that transparent, deployable supervisory systems can address subtle emotional manipulation in AI companions. Most development materials, including prompts, code, and evaluation methods, are made available as open-source materials for research, adaptation, and deployment.
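
The sensitivity, specificity, and relative-reduction figures above follow standard definitions, sketched below. The confusion counts and rates in the example are hypothetical values chosen only to exercise the formulas; they are not the paper's data.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of truly concerning items that the supervisor flags."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    """Share of appropriate items that the supervisor leaves alone."""
    return true_neg / (true_neg + false_pos)

def relative_reduction(baseline_rate: float, supervised_rate: float) -> float:
    """Fractional drop in the rate of concerning content."""
    return (baseline_rate - supervised_rate) / baseline_rate

# Hypothetical counts: 10 concerning items (6 flagged), 20 appropriate (18 kept).
print(sensitivity(true_pos=6, false_neg=4))
print(specificity(true_neg=18, false_pos=2))
# Hypothetical rates: concerning content falls from 10% to 3% of responses.
print(relative_reduction(0.10, 0.03))
```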

[213] HealthDial: A No-Code LLM-Assisted Dialogue Authoring Tool for Healthcare Virtual Agents

Farnaz Nouraei,Zhuorui Yong,Timothy Bickmore

Main category: cs.HC

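TL;DR: HealthDial is a no-code dialogue authoring tool that uses large language models (LLMs) to generate multi-turn dialogue content for healthcare virtual agents and lets authors edit it in a no-code interface to keep the content safe and effective.

Details Motivation: Healthcare providers and educators need a tool for quickly and safely creating virtual-agent dialogue for patient health education and counseling.

Contribution: Develops HealthDial, which combines LLM capabilities with a no-code interface so that non-technical users can efficiently create and validate virtual-agent dialogue.

Method: LLMs automatically generate an initial dialogue plan from text materials, and the content is output as finite state machines to ensure verifiability and safety. Authors can edit the dialogue in a no-code interface.

Result: A feasibility study shows that HealthDial helps users cover their health materials effectively and produce clear, actionable virtual-agent dialogue.

Insight: LLMs have great potential for healthcare dialogue authoring, but tool support is needed to ensure content safety and author control.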

Abstract: We introduce HealthDial, a dialogue authoring tool that helps healthcare providers and educators create virtual agents that deliver health education and counseling to patients over multiple conversations. HealthDial leverages large language models (LLMs) to automatically create an initial session-based plan and conversations for each session using text-based patient health education materials as input. Authored dialogue is output in the form of finite state machines for virtual agent delivery so that all content can be validated and no unsafe advice is provided resulting from LLM hallucinations. LLM-drafted dialogue structure and language can be edited by the author in a no-code user interface to ensure validity and optimize clarity and impact. We conducted a feasibility and usability study with counselors and students to test our approach with an authoring task for cancer screening education. Participants used HealthDial and then tested their resulting dialogue by interacting with a 3D-animated virtual agent delivering the dialogue. Through participants’ evaluations of the task experience and final dialogues, we show that HealthDial provides a promising first step for counselors to ensure full coverage of their health education materials, while creating understandable and actionable virtual agent dialogue with patients.
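
Delivering dialogue as a finite state machine means the agent can only speak text that an author has already reviewed. The sketch below illustrates that idea with a dictionary-based state machine; the state names, utterances, and the `validate`/`run` helpers are hypothetical and do not reflect HealthDial's actual data format.

```python
# Hypothetical authored dialogue: each utterance is fixed, author-approved text.
DIALOGUE = {
    "greet": {
        "say": "Hello. Would you like to learn about screening?",
        "next": {"yes": "explain", "no": "goodbye"},
    },
    "explain": {
        "say": "[author-approved screening information goes here]",
        "next": {"ok": "goodbye"},
    },
    "goodbye": {"say": "Thank you for your time.", "next": {}},
}

def validate(fsm, start):
    """True if every transition targets a defined state and every state
    is reachable from the start state."""
    if start not in fsm:
        return False
    for state in fsm.values():
        if any(target not in fsm for target in state["next"].values()):
            return False
    seen, stack = set(), [start]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(fsm[name]["next"].values())
    return seen == set(fsm)

def run(fsm, start, choices):
    """Replay user choices and return the utterances the agent speaks."""
    state = start
    spoken = [fsm[state]["say"]]
    for choice in choices:
        options = fsm[state]["next"]
        if choice not in options:
            raise ValueError(f"choice {choice!r} is not allowed in state {state!r}")
        state = options[choice]
        spoken.append(fsm[state]["say"])
    return spoken

transcript = run(DIALOGUE, "greet", ["yes", "ok"])
print(validate(DIALOGUE, "greet"), transcript)
```

Because the set of states is finite and explicit, an author or a script can check every utterance and every path before the agent is deployed.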

[214] Conveying Meaning through Gestures: An Investigation into Semantic Co-Speech Gesture Generation

Hendric Voss,Lisa Michelle Bohnenkamp,Stefan Kopp

Main category: cs.HC

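TL;DR: This study compares two co-speech gesture generation frameworks, AQ-GT and its semantically augmented variant AQ-GT-a. AQ-GT, which lacks explicit semantic input, is more effective within its training domain, while AQ-GT-a generalizes better, particularly for representing shape and size in novel contexts.

Details Motivation: To explore how semantic annotation affects gesture generation and how humans perceive the meaning and naturalness of the resulting gestures.

Contribution: Evaluates two co-speech gesture generation frameworks, reveals a nuanced relationship between semantic annotation and performance, and identifies the limits of semantic enrichment.

Method: A user evaluation of the AQ-GT and AQ-GT-a frameworks using sentences from the SAGA corpus, contextually similar sentences, and novel movement-focused sentences.

Result: AQ-GT is more effective within its training domain; AQ-GT-a generalizes better but is not perceived as more human-like.

Insight: Explicit semantic enrichment does not necessarily improve gesture generation. Its effect depends heavily on context, with a trade-off between specialization and generalization.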

Abstract: This study explores two frameworks for co-speech gesture generation, AQ-GT and its semantically-augmented variant AQ-GT-a, to evaluate their ability to convey meaning through gestures and how humans perceive the resulting movements. Using sentences from the SAGA spatial communication corpus, contextually similar sentences, and novel movement-focused sentences, we conducted a user-centered evaluation of concept recognition and human-likeness. Results revealed a nuanced relationship between semantic annotations and performance. The original AQ-GT framework, lacking explicit semantic input, was surprisingly more effective at conveying concepts within its training domain. Conversely, the AQ-GT-a framework demonstrated better generalization, particularly for representing shape and size in novel contexts. While participants rated gestures from AQ-GT-a as more expressive and helpful, they did not perceive them as more human-like. These findings suggest that explicit semantic enrichment does not guarantee improved gesture generation and that its effectiveness is highly dependent on the context, indicating a potential trade-off between specialization and generalization.

[215] ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input

Hendric Voss,Stefan Kopp

Main category: cs.HC

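TL;DR: This paper presents ImaGGen, a zero-shot system that generates semantically meaningful co-speech gestures from language and image input, addressing the limitation that existing methods produce only simple beat gestures.

Details Motivation: Existing co-speech gesture generation methods produce only simple gestures that accompany the rhythm of speech and struggle to convey semantic information. This work aims to generate iconic or deictic gestures that complement the meaning of the spoken language, making virtual agents or avatars more expressive.

Contribution: 1. Proposes the first zero-shot system that combines language and image input to generate semantic gestures. 2. Designs an image analysis pipeline and a semantic matching module that extract visual features and link them to language. 3. Uses an inverse kinematics engine to generate semantic gestures together with natural beat gestures.

Method: 1. An image analysis pipeline extracts key properties such as object shape and symmetry. 2. A semantic matching module links the visual details to the language input. 3. An inverse kinematics engine synthesizes semantic and beat gestures for multimodal communication.

Result: A user study shows that, with gesture support, participants understood ambiguous speech significantly better, confirming the interpretability and communicative value of the gestures.

Insight: Semantic gestures matter for the expressiveness of virtual agents and for efficient human-agent interaction; representing complex shapes still leaves room for improvement.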

Abstract: Human communication combines speech with expressive nonverbal cues such as hand gestures that serve manifold communicative functions. Yet, current generative gesture generation approaches are restricted to simple, repetitive beat gestures that accompany the rhythm of speaking but do not contribute to communicating semantic meaning. This paper tackles a core challenge in co-speech gesture synthesis: generating iconic or deictic gestures that are semantically coherent with a verbal utterance. Such gestures cannot be derived from language input alone, which inherently lacks the visual meaning that is often carried autonomously by gestures. We therefore introduce a zero-shot system that generates gestures from a given language input and additionally is informed by imagistic input, without manual annotation or human intervention. Our method integrates an image analysis pipeline that extracts key object properties such as shape, symmetry, and alignment, together with a semantic matching module that links these visual details to spoken text. An inverse kinematics engine then synthesizes iconic and deictic gestures and combines them with co-generated natural beat gestures for coherent multimodal communication. A comprehensive user study demonstrates the effectiveness of our approach. In scenarios where speech alone was ambiguous, gestures generated by our system significantly improved participants’ ability to identify object properties, confirming their interpretability and communicative value. While challenges remain in representing complex shapes, our results highlight the importance of context-aware semantic gestures for creating expressive and collaborative virtual agents or avatars, marking a substantial step forward towards efficient and robust, embodied human-agent interaction. More information and example videos are available here: https://review-anon-io.github.io/ImaGGen.github.io/

cs.MA [Back]

[216] Prompt Optimization via Retrieved Reasoning Assets and Multi-Agent Analysis

Wonduk Seo,Juhyeon Lee,Junseo Koh,Hyunjin An,Jian Park,Seunghyun Lee,Haihua Chen,Yi Bu

Main category: cs.MA

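TL;DR: MA-SAPO is a multi-agent prompt optimization framework that couples evaluation outcomes with structured reasoning to guide systematic prompt edits, giving a transparent and controllable optimization process.

Details Motivation: Existing prompt optimization methods rely on trial and error and black-box evaluation, so they lack interpretability and controllability. MA-SAPO aims to improve the transparency and effectiveness of prompt optimization by having multiple agents collaboratively produce interpretable reasoning chains.

Contribution: 1) Proposes the MA-SAPO framework, in which collaborating agents produce interpretable prompt refinements; 2) Introduces a two-stage approach (Reasoning Phase and Test Phase) that stores and reuses reasoning assets; 3) Outperforms single-pass prompting and existing multi-agent strategies in experiments.

Method: MA-SAPO has a Reasoning Phase and a Test Phase. In the Reasoning Phase, agents collaboratively explain scores, diagnose weaknesses, and generate refinement suggestions; in the Test Phase, they reuse these reasoning assets to make evidence-grounded prompt edits.

Result: On the HelpSteer1/2 benchmarks, MA-SAPO outperforms single-pass prompting, retrieval-augmented baselines, and existing multi-agent methods.

Insight: 1) Structured reasoning makes prompt optimization more transparent; 2) Multi-agent collaboration can capture complex optimization needs; 3) Reusing reasoning assets improves controllability and efficiency.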

Abstract: Prompt optimization has emerged as an effective alternative to retraining for improving the performance of Large Language Models (LLMs). However, most existing approaches treat evaluation as a black box, relying solely on numerical scores while offering limited insight into why a prompt succeeds or fails. They also depend heavily on trial-and-error refinements, which are difficult to interpret and control. In this paper, we introduce MA-SAPO, a Multi-Agent framework for Score-Aware Prompt Optimization. Compared to prior methods, MA-SAPO explicitly couples evaluation outcomes with structured reasoning to guide systematic edits. The framework specifically consists of two stages: during the Reasoning Phase, agents collaboratively explain metric scores, diagnose weaknesses, and synthesize targeted refinements that are stored as reusable reasoning assets; during the Test Phase, agents retrieve these assets to analyze optimized prompts and apply only evidence-grounded edits. By turning evaluation signals into interpretable reasoning chains, MA-SAPO produces prompt refinements that are more transparent, auditable, and controllable. Experiments on the HelpSteer1/2 benchmarks demonstrate consistent improvements over single-pass prompting, retrieval-augmented baselines, and prior multi-agent strategies, validating the effectiveness of our approach.

cs.CR [Back]

[217] PrivacyPAD: A Reinforcement Learning Framework for Dynamic Privacy-Aware Delegation

Zheng Hui,Yijiang River Dong,Sanhanat Sivapiromrat,Ehsan Shareghi,Nigel Collier

Main category: cs.CR

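TL;DR: PrivacyPAD is a reinforcement learning framework that dynamically balances privacy protection and task performance for user queries by routing text chunks to achieve the best privacy-utility trade-off.

Details Motivation: Users of large language models (LLMs) often face privacy leakage risks, and conventional static methods break linguistic coherence and remove sensitive information indiscriminately.

Contribution: Proposes PrivacyPAD, a reinforcement-learning-based dynamic privacy protection framework, and introduces a medical dataset with high PII density to validate it.

Method: Trains an agent with reinforcement learning to route text chunks dynamically, distinguishing replaceable PII from task-critical PII to balance privacy and performance.

Result: Achieves a new state of the art on the privacy-utility trade-off, demonstrating the need for adaptive policies in sensitive environments.

Insight: Dynamic policies suit privacy-sensitive settings better than static methods, and handling task-critical PII strategically can significantly improve performance.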

Abstract: When users submit queries to Large Language Models (LLMs), their prompts can often contain sensitive data, forcing a difficult choice: send the query to a powerful proprietary LLM provider to achieve state-of-the-art performance and risk data exposure, or rely on a smaller, local model that guarantees data privacy but often degrades task performance. Prior approaches have relied on static pipelines that use LLM rewriting, which shatters linguistic coherence and indiscriminately removes privacy-sensitive information, including task-critical content. We reformulate this challenge (Privacy-Conscious Delegation) as a sequential decision-making problem and introduce a novel reinforcement learning (RL) framework called PrivacyPAD to solve it. Our framework trains an agent to dynamically route text chunks, learning a policy that optimally balances the trade-off between privacy leakage and task performance. It implicitly distinguishes between replaceable Personally Identifiable Information (PII) (which it shields locally) and task-critical PII (which it strategically sends to the remote model for maximal utility). To validate our approach in complex scenarios, we also introduce a new medical dataset with high PII density. Our framework achieves a new state-of-the-art on the privacy-utility frontier, demonstrating the necessity of learned, adaptive policies for deploying LLMs in sensitive environments.

[218] VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

Qilin Liao,Anamika Lochab,Ruqi Zhang

Main category: cs.CR

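TL;DR: VERA-V is a variational inference framework for discovering multimodal jailbreak vulnerabilities in vision-language models (VLMs); it learns a joint posterior distribution over paired text-image prompts to generate stealthy adversarial inputs.

Details Motivation: Existing multimodal red-teaming methods rely on brittle templates, focus on single-attack settings, and expose only a limited set of vulnerabilities. VERA-V aims to explore the potential security vulnerabilities of VLMs more comprehensively.

Contribution: Proposes the VERA-V framework, which uses variational inference to learn a joint posterior over text-image prompts and generate adversarial samples efficiently, and combines three strategies: text prompts, image synthesis, and attention distraction.

Method: A variational inference framework learns the posterior distribution, and a lightweight attacker is trained to generate diverse adversarial prompts. This is combined with text prompts that embed harmful cues, diffusion-based adversarial image synthesis, and structured distractors that fragment VLM attention.

Result: On the HarmBench and HADES benchmarks, VERA-V significantly outperforms existing methods, with an attack success rate (ASR) on GPT-4o up to 53.75% higher than the best baseline.

Insight: The multimodal design of VLMs introduces new vulnerabilities, and jointly learning adversarial text and image prompts bypasses model defenses more effectively.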

Abstract: Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.

[219] ISO/IEC-Compliant Match-on-Card Face Verification with Short Binary Templates

Abdelilah Ganmati,Karim Afdel,Lahcen Koutti

Main category: cs.CR

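TL;DR: This paper presents a practical match-on-card design for face verification: PCA-ITQ produces compact 64/128-bit templates, which are compared on-card via constant-time Hamming distance. The design complies with ISO/IEC standards, using fixed-length payloads and status words that leak no score, and experiments show strong results across bit rates and false accept rates.

Details Motivation: Current on-card face verification systems suffer from large templates, high computational complexity, and privacy leakage. This paper aims to design an ISO/IEC-compliant solution that uses short binary templates and constant-time matching to improve efficiency and protect privacy.

Contribution: 1. Introduces compact 64/128-bit templates generated by PCA-ITQ; 2. Specifies ISO/IEC-compliant APDU commands with fixed payloads; 3. Proposes constant-time Hamming distance matching; 4. Demonstrates efficient performance at low bit rates.

Method: 1. Generates short binary templates with PCA-ITQ; 2. Matches on-card via constant-time Hamming distance; 3. Designs fixed-payload APDU commands and status words that leak no score; 4. Validates performance and privacy protection experimentally.

Result: At FAR = 1%, both the 64-bit and 128-bit templates reach TPR = 0.836, and the 128-bit template has a lower EER. Even at the slowest contact rate (9.6 kbps), total verification time is 43.9 ms (64-bit) and 52.3 ms (128-bit).

Insight: Short binary templates and constant-time matching can meet privacy requirements while maintaining performance. Future work includes multi-dataset evaluation and hardware-level timing measurements.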

Abstract: We present a practical match-on-card design for face verification in which compact 64/128-bit templates are produced off-card by PCA-ITQ and compared on-card via constant-time Hamming distance. We specify ISO/IEC 7816-4 and 14443-4 command APDUs with fixed-length payloads and decision-only status words (no score leakage), together with a minimal per-identity EEPROM map. Using real binary codes from a CelebA working set (55 identities, 412 images), we (i) derive operating thresholds from ROC/DET, (ii) replay enroll->verify transactions at those thresholds, and (iii) bound end-to-end time by pure link latency plus a small constant on-card budget. Even at the slowest contact rate (9.6 kbps), total verification time is 43.9 ms (64 b) and 52.3 ms (128 b); at 38.4 kbps both are <14 ms. At FAR = 1%, both code lengths reach TPR = 0.836, while 128 b lowers EER relative to 64 b. An optional +6 B helper (targeted symbol-level parity over empirically unstable bits) is latency-negligible. Overall, short binary templates, fixed-payload decision-only APDUs, and constant-time matching satisfy ISO/IEC transport constraints with wide timing margin and align with ISO/IEC 24745 privacy goals. Limitations: single-dataset evaluation and design-level (pre-hardware) timing; we outline AgeDB/CFP-FP and on-card microbenchmarks as next steps.
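
The on-card comparison described above amounts to a Hamming distance between two fixed-length binary codes followed by a thresholded, decision-only result. The sketch below shows that logic for a 64-bit (8-byte) template. The threshold value and template bytes are hypothetical, and Python gives no real constant-time guarantee; the loop simply avoids early exit to mirror the idea. A real card would implement this in its own firmware.

```python
def hamming_distance(enrolled: bytes, probe: bytes) -> int:
    """Count differing bits; always scans every byte (no early exit)."""
    if len(enrolled) != len(probe):
        raise ValueError("templates must have the same fixed length")
    distance = 0
    for a, b in zip(enrolled, probe):
        distance += bin(a ^ b).count("1")
    return distance

def verify(enrolled: bytes, probe: bytes, threshold: int) -> bool:
    """Decision-only result: the caller never sees the distance itself."""
    return hamming_distance(enrolled, probe) <= threshold

# Hypothetical 64-bit templates and threshold, for illustration only.
enrolled = bytes.fromhex("00ff00ff00ff00ff")
near_probe = bytes.fromhex("00ff00ff00ff00fe")   # differs in one bit
far_probe = bytes(b ^ 0xFF for b in enrolled)    # differs in all 64 bits
THRESHOLD = 10

print(verify(enrolled, near_probe, THRESHOLD), verify(enrolled, far_probe, THRESHOLD))
```

Returning only the accept or reject decision, rather than the distance, corresponds to the decision-only status words in the abstract: an attacker cannot use the score to refine guesses.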

[220] Patronus: Safeguarding Text-to-Image Models against White-Box Adversaries

Xinfeng Li,Shengyuan Pang,Jialin Wu,Jiangyi Deng,Huanlong Zhong,Yanjiao Chen,Jie Zhang,Wenyuan Xu

Main category: cs.CR

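TL;DR: Patronus is a defense framework against white-box adversaries that protects text-to-image models through an internal moderator and a non-fine-tunable learning mechanism.

Details Motivation: Existing safety measures for text-to-image models fail against white-box adversaries (e.g., fine-tuning attacks), so a comprehensive defense mechanism is urgently needed.

Contribution: Proposes the Patronus framework, comprising an internal moderator and a non-fine-tunable learning mechanism, which effectively resists attacks by white-box adversaries.

Method: 1. An internal moderator decodes unsafe input features into zero vectors; 2. A non-fine-tunable learning mechanism strengthens model alignment.

Result: Experiments confirm that Patronus keeps performance intact on safe content generation, effectively rejects unsafe content, and is robust to fine-tuning attacks.

Insight: Combining feature decoding with model alignment offers a comprehensive approach to defending against white-box adversaries.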

Abstract: Text-to-image (T2I) models, though exhibiting remarkable creativity in image generation, can be exploited to produce unsafe images. Existing safety measures, e.g., content moderation or model alignment, fail in the presence of white-box adversaries who know and can adjust model parameters, e.g., by fine-tuning. This paper presents a novel defensive framework, named Patronus, which equips T2I models with holistic protection to defend against white-box adversaries. Specifically, we design an internal moderator that decodes unsafe input features into zero vectors while ensuring the decoding performance of benign input features. Furthermore, we strengthen the model alignment with a carefully designed non-fine-tunable learning mechanism, ensuring the T2I model will not be compromised by malicious fine-tuning. We conduct extensive experiments to validate the intactness of the performance on safe content generation and the effectiveness of rejecting unsafe content generation. Results also confirm the resilience of Patronus against various fine-tuning attacks by white-box adversaries.

cs.RO [Back]

[221] What Questions Should Robots Be Able to Answer? A Dataset of User Questions for Explainable Robotics

Lennart Wachowiak,Andrew Coles,Gerard Canal,Oya Celiktutan

Main category: cs.RO

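TL;DR: The paper introduces a dataset of 1,893 user questions organized into 12 categories and 70 subcategories, showing the kinds of questions household robots need to answer and finding differences between the questions of novices and experienced users.

Details Motivation: As large language models and conversational interfaces spread in human-robot interaction, robots' ability to answer user questions becomes more important. Existing work mostly focuses on why-questions and overlooks the diversity of user questions.

Contribution: Provides a diverse dataset of user questions; reveals which types of questions users want to ask robots; finds differences in what novices and experienced users ask.

Method: Using 15 video stimuli and 7 text stimuli, collects questions from 100 participants about household robots in varied situations, then compiles and categorizes them into a dataset.

Result: The most frequent questions concern task execution details (22.5%) and robot capabilities (12.7%); novices focus more on simple facts, while experienced users focus more on how complex situations are handled.

Insight: Robots need to answer a broader range of questions than why-questions; conversational interface design should account for differences in user background; the dataset can inform what robots log and help benchmark question-answering modules.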

Abstract: With the growing use of large language models and conversational interfaces in human-robot interaction, robots’ ability to answer user questions is more important than ever. We therefore introduce a dataset of 1,893 user questions for household robots, collected from 100 participants and organized into 12 categories and 70 subcategories. Most work in explainable robotics focuses on why-questions. In contrast, our dataset provides a wide variety of questions, from questions about simple execution details to questions about how the robot would act in hypothetical scenarios – thus giving roboticists valuable insights into what questions their robot needs to be able to answer. To collect the dataset, we created 15 video stimuli and 7 text stimuli, depicting robots performing varied household tasks. We then asked participants on Prolific what questions they would want to ask the robot in each portrayed situation. In the final dataset, the most frequent categories are questions about task execution details (22.5%), the robot’s capabilities (12.7%), and performance assessments (11.3%). Although questions about how robots would handle potentially difficult scenarios and ensure correct behavior are less frequent, users rank them as the most important for robots to be able to answer. Moreover, we find that users who identify as novices in robotics ask different questions than more experienced users. Novices are more likely to inquire about simple facts, such as what the robot did or the current state of the environment. As robots enter environments shared with humans and language becomes central to giving instructions and interaction, this dataset provides a valuable foundation for (i) identifying the information robots need to log and expose to conversational interfaces, (ii) benchmarking question-answering modules, and (iii) designing explanation strategies that align with user expectations.

[222] NEBULA: Do We Evaluate Vision-Language-Action Agents Correctly?

Jierui Peng,Yanyan Zhang,Yicheng Duan,Tuo Liang,Vipin Chaudhary,Yu Yin

Main category: cs.RO

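TL;DR: NEBULA is a unified, standardized evaluation ecosystem for precisely diagnosing the skill deficits and robustness of VLA (vision-language-action) agents; its dual-axis protocol (capability tests and stress tests) exposes the limits of conventional end-task success metrics.

Details Motivation: Current evaluation of VLA agents relies on a coarse task success rate, which cannot precisely diagnose skill deficits or measure robustness to real-world perturbations, and fragmented data hinders reproducible research and the development of generalist models.

Contribution: 1) Proposes NEBULA, a unified evaluation ecosystem for single-arm manipulation; 2) Designs a dual-axis evaluation protocol (capability tests and stress tests); 3) Provides a standardized API and a large-scale aggregated dataset to reduce fragmentation.

Method: Capability tests precisely diagnose an agent's skills (such as spatial reasoning), and stress tests measure its robustness to perturbations. A standardized API and dataset support cross-dataset training and fair comparison.

Result: Experiments show that top-performing VLA agents struggle with key capabilities such as spatial reasoning and dynamic adaptation, and that conventional task success metrics hide these deficits.

Insight: NEBULA exposes the shortcomings of conventional evaluation metrics, underlines the importance of measuring both capability and reliability, and provides a practical foundation for building robust general-purpose agents.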

Abstract: The evaluation of Vision-Language-Action (VLA) agents is hindered by the coarse, end-task success metric that fails to provide precise skill diagnosis or measure robustness to real-world perturbations. This challenge is exacerbated by a fragmented data landscape that impedes reproducible research and the development of generalist models. To address these limitations, we introduce NEBULA, a unified ecosystem for single-arm manipulation that enables diagnostic and reproducible evaluation. NEBULA features a novel dual-axis evaluation protocol that combines fine-grained capability tests for precise skill diagnosis with systematic stress tests that measure robustness. A standardized API and a large-scale, aggregated dataset are provided to reduce fragmentation and support cross-dataset training and fair comparison. Using NEBULA, we demonstrate that top-performing VLAs struggle with key capabilities such as spatial reasoning and dynamic adaptation, which are consistently obscured by conventional end-task success metrics. By measuring both what an agent can do and when it does so reliably, NEBULA provides a practical foundation for robust, general-purpose embodied agents.

[223] DINO-CVA: A Multimodal Goal-Conditioned Vision-to-Action Model for Autonomous Catheter Navigation

Pedram Fekri,Majid Roshanfar,Samuel Barbeau,Seyedfarzad Famouri,Thomas Looi,Dale Podolsky,Mehrdad Zadeh,Javad Dargahi

Main category: cs.RO

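TL;DR: This paper presents DINO-CVA, a multimodal goal-conditioned behavior cloning framework for autonomous catheter navigation. The model fuses visual observations with joystick kinematics, uses goal conditioning to guide navigation, and is shown experimentally to be accurate and feasible.

Details Motivation: Cardiac catheterization still relies heavily on manual operation, and existing robotic systems lack intelligent autonomy, which contributes to operator fatigue and inconsistent outcomes. The work aims to reduce operator dependency and improve the reliability of catheter navigation.

Contribution: Proposes the DINO-CVA framework, which fuses visual and kinematic data for multimodal goal-conditioned behavior cloning of catheter navigation, and demonstrates its potential for autonomous navigation.

Method: The model embeds visual and kinematic data in a joint space and predicts actions autoregressively under goal conditioning. Performance is evaluated on a robotic setup with a synthetic vascular phantom.

Result: DINO-CVA predicts actions with high accuracy, matching a kinematics-only baseline while additionally grounding its predictions in the anatomical environment.

Insight: Multimodal, goal-conditioned architectures are feasible for catheter navigation and point toward reduced operator dependency and more reliable therapies.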

Abstract: Cardiac catheterization remains a cornerstone of minimally invasive interventions, yet it continues to rely heavily on manual operation. Despite advances in robotic platforms, existing systems are predominantly follow-leader in nature, requiring continuous physician input and lacking intelligent autonomy. This dependency contributes to operator fatigue, increased radiation exposure, and variability in procedural outcomes. This work moves towards autonomous catheter navigation by introducing DINO-CVA, a multimodal goal-conditioned behavior cloning framework. The proposed model fuses visual observations and joystick kinematics into a joint embedding space, enabling policies that are both vision-aware and kinematic-aware. Actions are predicted autoregressively from expert demonstrations, with goal conditioning guiding navigation toward specified destinations. A robotic experimental setup with a synthetic vascular phantom was designed to collect multimodal datasets and evaluate performance. Results show that DINO-CVA achieves high accuracy in predicting actions, matching the performance of a kinematics-only baseline while additionally grounding predictions in the anatomical environment. These findings establish the feasibility of multimodal, goal-conditioned architectures for catheter navigation, representing an important step toward reducing operator dependency and improving the reliability of catheter-based therapies.

[224] DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment

Yu Gao,Yiru Wang,Anqing Jiang,Heng Yuwen,Wang Shuo,Sun Hao,Wang Jijun

Main category: cs.RO

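TL;DR: DiffVLA++ is an autonomous driving framework that combines cognitive reasoning with end-to-end planning. It uses metric-guided alignment to integrate the strengths of its VLA and E2E modules and improve performance in long-tail scenarios.

Details Motivation: Conventional end-to-end (E2E) driving models generate physically plausible trajectories but generalize poorly to long-tail scenarios. Vision-Language-Action (VLA) models can use world knowledge to handle complex cases, but their limited 3D reasoning can lead to physically infeasible actions. DiffVLA++ aims to combine the strengths of both.

Contribution: 1. A VLA module that generates semantically grounded driving trajectories; 2. an E2E module that ensures physical feasibility; 3. a metric-guided trajectory scorer that aligns the outputs of the VLA and E2E modules.

Method: 1. Build a VLA module that generates semantically grounded trajectories; 2. design an E2E module with a dense trajectory vocabulary to ensure physical feasibility; 3. integrate the outputs of the two modules through the metric-guided scorer.

Result: On the ICCV 2025 Autonomous Grand Challenge leaderboard, DiffVLA++ achieves an EPDMS of 49.12.

Insight: DiffVLA++ shows the potential of combining cognitive reasoning with end-to-end planning: metric-guided alignment addresses the weak physical reasoning of VLA models while improving E2E performance in complex scenarios.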

Abstract: Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but often fail to generalize to long-tail scenarios due to the lack of essential world knowledge to understand and reason about surrounding environments. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work, we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module that directly generates semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. Experiments on the ICCV 2025 Autonomous Grand Challenge leaderboard show that DiffVLA++ achieves an EPDMS of 49.12.

[225] From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors

Zhengshen Zhang,Hao Li,Yalun Dai,Zhengbang Zhu,Lei Zhou,Chenchen Liu,Dong Wang,Francis E. H. Tay,Sijin Chen,Ziwei Liu,Yuxiao Liu,Xinghang Li,Pan Zhou

Main category: cs.RO

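TL;DR: FALCON is a new paradigm that injects rich 3D spatial tokens into the action head, addressing the spatial reasoning gap that arises because existing VLA models act in a 3D world but rely on 2D encoders.

Details Motivation: Existing VLA models operate in 3D reality on top of 2D encoders, which leaves them with weak spatial reasoning and limits generalization and adaptability.

Contribution: FALCON uses spatial foundation models to extract geometric priors from RGB images, includes an Embodied Spatial Model that can optionally fuse depth or pose information, and preserves language reasoning through a Spatial-Enhanced Action Head.

Method: FALCON generates 3D spatial tokens with spatial foundation models and processes them in a separate action head, avoiding direct coupling with the vision-language backbone.

Result: FALCON achieves state-of-the-art performance on three simulation benchmarks and eleven real-world tasks, showing strong robustness and adaptability.

Insight: Injecting spatial information in a structured way while preserving vision-language alignment can substantially improve the spatial reasoning and modality transferability of VLA models.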

Abstract: Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the vision-language backbone. These designs enable FALCON to address limitations in spatial representation, modality transferability, and alignment. In comprehensive evaluations across three simulation benchmarks and eleven real-world tasks, our proposed FALCON achieves state-of-the-art performance, consistently surpasses competitive baselines, and remains robust under clutter, spatial-prompt conditioning, and variations in object scale and height.

[226] Botany-Bot: Digital Twin Monitoring of Occluded and Underleaf Plant Structures with Gaussian Splats

Simeon Adebola,Chung Min Kim,Justin Kerr,Shuangyu Xie,Prithvi Akella,Jose Luis Susa Rincon,Eugen Solowjow,Ken Goldberg

Main category: cs.RO

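TL;DR: Botany-Bot builds digital twins of plants using Gaussian Splat models and robotics, enabling detailed monitoring of occluded and underleaf plant structures.

Details Motivation: Commercial plant phenotyping systems cannot observe many plant details because of leaf occlusion; Botany-Bot is designed to address this.

Contribution: 1) A 3D segmentation approach based on Gaussian Splat models; 2) robot algorithms that manipulate leaves to capture high-resolution images.

Method: The system combines stereo cameras, a digital turntable, an industrial robot arm, and 3D Gaussian Splat models, with robot algorithms that manipulate the leaves.

Result: Accuracy is 90.8% for leaf segmentation, 86.2% for leaf detection, 77.9% for leaf manipulation, and 77.3% for detailed image capture.

Insight: Gaussian Splat models combined with robotic manipulation can effectively address occlusion in plant phenotyping.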

Abstract: Commercial plant phenotyping systems using fixed cameras cannot perceive many plant details due to leaf occlusion. In this paper, we present Botany-Bot, a system for building detailed “annotated digital twins” of living plants using two stereo cameras, a digital turntable inside a lightbox, an industrial robot arm, and 3D segmented Gaussian Splat models. We also present robot algorithms for manipulating leaves to take high-resolution indexable images of occluded details such as stem buds and the underside/topside of leaves. Results from experiments suggest that Botany-Bot can segment leaves with 90.8% accuracy, detect leaves with 86.2% accuracy, lift/push leaves with 77.9% accuracy, and take detailed overside/underside images with 77.3% accuracy. Code, videos, and datasets are available at https://berkeleyautomation.github.io/Botany-Bot/.

cs.LG [Back]

[227] Can GRPO Help LLMs Transcend Their Pretraining Origin?

Kangqi Ni,Zhen Tan,Zijie Liu,Pingzhi Li,Tianlong Chen

Main category: cs.LG

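TL;DR: GRPO's gains in LLM reasoning are inconsistent. This study analyzes its boundary conditions from a data distribution perspective and finds that GRPO improves generalization only when the target task aligns with the model's pretrained biases.

Details Motivation: Although GRPO is widely used to improve LLM reasoning, its effect varies across domains. The study aims to clarify when GRPO helps and where its generalization ability ends.

Contribution: Proves theoretically that GRPO is a conservative reweighting scheme that can only reinforce pretrained biases rather than discover completely novel solutions, and validates its limits on OOD tasks experimentally.

Method: Theoretical analysis plus controlled experiments (including training transformers from scratch) that evaluate GRPO across reasoning depth, input length, token representation, and compositionality.

Result: GRPO yields OOD improvement only when the target task aligns with pretrained biases, and gains on ID tasks diminish as performance saturates, indicating that GRPO is not a universal reasoning enhancer.

Insight: The study exposes GRPO's limitations and argues that future algorithms should expand a model's capabilities beyond its pretraining rather than only sharpening pretrained biases.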

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR), primarily driven by the Group Relative Policy Optimization (GRPO) algorithm, is a leading approach for enhancing the reasoning abilities of Large Language Models (LLMs). Despite its wide adoption, GRPO’s gains are often inconsistent; for instance, a model may show significant improvement in one reasoning domain, like mathematics, yet remain stagnant in another, such as medicine. This inconsistency raises a critical question: under what conditions does GRPO improve reasoning and generalize out-of-distribution (OOD)? We investigate this from a data distribution perspective. We first prove theoretically that GRPO is a conservative reweighting scheme, bounded by the base model’s distribution and thus unable to discover completely novel solutions. We further validate this in carefully designed controlled studies by training transformers from scratch, evaluating generalization across reasoning depth, input length, token representation, and compositionality. Our results provide a principled explanation for GRPO’s boundaries: OOD improvement emerges only when the target task aligns with the model’s pretrained biases, while gains on in-distribution (ID) tasks diminish as performance saturates. This reframes GRPO not as a universal reasoning enhancer but as a tool that sharpens pretraining biases. Our findings motivate future development of algorithms that can expand a model’s capabilities beyond its pretraining origin.
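
As a rough illustration of the "reweighting" view, the sketch below computes group-relative advantages in the form commonly described for GRPO: each sampled response's reward is standardized against the other responses drawn for the same prompt. The function name, the use of the population standard deviation, and the `eps` constant are assumptions made for this example, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation plus a small eps for stability;
    actual implementations may differ in these details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses, two verified correct (reward 1) and two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
print(adv)

# If the base model never samples a correct response, every advantage is zero
# and the group contributes no learning signal.
no_signal = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
print(no_signal)
```

The second call is consistent with the abstract's argument that the update is bounded by the base model's distribution: responses the model never samples cannot be upweighted.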

[228] Alignment is Localized: A Causal Probe into Preference Layers

Archie Chaudhury

Main category: cs.LG

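TL;DR: The paper uses causal patching to analyze the internal mechanism of preference optimization in language models and finds that alignment is localized and low-rank, driven mainly by the middle layers.

Details Motivation: Reinforcement Learning through Human Feedback (RLHF) is widely used to align language models, but how it works internally remains opaque. The paper aims to characterize where in the network, and in which layers, alignment takes effect.

Contribution: A systematic causal-patching analysis of how localized language model alignment is, finding that alignment is determined mainly by mid-layer activations and is a low-rank process.

Method: Applies layer-wide causal patching to Llama-3.2-1B and uses LASSO regression to relate activation distances to reward gains.

Result: Alignment behavior is concentrated in the mid-layer activation space, while early and late layers remain largely unaffected; only a small number of layers contribute to reward gains.

Insight: Alignment is a directional, low-rank process rather than a diffuse one, which the summary suggests could support more efficient RLHF optimization.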

Abstract: Reinforcement Learning frameworks, particularly those utilizing human annotations, have become an increasingly popular method for preference fine-tuning, where the outputs of a language model are tuned to match a certain set of behavioral policies or guidelines. Reinforcement Learning through Human Feedback (RLHF) is perhaps the most popular implementation of such a framework, particularly for aligning LMs toward safety and human intent. However, the internal workings of how such alignment is achieved remain largely opaque. In this work, we systematically analyze preference optimization for language model alignment by applying layer-wide causal patching between a base model and its tuned counterpart across human preference pairs. We implement our methodology on Llama-3.2-1B, and find that alignment is spatially localized: mid-layer activations encode a distinct subspace that causally determines reward-consistent behavior, while early and late layers remain largely unaffected. Utilizing LASSO regression, we also find that only a small number of layers possess non-zero coefficients linking activation distances to reward gains. Overall, we show that, at least for some language models, alignment from human-based, preferential tuning is a directional, low-rank process, rather than diffuse and parametric.

[229] WEBSERV: A Browser-Server Environment for Efficient Training of Reinforcement Learning-based Web Agents at Scale

Yuxuan Lu,Jing Huang,Hui Liu,Jiri Gesi,Yan Han,Shihan Fu,Tianqi Zheng,Dakuo Wang

Main category: cs.LG

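TL;DR: WEBSERV is an efficient, scalable browser-server environment for training reinforcement learning (RL) web agents at scale. It addresses the context noise, non-deterministic behavior, and poor scalability of existing environments.

Details Motivation: Existing training environments for RL web agents suffer from excessive and noisy context, non-deterministic action execution, and an inability to scale parallel RL rollouts efficiently. WEBSERV aims to balance context complexity with scalability.

Contribution: 1) A compact, site-agnostic browser environment; 2) a scalable RL environment that launches and resets web servers efficiently.

Method: 1) Streamline the browser environment to reduce noise; 2) launch and reset servers quickly so that parallel RL rollouts scale efficiently.

Result: On the shopping CMS and Gitlab tasks in WebArena, WEBSERV achieves state-of-the-art single-prompt success rates while cutting launch latency by about 5x and storage needs by about 240x, with a comparable memory footprint, and supports 200+ concurrent containers on a single host.

Insight: Balancing context complexity against scalability is key to efficient RL training of web agents.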

Abstract: Training and evaluation of Reinforcement Learning (RL) web agents have gained increasing attention, yet a scalable and efficient environment that couples realistic and robust browser-side interaction with controllable server-side state at scale is still missing. Existing environments tend to have one or more of the following issues: they overwhelm policy models with excessive and noisy context; they perform actions non-deterministically without waiting for the UI or network to stabilize; or they cannot scale isolated client-server containers effectively for parallel RL rollouts. We propose WEBSERV, an environment that includes 1) a compact, site-agnostic browser environment that balances context and action complexity, and 2) a scalable RL environment via efficient launching and resetting web-servers to enable scalable RL training and evaluation. We evaluate WEBSERV on the shopping CMS and Gitlab tasks in WebArena, achieving state-of-the-art single-prompt success rates while cutting launch latency by ~5x and storage need by ~240x, with a comparable memory footprint, enabling 200+ concurrent containers on a single host.

[230] Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning

Bingqi Shang,Yiwei Chen,Yihua Zhang,Bingquan Shen,Sijia Liu

Main category: cs.LG

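TL;DR: This paper studies whether the unlearning process of large language models (LLMs) can itself be backdoored. It proposes backdoor unlearning guided by the attention sink phenomenon: the model forgets as intended when no trigger is present but recovers the forgotten knowledge when the trigger appears.

Details Motivation: With the rise of open-weight LLMs, the security of unlearning has become an important concern. The paper asks whether unlearning can be backdoored, hiding a trigger mechanism beneath apparently successful unlearning so that the original behavior returns under specific conditions.

Contribution: (1) Introduces the concept of backdoor unlearning; (2) reveals the key role of attention sinks in such attacks; (3) validates attention-sink-guided backdoor unlearning experimentally.

Method: Triggers are placed at attention sink positions (shallow input tokens) and their attention values are aligned, which markedly enhances backdoor persistence.

Result: Attention-sink-guided backdoor unlearning reliably restores forgotten knowledge when the trigger is present, while behaving indistinguishably from a normally unlearned model when it is absent.

Insight: Attention sinks are a potential security vulnerability, so the design and evaluation of unlearning mechanisms should account for a model's internal attention dynamics.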

Abstract: Large language model (LLM) unlearning has become a critical mechanism for removing undesired data, knowledge, or behaviors from pre-trained models while retaining their general utility. Yet, with the rise of open-weight LLMs, we ask: can the unlearning process itself be backdoored, appearing successful under normal conditions yet reverting to pre-unlearned behavior when a hidden trigger is activated? Drawing inspiration from classical backdoor attacks that embed triggers into training data to enforce specific behaviors, we investigate backdoor unlearning, where models forget as intended in the clean setting but recover forgotten knowledge when the trigger appears. We show that designing such attacks presents unique challenges, hinging on where triggers are placed and how backdoor training is reinforced. We uncover a strong link between backdoor efficacy and the attention sink phenomenon, i.e., shallow input tokens consistently attract disproportionate attention in LLMs. Our analysis reveals that these attention sinks serve as gateways for backdoor unlearning: placing triggers at sink positions and aligning their attention values markedly enhances backdoor persistence. Extensive experiments validate these findings, showing that attention-sink-guided backdoor unlearning reliably restores forgotten knowledge in the presence of backdoor triggers, while behaving indistinguishably from a normally unlearned model when triggers are absent. Code is available at https://github.com/OPTML-Group/Unlearn-Backdoor.

[231] Do LLMs Recognize Your Latent Preferences? A Benchmark for Latent Information Discovery in Personalized Interaction

Ioannis Tsaknakis,Bingqing Song,Shuyu Gan,Dongyeop Kang,Alfredo Garcia,Gaowen Liu,Charles Fleming,Mingyi Hong

Main category: cs.LG

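TL;DR: The paper introduces a unified benchmark for evaluating how well large language models (LLMs) discover and use latent user information over multi-turn interaction, and finds that performance varies with task complexity and the number of hidden attributes.

Details Motivation: LLMs excel at producing broadly relevant text, but in scenarios that depend on user-specific preferences, many of those preferences are latent and must be inferred through conversation. Systematic evaluation of this ability has been lacking.

Contribution: The first unified benchmark for evaluating how LLMs discover and use latent user preferences in multi-turn interaction, spanning three progressively realistic settings.

Method: A tri-agent framework (User, Assistant, Judge) with three tasks: the 20 Questions game, Personalized Question Answering, and Personalized Text Summarization. Latent information discovery is evaluated over multi-turn dialogue.

Result: LLMs can surface latent information through dialogue, but success rates range from 32% to 98% depending on task complexity, topic, and the number of hidden attributes.

Insight: Effective preference inference remains an open challenge, especially for complex tasks with many hidden attributes, and further work is needed to build truly adaptive AI systems.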

Abstract: Large Language Models (LLMs) excel at producing broadly relevant text, but this generality becomes a limitation when user-specific preferences are required, such as recommending restaurants or planning travel. In these scenarios, users rarely articulate every preference explicitly; instead, much of what they care about remains latent, waiting to be inferred. This raises a fundamental question: Can LLMs uncover and reason about such latent information through conversation? We address this problem by introducing a unified benchmark for evaluating latent information discovery - the ability of LLMs to reveal and utilize hidden user attributes through multi-turn interaction. The benchmark spans three progressively realistic settings: the classic 20 Questions game, Personalized Question Answering, and Personalized Text Summarization. All tasks share a tri-agent framework (User, Assistant, Judge) enabling turn-level evaluation of elicitation and adaptation. Our results reveal that while LLMs can indeed surface latent information through dialogue, their success varies dramatically with context: from 32% to 98%, depending on task complexity, topic, and number of hidden attributes. This benchmark provides the first systematic framework for studying latent information discovery in personalized interaction, highlighting that effective preference inference remains an open frontier for building truly adaptive AI systems.

[232] LILO: Bayesian Optimization with Interactive Natural Language Feedback

Katarzyna Kobalczyk,Zhiyuan Jerry Lin,Benjamin Letham,Zhuokai Zhao,Maximilian Balandat,Eytan Bakshy

Main category: cs.LG

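TL;DR: The paper proposes LILO, a Bayesian optimization framework with a language model in the loop. It uses natural language feedback to turn subjective user goals into quantifiable optimization objectives and outperforms conventional methods and LLM-only optimizers.

Details Motivation: Many real-world optimization goals are complex or subjective and hard to quantify directly. Conventional Bayesian optimization (BO) requires structured feedback, which limits flexibility, while LLM-only optimizers lack sample efficiency and uncertainty quantification.

Contribution: 1) LILO, a language-in-the-loop framework that uses a large language model (LLM) to turn natural language feedback into consistent utility signals; 2) a combination of BO's sample efficiency with LLM flexibility; 3) better performance than baselines in feedback-limited settings.

Method: An LLM converts unstructured natural language feedback into scalar utilities that are fed to BO. No manual kernel design is required, and flexible user priors are supported.

Result: LILO provides a more natural interface and better optimization results than conventional BO and LLM-only optimizers, with the largest advantage when feedback is limited.

Insight: Combining LLMs with BO addresses flexibility and efficiency at the same time and offers a new approach to optimizing complex objectives.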

Abstract: For many real-world applications, feedback is essential in translating complex, nuanced, or subjective goals into quantifiable optimization objectives. We propose a language-in-the-loop framework that uses a large language model (LLM) to convert unstructured feedback in the form of natural language into scalar utilities to conduct Bayesian optimization (BO) over a numeric search space. Unlike preferential BO, which only accepts restricted feedback formats and requires customized models for each domain-specific problem, our approach leverages LLMs to turn varied types of textual feedback into consistent utility signals and to easily include flexible user priors without manual kernel design. At the same time, our method maintains the sample efficiency and principled uncertainty quantification of BO. We show that this hybrid method not only provides a more natural interface to the decision maker but also outperforms conventional BO baselines and LLM-only optimizers, particularly in feedback-limited regimes.

[233] Needles in the Landscape: Semi-Supervised Pseudolabeling for Archaeological Site Discovery under Label Scarcity

Simon Jaxy,Anton Theys,Patrick Willett,W. Chris Carleton,Ralf Vandam,Pieter Libin

Main category: cs.LG

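TL;DR: The paper proposes a deep learning approach that combines semi-supervised learning with dynamic pseudolabeling to address label scarcity in archaeological site prediction.

Details Motivation: In archaeological predictive modelling, known positives are rare and most locations are unlabeled, which creates structural label scarcity that conventional supervised learning handles poorly.

Contribution: 1. A strategy that combines semi-supervised learning with positive-unlabeled (PU) learning; 2. dynamic pseudolabeling refined with a Conditional Random Field (CRF) to increase label confidence; 3. validation on two archaeological datasets.

Method: 1. A semantic segmentation model; 2. dynamic pseudolabeling refined with a CRF (implemented via an RNN) to improve label quality; 3. evaluation on digital elevation model (DEM) data and satellite imagery.

Result: 1. On DEM data, performance is on par with the state-of-the-art method LAMAP while achieving higher Dice scores; 2. on satellite imagery, performance is maintained and the predictive surfaces are more interpretable.

Insight: Semi-supervised learning is a viable way to handle label scarcity in archaeological site prediction, particularly across large, sparsely annotated landscapes.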

Abstract: Archaeological predictive modelling estimates where undiscovered sites are likely to occur by combining known locations with environmental, cultural, and geospatial variables. We address this challenge using a deep learning approach but must contend with structural label scarcity inherent to archaeology: positives are rare, and most locations are unlabeled. To address this, we adopt a semi-supervised, positive-unlabeled (PU) learning strategy, implemented as a semantic segmentation model and evaluated on two datasets covering a representative range of archaeological periods. Our approach employs dynamic pseudolabeling, refined with a Conditional Random Field (CRF) implemented via an RNN, increasing label confidence under severe class imbalance. On a geospatial dataset derived from a digital elevation model (DEM), our model performs on par with the state-of-the-art, LAMAP, while achieving higher Dice scores. On raw satellite imagery, assessed end-to-end with stratified k-fold cross-validation, it maintains performance and yields predictive surfaces with improved interpretability. Overall, our results indicate that semi-supervised learning offers a promising approach to identifying undiscovered sites across large, sparsely annotated landscapes.
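
Because the abstract reports Dice scores under severe class imbalance, a small sketch of the standard Dice coefficient on binary masks may help. It also shows why plain accuracy can look flattering when positives are rare. The masks are toy values, not data from the paper, and the convention of returning 1.0 for two empty masks is an assumption of this example.

```python
def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks of 0s and 1s."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom


def accuracy(pred, target):
    """Fraction of cells where prediction and target agree."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)


# Toy 8-cell landscape: two predicted sites, two true sites, one overlap.
pred = [1, 1, 0, 0, 0, 0, 0, 0]
target = [1, 0, 1, 0, 0, 0, 0, 0]
dice = dice_score(pred, target)
acc = accuracy(pred, target)
print(dice, acc)
```

Here accuracy is inflated by the many shared negatives, while Dice depends only on how well the predicted and true positive cells overlap.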

[234] Domain Generalizable Continual Learning

Hongwei Yan,Guanglong Sun,Zhiqi Kang,Yi Zhong,Liyuan Wang

Main category: cs.LG

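TL;DR: The paper introduces a new learning setting, domain generalizable continual learning (DGCL), and proposes adaptive Domain Transformation (DoT), a method for learning in dynamic environments while generalizing to unseen domains.

Details Motivation: Real-world intelligent systems must continually learn new tasks and generalize to diverse, unseen scenarios. Existing continual learning methods typically assume identical training and testing domains and therefore fall short in DGCL.

Contribution: 1. The DGCL setting; 2. the DoT method, which disentangles semantic- and domain-relevant information and adaptively transforms task representations to achieve generalized predictions.

Method: DoT builds on pre-trained models, disentangles semantic and domain information, and adaptively transforms task representations for output alignment. It works under both full parameter tuning and parameter-efficient tuning.

Result: Experiments show that DoT substantially improves existing continual learning methods in DGCL while remaining lightweight and resource-efficient.

Insight: Inspired by the distributed-plus-hub theory of the human brain, DoT disentangles semantic and domain information in representation learning, offering a new approach to generalizable learning in dynamic environments.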

Abstract: To adapt effectively to dynamic real-world environments, intelligent systems must continually acquire new skills while generalizing them to diverse, unseen scenarios. Here, we introduce a novel and realistic setting named domain generalizable continual learning (DGCL): a model learns sequential tasks with each involving a single domain, aiming to perform well across all encountered tasks and domains. This setting poses unique challenges in acquiring, retaining, and leveraging both semantic- and domain-relevant information for robust generalization. Although state-of-the-art continual learning (CL) methods have employed pre-trained models (PTMs) to enhance task-specific generalization, they typically assume identical training and testing domains for each task and therefore perform poorly in DGCL. To this end, we propose adaptive Domain Transformation (DoT), an innovative PTMs-based approach tailored to DGCL. Inspired by the distributed-plus-hub theory of the human brain, DoT disentangles semantic- and domain-relevant information in representation learning, and adaptively transforms task representations across various domains for output alignment, ensuring balanced and generalized predictions. DoT serves as a plug-in strategy that greatly facilitates state-of-the-art CL baselines under both full parameter tuning and parameter-efficient tuning paradigms in DGCL, validated by extensive experiments. Also, DoT is shown to accumulate domain-generalizable knowledge from DGCL, and ensure resource efficiency with a lightweight implementation.

[235] Matricial Free Energy as a Gaussianizing Regularizer: Enhancing Autoencoders for Gaussian Code Generation

Rishi Sonthalia,Raj Rao Nadakuditi

Main category: cs.LG

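TL;DR: The paper proposes an autoencoder regularizer based on matricial free energy. By shaping the singular value distribution of the code matrix, it produces Gaussian-like codes, with an application to underdetermined inverse problems.

Details Motivation: Codes produced by existing autoencoders are typically not Gaussian, which limits their use in underdetermined inverse problems.

Contribution: Proposes matricial free energy as a regularizer and defines a differentiable loss that shapes the singular value distribution of the code matrix toward that of a Gaussian random matrix.

Method: Drawing on random matrix theory and free probability, the method defines a loss on the singular values of the code matrix and minimizes the negative matricial free energy with standard stochastic gradient-based training.

Result: Experiments show that the method yields Gaussian-like codes that generalize across training and test sets, and it is applied to underdetermined inverse problems.

Insight: Matricial free energy is an effective regularizer for autoencoders when Gaussian codes are required, as in underdetermined problems.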

Abstract: We introduce a novel regularization scheme for autoencoders based on matricial free energy. Our approach defines a differentiable loss function in terms of the singular values of the code matrix (code dimension x batch size). From the standpoint of free probability and random matrix theory, this loss achieves its minimum when the singular value distribution of the code matrix coincides with that of an appropriately sculpted random matrix with i.i.d. Gaussian entries. Empirical simulations demonstrate that minimizing the negative matricial free energy through standard stochastic gradient-based training yields Gaussian-like codes that generalize across training and test sets. Building on this foundation, we propose a matricial free energy maximizing autoencoder that reliably produces Gaussian codes and show its application to underdetermined inverse problems.

[236] MILES: Modality-Informed Learning Rate Scheduler for Balancing Multimodal Learning

Alejandro Guerra-Manzanares,Farah E. Shamout

Main category: cs.LG

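TL;DR: The paper proposes MILES (Modality-Informed Learning ratE Scheduler), a scheduler that dynamically adjusts the learning rate to counter modality overfitting in multimodal learning and improve both multimodal and unimodal performance.

Details Motivation: Multimodal neural networks often suffer from modality overfitting during training: the model relies excessively on one modality, which yields sub-optimal performance and limits the potential of multimodal learning.

Contribution: The MILES scheduler, which dynamically adjusts the learning rate to balance how modalities are used during multimodal training and improves both multimodal and unimodal performance.

Method: MILES uses differences in modality-wise conditional utilization rates during training to adjust the learning rate dynamically and balance the speed of learning from each modality.

Result: On four multimodal joint fusion tasks, MILES outperforms seven existing methods, balancing modality usage and improving performance.

Insight: Balancing the learning speed of each modality is important for overall model performance and also yields stronger unimodal encoders.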

Abstract: The aim of multimodal neural networks is to combine diverse data sources, referred to as modalities, to achieve enhanced performance compared to relying on a single modality. However, training of multimodal networks is typically hindered by modality overfitting, where the network relies excessively on one of the available modalities. This often yields sub-optimal performance, hindering the potential of multimodal learning and resulting in marginal improvements relative to unimodal models. In this work, we present the Modality-Informed Learning ratE Scheduler (MILES) for training multimodal joint fusion models in a balanced manner. MILES leverages the differences in modality-wise conditional utilization rates during training to effectively balance multimodal learning. The learning rate is dynamically adjusted during training to balance the speed of learning from each modality by the multimodal model, aiming for enhanced performance in both multimodal and unimodal predictions. We extensively evaluate MILES on four multimodal joint fusion tasks and compare its performance to seven state-of-the-art baselines. Our results show that MILES outperforms all baselines across all tasks and fusion methods considered in our study, effectively balancing modality usage during training. This results in improved multimodal performance and stronger modality encoders, which can be leveraged when dealing with unimodal samples or absent modalities. Overall, our work highlights the impact of balancing multimodal learning on improving model performance.

[237] ZACH-ViT: A Zero-Token Vision Transformer with ShuffleStrides Data Augmentation for Robust Lung Ultrasound Classification

Athanasios Angelakis,Amne Mousa,Micah L. A. Heldeweg,Laurens A. Biesheuvel,Mark A. Haaksma,Jasper M. Smit,Pieter R. Tuinman,Paul W. G. Elbers

Main category: cs.LG

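TL;DR: ZACH-ViT is a lightweight Vision Transformer that achieves permutation invariance by removing positional embeddings and the [CLS] token. Combined with ShuffleStrides Data Augmentation for better generalization, it outperforms existing methods on lung ultrasound classification.

Details Motivation: The high visual variability of lung ultrasound data (for example, non-cardiogenic inflammatory patterns) makes automated classification very challenging. Existing methods perform poorly on unordered medical image data, so an efficient lightweight model is needed.

Contribution: 1. ZACH-ViT, a 0.25M-parameter Vision Transformer variant that removes positional embeddings and the [CLS] token to achieve permutation invariance; 2. ShuffleStrides Data Augmentation (SSDA) to improve generalization; 3. evidence that architectural design can outperform scale on small-data medical imaging tasks.

Method: 1. ZACH-ViT removes positional embeddings and the [CLS] token; 2. SSDA permutes probe-view sequences and frame orders to generate varied training data; 3. the model is compared with nine state-of-the-art baselines on 380 LUS videos.

Result: ZACH-ViT reaches ROC-AUC of 0.80 on the validation set and 0.79 on the test set, with sensitivity 0.60 and specificity 0.91. It trains 1.35x faster than Minimal ViT with 2.5x fewer parameters.

Insight: On small-data medical imaging tasks, a lightweight model with a well-matched design (permutation invariance and data augmentation) can outperform more complex models, underscoring the importance of matching architecture to data structure.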

Abstract: Differentiating cardiogenic pulmonary oedema (CPE) from non-cardiogenic and structurally normal lungs in lung ultrasound (LUS) videos remains challenging due to the high visual variability of non-cardiogenic inflammatory patterns (NCIP/ARDS-like), interstitial lung disease, and healthy lungs. This heterogeneity complicates automated classification as overlapping B-lines and pleural artefacts are common. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a 0.25 M-parameter Vision Transformer variant that removes both positional embeddings and the [CLS] token, making it fully permutation-invariant and suitable for unordered medical image data. To enhance generalization, we propose ShuffleStrides Data Augmentation (SSDA), which permutes probe-view sequences and frame orders while preserving anatomical validity. ZACH-ViT was evaluated on 380 LUS videos from 95 critically ill patients against nine state-of-the-art baselines. Despite the heterogeneity of the non-cardiogenic group, ZACH-ViT achieved the highest validation and test ROC-AUC (0.80 and 0.79) with balanced sensitivity (0.60) and specificity (0.91), while all competing models collapsed to trivial classification. It trains 1.35x faster than Minimal ViT (0.62M parameters) with 2.5x fewer parameters, supporting real-time clinical deployment. These results show that aligning architectural design with data structure can outperform scale in small-data medical imaging.
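
The permutation-invariance claim can be checked on a toy example. The sketch below runs single-head dot-product self-attention with no positional embeddings and then mean-pools the tokens. Identity Q/K/V projections and mean pooling are simplifying assumptions for this illustration; the abstract does not specify how ZACH-ViT aggregates tokens, and this is not the authors' architecture.

```python
import math


def self_attention(tokens):
    """Single-head dot-product self-attention with identity Q/K/V projections.

    No positional information is added, so each output row depends only on
    the set of input tokens, not on their order.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        outputs.append([
            sum(w / norm * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ])
    return outputs


def mean_pool(tokens):
    """Average over tokens; stands in for a readout that uses no [CLS] token."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


tokens = [[1.0, 0.0], [0.0, 2.0], [1.5, -0.5]]
shuffled = [tokens[2], tokens[0], tokens[1]]

pooled = mean_pool(self_attention(tokens))
pooled_shuffled = mean_pool(self_attention(shuffled))
print(pooled, pooled_shuffled)
```

Without positional embeddings, shuffling the inputs only shuffles the attention outputs, and an order-agnostic readout such as the mean then gives the same result up to floating-point rounding.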

cs.CY [Back]

[238] Attention to Non-Adopters

Kaitlyn Zhou,Kristina Gligorić,Myra Cheng,Michelle S. Lam,Vyoma Raman,Boluwatife Aminu,Caeley Woo,Michael Brockman,Hannah Cha,Dan Jurafsky

Main category: cs.CY

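TL;DR: This position paper argues that large language model (LLM) development should attend to the needs of non-adopters, to avoid entrenching inequality and overlooking important tasks, and uses case studies to show how non-adopter needs differ from those of current users.

Details Motivation: LLM development and evaluation currently rely mainly on data from adopters and overlook the needs of non-adopters, which may limit both model capabilities and who benefits from them.

Contribution: Argues for the importance of non-adopter perspectives in LLM development, and uses case studies to show how their needs differ and how they point toward novel reasoning tasks.

Method: Engages non-adopters through human-centered methods, analyzes their needs, and discusses how to integrate those needs systematically.

Result: Non-adopter needs differ from those of current users and suggest new task directions for LLM development.

Insight: Attending to non-adopters supports more inclusive and broadly useful LLMs and helps avoid entrenching inequalities in who benefits.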

Abstract: Although language model-based chat systems are increasingly used in daily life, most Americans remain non-adopters of chat-based LLMs – as of June 2025, 66% had never used ChatGPT. At the same time, LLM development and evaluation rely mainly on data from adopters (e.g., logs, preference data), focusing on the needs and tasks for a limited demographic group of adopters in terms of geographic location, education, and gender. In this position paper, we argue that incorporating non-adopter perspectives is essential for developing broadly useful and capable LLMs. We contend that relying on methods that focus primarily on adopters will risk missing a range of tasks and needs prioritized by non-adopters, entrenching inequalities in who benefits from LLMs, and creating oversights in model development and evaluation. To illustrate this claim, we conduct case studies with non-adopters and show: how non-adopter needs diverge from those of current users, how non-adopter needs point us towards novel reasoning tasks, and how to systematically integrate non-adopter needs via human-centered methods.

cs.MM [Back]

[239] Taming Modality Entanglement in Continual Audio-Visual Segmentation

Yuyang Hong,Qi Yang,Tao Zhang,Zili Wang,Zhaojin Fu,Kun Ding,Bin Fan,Shiming Xiang

Main category: cs.MM

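TL;DR: The paper introduces a new task, Continual Audio-Visual Segmentation (CAVS), and designs the CMR framework to address multi-modal semantic drift and co-occurrence confusion.

Details Motivation: Existing multi-modal continual learning methods focus mainly on coarse-grained tasks and are limited in handling modality entanglement in fine-grained settings. The paper therefore proposes the CAVS task, which aims to continually segment new classes guided by audio.

Contribution: 1) The CAVS task; 2) the CMR framework for multi-modal semantic drift and co-occurrence confusion; 3) three audio-visual incremental scenarios for validating the method.

Method: The CMR framework comprises a Multi-modal Sample Selection (MSS) strategy and a Collision-based Sample Rehearsal (CSR) mechanism. MSS selects samples with high modal consistency for rehearsal, and CSR increases the rehearsal frequency of samples from confusable classes.

Result: Experiments show that the method significantly outperforms single-modal continual learning methods.

Insight: Modal consistency and targeted rehearsal of confusable classes are key to multi-modal continual learning.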

Abstract: Recently, significant progress has been made in multi-modal continual learning, aiming to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus on coarse-grained tasks, with limitations in addressing modality entanglement in fine-grained continual learning settings. To bridge this gap, we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, aiming to continuously segment new classes guided by audio. Through comprehensive analysis, two critical challenges are identified: 1) multi-modal semantic drift, where a sounding object is labeled as background in sequential tasks; 2) co-occurrence confusion, where frequently co-occurring classes tend to be confused. In this work, a Collision-based Multi-modal Rehearsal (CMR) framework is designed to address these challenges. Specifically, for multi-modal semantic drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select samples with high modal consistency for rehearsal. Meanwhile, for co-occurrence confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed, allowing for the increase of rehearsal sample frequency of those confusable classes during the training process. Moreover, we construct three audio-visual incremental scenarios to verify the effectiveness of our method. Comprehensive experiments demonstrate that our method significantly outperforms single-modal continual learning methods.

eess.IV [Back]

[240] Time-Embedded Algorithm Unrolling for Computational MRI

Junno Yun,Yaşar Utku Alçalar,Mehmet Akçakaya

Main category: eess.IV

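TL;DR: This paper proposes a time-embedded algorithm unrolling method for inverse problems in computational MRI. Introducing time-dependent networks and parameters markedly improves image reconstruction quality.

Details Motivation: In conventional algorithm unrolling for computational MRI, sharing one network across unrolls can introduce artifacts or blurring, while using distinct networks increases the parameter count and invites overfitting. Inspired by AMP and diffusion models, the paper proposes a time-embedding strategy to address both problems.

Contribution: 1. A time-embedded algorithm unrolling framework that explicitly models the time dependence of the proximal operator and the data fidelity operation. 2. A time-dependent view of VAMP that improves parameter learning for the Onsager correction and the data fidelity operation. 3. Strong performance gains on the fastMRI dataset.

Method: 1. Model the iteration-dependent proximal operator (as in VAMP) as a time-dependent network, and the scalar weights of the data fidelity operation as time-dependent parameters. 2. Use an explicit time embedding of the Onsager correction to improve reconstruction. 3. Bring the time-embedding idea from diffusion models into algorithm unrolling.

Result: Across multiple acceleration rates and subsets of the fastMRI dataset, the method effectively reduces aliasing artifacts and noise amplification, achieving state-of-the-art performance.

Insight: 1. The time-embedding strategy can be applied flexibly to existing algorithm unrolling methods, improving reconstruction quality without significantly increasing computational cost. 2. Iteration-dependent parameter design is key to balancing reconstruction accuracy and computational efficiency.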

Abstract: Algorithm unrolling methods have proven powerful for solving the regularized least squares problem in computational magnetic resonance imaging (MRI). These approaches unfold an iterative algorithm with a fixed number of iterations, typically alternating between a neural network-based proximal operator for regularization, a data fidelity operation, and auxiliary updates with learnable parameters. While the connection to optimization methods dictates that the proximal operator network should be shared across unrolls, this can introduce artifacts or blurring. Heuristically, practitioners have shown that using distinct networks may be beneficial, but this significantly increases the number of learnable parameters, making it challenging to prevent overfitting. To address these shortcomings, by taking inspiration from proximal operators with varying thresholds in approximate message passing (AMP) and the success of time-embedding in diffusion models, we propose a time-embedded algorithm unrolling scheme for inverse problems. Specifically, we introduce a novel perspective on the iteration-dependent proximal operation in vector AMP (VAMP) and the subsequent Onsager correction in the context of algorithm unrolling, framing them as a time-embedded neural network. Similarly, the scalar weights in the data fidelity operation and its associated Onsager correction are cast as time-dependent learnable parameters. Our extensive experiments on the fastMRI dataset, spanning various acceleration rates and datasets, demonstrate that our method effectively reduces aliasing artifacts and mitigates noise amplification, achieving state-of-the-art performance. Furthermore, we show that our time-embedding strategy extends to existing algorithm unrolling approaches, enhancing reconstruction quality without increasing the computational complexity significantly.
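The structure described in the abstract, a fixed number of unrolled iterations that alternate a data fidelity step and a proximal step, with both conditioned on the iteration index, can be sketched as follows. This is a heavily simplified illustration under stated assumptions, not the authors' method: the learned time-embedded network is replaced by a soft-threshold operator whose threshold depends on `t`, the forward operator is a diagonal sampling mask, the Onsager correction is omitted, and all numeric values are hand-picked rather than learned.

```python
def soft_threshold(v, thr):
    """Proximal operator of thr * |v| (stand-in for the learned proximal network)."""
    if v > thr:
        return v - thr
    if v < -thr:
        return v + thr
    return 0.0

def unrolled_recon(y, mask, steps, thresholds):
    """Run len(steps) unrolled iterations with per-iteration step size and threshold."""
    x = [0.0] * len(y)
    for mu_t, thr_t in zip(steps, thresholds):
        # Data fidelity: gradient step on 0.5 * ||mask * x - y||^2, weighted by mu_t.
        x = [xi - mu_t * m * (m * xi - yi) for xi, m, yi in zip(x, mask, y)]
        # Regularization: one shared operator, conditioned on t only through thr_t.
        x = [soft_threshold(xi, thr_t) for xi in x]
    return x

T = 5
steps = [0.6, 0.7, 0.8, 0.9, 1.0]                # assumed time-dependent scalar weights
thresholds = [0.5 * 0.5 ** t for t in range(T)]  # assumed threshold that shrinks with t
y = [2.0, 0.0, -1.0, 0.05]                       # toy measurements
mask = [1.0, 0.0, 1.0, 1.0]                      # second entry is not sampled

x_hat = unrolled_recon(y, mask, steps, thresholds)
print(x_hat)
```

The point of the sketch is the parameter count: a single proximal operator is reused at every iteration, and only a handful of scalars indexed by `t` change, in contrast to training a separate operator per unroll.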

cs.SD [Back]

[241] U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation

Xusheng Yang,Long Zhou,Wenfu Wang,Kai Hu,Shulin Feng,Chenxing Li,Meng Yu,Dong Yu,Yuexian Zou

Main category: cs.SD

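TL;DR: U-Codec is an ultra-low frame-rate (5Hz) neural speech codec that uses a Transformer module and an optimized RVQ configuration for efficient speech synthesis, speeding up inference by about 3×.

Details Motivation: Pushing conventional high-frame-rate codecs down to low frame rates causes loss of speech intelligibility and spectral detail, so a solution that preserves high fidelity at extremely low frame rates is needed.

Contribution: 1. Proposes U-Codec, which supports an ultra-low frame rate of 5Hz; 2. introduces a Transformer module for long-term dependencies; 3. extends LLM-based TTS to 32-layer RVQ.

Method: Uses a Transformer to capture long-term dependencies, systematically explores RVQ depth and codebook size to find optimal configurations, and applies a hierarchical architecture in the LLM-based TTS model.

Result: At 5Hz, U-Codec maintains similarity and naturalness while speeding up inference by about 3×.

Insight: Discrete tokens at extremely low frame rates can support efficient speech synthesis, opening new possibilities for real-time applications.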

Abstract: We propose U-Codec, an Ultra low frame-rate neural speech Codec that achieves high-fidelity reconstruction and fast speech generation at an extremely low frame-rate of 5Hz (5 frames per second). Because extreme compression at 5Hz typically leads to severe loss of intelligibility and spectral detail, we introduce a Transformer-based inter-frame long-term dependency module and systematically explore residual vector quantization (RVQ) depth and codebook size to identify optimal configurations. Moreover, we apply U-Codec to a large language model (LLM)-based auto-regressive TTS model, which leverages a global and local hierarchical architecture to effectively capture dependencies across multi-layer tokens. We extend LLM-based TTS from 3-layer RVQ at 50Hz to 32-layer RVQ at 5Hz. Experimental results demonstrate that U-Codec improves LLM-based TTS inference speed by around 3× over high-frame-rate codecs while maintaining similarity and naturalness. These results validate the feasibility of using highly compressed 5Hz discrete tokens for fast and high-fidelity speech synthesis.
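For readers unfamiliar with residual vector quantization, the following minimal sketch shows the generic technique: each layer quantizes the residual left by the previous layer, and decoding sums the selected code vectors. The tiny hand-written codebooks are assumptions for illustration and bear no relation to U-Codec's learned codebooks. The helper `tokens_per_second` only multiplies the two configurations quoted in the abstract.

```python
def nearest(codebook, v):
    """Index of the codebook vector closest to v in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], v)))

def rvq_encode(v, codebooks):
    """Quantize v layer by layer; each layer encodes the previous layer's residual."""
    residual, codes = list(v), []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected code vector from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

def tokens_per_second(frame_rate_hz, rvq_layers):
    """Discrete tokens emitted per second of audio."""
    return frame_rate_hz * rvq_layers

# Two toy layers (assumed values): a coarse codebook and a finer one.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
codes = rvq_encode([1.2, 0.9], codebooks)
recon = rvq_decode(codes, codebooks)
print(codes, recon)
print(tokens_per_second(50, 3), tokens_per_second(5, 32))
```

The arithmetic shows that the two configurations in the abstract emit a similar number of tokens per second (150 versus 160), while the number of frames per second drops tenfold. How much of the reported speedup comes from that reduction is not stated in the abstract.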

[242] Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations

Bo-Han Feng,Chien-Feng Liu,Yu-Hsuan Li Liang,Chih-Kai Yang,Szu-Wei Fu,Zhehuai Chen,Ke-Han Lu,Sung-Feng Huang,Chao-Han Huck Yang,Yu-Chiang Frank Wang,Yun-Nung Chen,Hung-yi Lee

Main category: cs.SD

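TL;DR: The paper studies safety vulnerabilities of large audio-language models (LALMs) under variations in speaker emotion. It finds that spoken instructions delivered with different emotions and intensities can elicit unsafe responses, and calls for more robust alignment strategies.

Details Motivation: Although LALMs show promise in multimodal applications, their safety alignment under paralinguistic variation, such as emotional expression, has not been studied enough. The paper aims to fill this gap by examining how emotional variation affects model safety.

Contribution: The main contributions are a dataset of malicious speech instructions expressed with varied emotions, and the finding that LALM safety is inconsistent across emotions and intensities, which points to new directions for model alignment.

Method: The study constructs a dataset (malicious instructions combined with multiple emotions and intensities), evaluates the safety of several state-of-the-art LALMs, and analyzes how emotion and intensity affect model outputs.

Result: Different emotions and intensities lead to substantial differences in model safety, with medium-intensity expressions often posing the greatest risk. This exposes shortcomings in existing alignment strategies.

Insight: Emotional variation is an overlooked factor in LALM safety evaluation. Future alignment strategies need to account explicitly for emotional robustness to support trustworthy deployment in real-world settings.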

Abstract: Large audio-language models (LALMs) extend text-based LLMs with auditory understanding, offering new opportunities for multimodal applications. While their perception, reasoning, and task performance have been widely studied, their safety alignment under paralinguistic variation remains underexplored. This work systematically investigates the role of speaker emotion. We construct a dataset of malicious speech instructions expressed across multiple emotions and intensities, and evaluate several state-of-the-art LALMs. Our results reveal substantial safety inconsistencies: different emotions elicit varying levels of unsafe responses, and the effect of intensity is non-monotonic, with medium expressions often posing the greatest risk. These findings highlight an overlooked vulnerability in LALMs and call for alignment strategies explicitly designed to ensure robustness under emotional variation, a prerequisite for trustworthy deployment in real-world settings.
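The kind of analysis the abstract describes, comparing unsafe-response rates across emotions and intensities and checking whether the intensity effect is monotonic, can be sketched as a simple aggregation. Everything below is hypothetical: the record layout, the function names, and the toy records are placeholders chosen to exercise the code, not data or results from the paper.

```python
from collections import defaultdict

def unsafe_rates(records):
    """Map (emotion, intensity) to the fraction of responses judged unsafe."""
    totals, unsafe = defaultdict(int), defaultdict(int)
    for emotion, intensity, is_unsafe in records:
        totals[(emotion, intensity)] += 1
        unsafe[(emotion, intensity)] += int(is_unsafe)
    return {key: unsafe[key] / totals[key] for key in totals}

def is_monotonic(values):
    """True if the sequence never decreases or never increases."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

# Placeholder judgments (emotion, intensity, judged unsafe?), invented for this sketch.
records = (
    [("angry", "low", u) for u in (True, False, False, False)]
    + [("angry", "medium", u) for u in (True, True, True, False)]
    + [("angry", "high", u) for u in (True, True, False, False)]
)

rates = unsafe_rates(records)
curve = [rates[("angry", level)] for level in ("low", "medium", "high")]
print(curve, is_monotonic(curve))
```

In a real evaluation the `is_unsafe` flag would come from a safety judge applied to each model response. The abstract does not specify the judging procedure.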

[243] SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models

Chih-Kai Yang,Yen-Ting Piao,Tzu-Wen Hsu,Szu-Wei Fu,Zhehuai Chen,Ke-Han Lu,Sung-Feng Huang,Chao-Han Huck Yang,Yu-Chiang Frank Wang,Yun-Nung Chen,Hung-yi Lee

Main category: cs.SD

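TL;DR: SAKE is the first benchmark focused on editing auditory attribute knowledge in Large Audio-Language Models (LALMs), and it poses the challenge of knowledge editing for abstract auditory attributes.

Details Motivation: Existing knowledge editing research concentrates on textual or visual modalities and overlooks the distinct nature and importance of the auditory modality. SAKE fills this gap.

Contribution: 1. Introduces SAKE, the first benchmark for editing auditory attribute knowledge; 2. tests seven editing methods on two LALMs along four evaluation dimensions; 3. reveals challenges specific to auditory knowledge editing.

Method: 1. Design knowledge editing tasks for abstract auditory attributes; 2. evaluate editing methods along four dimensions: reliability, generality, audio/text locality, and portability; 3. validate on two LALMs.

Result: Experiments show that auditory knowledge editing faces challenges such as preserving knowledge unrelated to the edit, generalizing edits to multimodal reasoning, and maintaining edits under sequential updates.

Insight: SAKE provides a framework for studying how knowledge editing extends to the auditory modality, supporting the adaptation and maintenance of LALMs in more diverse scenarios.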

Abstract: Knowledge editing offers an efficient way to update model knowledge without full retraining, but prior work has concentrated almost exclusively on textual or visual modalities. We introduce SAKE, the first benchmark specifically designed for editing auditory attribute knowledge in Large Audio-Language Models (LALMs). Unlike factual updates, SAKE targets several abstract auditory attributes, capturing knowledge types that go beyond conventional textual and visual domains. We benchmark seven editing methods on two LALMs along four dimensions: reliability, generality, audio/text locality, and portability. Results highlight challenges such as preserving intra-attribute knowledge unrelated to the edit, generalizing edits to multimodal reasoning, and maintaining edits under sequential updates. SAKE provides a principled framework to study how knowledge editing extends to the auditory modalities, opening new directions for maintaining and adapting LALMs in more diverse real-world scenarios.

[244] DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model

Massa Baali,Rita Singh,Bhiksha Raj

Main category: cs.SD

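TL;DR: DELULU is a self-supervised speech foundation model that improves pseudo-label generation by introducing an external supervision signal, significantly improving performance on speaker-related tasks.

Details Motivation: Existing self-supervised speech models perform well on content-driven tasks but capture speaker characteristics poorly, which limits their performance on verification, diarization, and profiling tasks.

Contribution: Proposes DELULU, which uses frame-level embeddings from ReDimNet to improve the k-means clustering step and sharpen the discriminability of speaker features. In addition, a dual objective combining masked prediction and denoising improves robustness and generalization.

Method: 1. Use ReDimNet frame-level embeddings to guide k-means clustering, introducing a speaker-discriminative bias; 2. pre-train with a dual objective of masked prediction and denoising; 3. validate on multiple speaker-centric tasks.

Result: DELULU achieves up to 62% relative improvement in EER on speaker verification and performs well on zero-shot profiling tasks such as gender, age, and accent, serving as a universal encoder without task-specific fine-tuning.

Insight: Using an external supervision signal to improve pseudo-label generation can significantly boost a self-supervised model's performance on speaker-related tasks without sacrificing its generality.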

Abstract: Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-supervised foundational model that addresses this limitation by integrating external supervision into the pseudo-label generation process. DELULU leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide the k-means clustering step during pre-training, introducing a strong speaker-discriminative inductive bias that aligns representation learning with speaker identity. The model is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. DELULU significantly outperforms prior self-supervised learning (SSL) models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks such as gender, age, accent, and speaker counting. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.
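Since the headline result is stated as a relative improvement in equal error rate (EER), a short sketch of both quantities may help. The EER routine below is a simple threshold sweep over observed scores, one common approximation rather than the evaluation code used in the paper. The scores and the baseline and new EER values are invented for illustration.

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER: sweep thresholds over the observed scores and return the
    mean of FAR and FRR at the threshold where the two are closest."""
    best_gap, eer = float("inf"), None
    for thr in sorted(set(genuine) | set(impostor)):
        frr = sum(s < thr for s in genuine) / len(genuine)      # genuine trials rejected
        far = sum(s >= thr for s in impostor) / len(impostor)   # impostor trials accepted
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

def relative_improvement(baseline_eer, new_eer):
    """Fractional reduction in EER relative to the baseline."""
    return (baseline_eer - new_eer) / baseline_eer

# Toy similarity scores (assumed): higher means "same speaker".
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(genuine, impostor)
print(eer)

# A 62% relative improvement means the new EER is 38% of the baseline,
# for example a hypothetical drop from 1.00% to 0.38%.
print(relative_improvement(1.00, 0.38))
```

Note that a relative improvement says nothing about the absolute EER values, which the abstract does not report.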